Re: Comments on: "Character Set" Considered Harmful

Bob Jung (bobj@netscape.com)
Thu, 20 Apr 95 13:32:35 EDT

I second Amanda's appeal to address the pressing pragmatic issues at hand,
especially the labelling issue.

At 10:31 AM 4/17/95, Amanda Walker wrote:
>...
>[This conversation is getting oddly neo-Platonic for an IETF working group :)]
>...
>I am, rather, concerned with a small set
>of pressing pragmatic issues. Principal among them is simply being able to
>determine unambiguously what characters are being represented in an HTML
>document so that I can display them. This is mostly a labelling issue,
>...
>
>The status quo in this regard is broken. As anyone who has tried to implement
>Japanese support in their browser can confirm, there is a lot of content out
>there whose interpretation cannot be determined unambiguously by software.
>This is bad.

Yes! Labelling is something we need to resolve ASAP. In Netscape's upcoming
releases, our HTTP server can add the MIME charset parameter to the
HTTP Content-Type header:

Content-Type: text/html; charset=iso-2022-jp

and our client will parse the charset parameter and do the corresponding
code conversions and font selections. From the discussions in the http-wg,
this seems to be the direction we're heading for the HTTP spec.
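
As a rough illustration of the client side, here is a minimal Python sketch of
pulling the charset parameter out of a Content-Type value and using it to
decode the body. The function names and the ISO-8859-1 fallback for unlabelled
documents are assumptions made for the example; this is not a description of
Netscape's implementation.

```python
from email.message import Message


def charset_from_content_type(value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased.

    The fallback used for unlabelled documents is an assumption of this sketch.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode raw body bytes using the charset named in the header."""
    return body.decode(charset_from_content_type(content_type))


if __name__ == "__main__":
    raw = "日本語".encode("iso-2022-jp")
    print(decode_body(raw, "text/html; charset=iso-2022-jp"))
```

Font selection would follow from the same label, but that part is
platform-specific and is left out here.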

While this helps content providers get their documents rendered correctly,
we do not see this as a complete solution. We also need a way to label the
encoding within the HTML itself, so that documents are self-labelling and
content developers can add this information more easily.

>To give a concrete example, the Macintosh on which I am typing this message
>can handle multilingual text just fine. At the moment, it has fonts & input
>methods installed for European, Russian, Hebrew, Arabic, and Japanese.

As some of you are probably aware, in Mac files the data is tagged by using a
notion of string runs. Each run can be associated with style info such as the
font used. We could consider a similar concept for HTML to solve the labelling
problem. I'm open to discussion.
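
To make the run idea concrete, here is a small Python sketch of a document
held as a list of labelled runs, each decoded with its own charset. The `Run`
type and `decode_runs` function are hypothetical names for illustration only;
this is neither the Mac text format nor a proposed HTML syntax.

```python
from dataclasses import dataclass


@dataclass
class Run:
    charset: str  # label for this run, e.g. "shift_jis"
    data: bytes   # raw bytes exactly as stored or transmitted


def decode_runs(runs):
    """Decode each run with its own label and join the results."""
    return "".join(run.data.decode(run.charset) for run in runs)


if __name__ == "__main__":
    doc = [
        Run("iso-8859-1", "Café: ".encode("iso-8859-1")),
        Run("shift_jis", "日本語".encode("shift_jis")),
        Run("koi8-r", " и русский".encode("koi8-r")),
    ]
    print(decode_runs(doc))
```

The point of the sketch is only that each run carries its own label, so no
guessing is needed at run boundaries.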

>There are HTML documents in existence that contain content in one or more
>of these.

Almost all HTML I've seen has been in a single encoding.

>All I want right now is some method for determining how to match them up. So
>far, what we do is cheat. ISO 2022 is easy to automatically detect even in
>mislabeled text, and is reasonably popular, so we've started with Japanese.
>There's only so far we can go with clever inferences, though.

And none of these clever techniques is 100% deterministic...
And unfortunately, more and more Japanese Web data is in SJIS...
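
A short Python sketch of why ISO 2022 is the easy case: ISO-2022-JP text is
7-bit and announces its character sets with escape sequences, so it can be
spotted directly, whereas for 8-bit encodings such as SJIS and EUC about all
software can do is try each decoder and see which ones do not fail. The
function names and the candidate list are assumptions for the example, not
anyone's shipping detector.

```python
# Escape sequences that designate a non-ASCII set in ISO-2022-JP.
ISO2022JP_ESCAPES = (
    b"\x1b$@",  # JIS X 0208-1978
    b"\x1b$B",  # JIS X 0208-1983
    b"\x1b(J",  # JIS X 0201 Roman
)


def looks_like_iso2022jp(data):
    """True if data is 7-bit and contains an ISO-2022-JP designator."""
    if any(byte >= 0x80 for byte in data):
        return False
    return any(esc in data for esc in ISO2022JP_ESCAPES)


def candidate_encodings(data, encodings=("iso-2022-jp", "shift_jis", "euc_jp")):
    """Return every encoding in which data decodes without error."""
    candidates = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        candidates.append(name)
    return candidates


if __name__ == "__main__":
    sample = "日本語".encode("shift_jis")
    print(looks_like_iso2022jp(sample), candidate_encodings(sample))
```

Whenever `candidate_encodings` returns more than one name, the software is
back to guessing, which is exactly the case that labelling is meant to remove.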

>I don't mind translating between the transport representation and IS 10646, so
>that the SGML layer only sees a sequence of IS 10646 code points. That's
>simple. What I do mind is endless discussion about the distinctions between
>characters, glyphs, codes, and the essential nature of reality, even though in
>other contexts I may care greatly about such issues. They simply do not
>address the issue at hand (which Gavin's proposal does, as I see it).
>
>I'm not trying to squelch anyone, I just think we're getting a bit far afield.
>
>Amanda Walker
>InterCon Systems Corporation

Regards,
Bob

--
Bob Jung        bobj@netscape.com       +1 415 528-2688, fax +1 415 528-4122
Netscape Communications Corp.   501 E. Middlefield      Mtn View, CA   94041