Re: Charsets: Problem statement/requirements?

Gavin Nicol (gtn@ebt.com)
Thu, 9 Feb 95 09:28:13 EST

>OK... so Larry M's edits go into the 2.0 RFC, which "solves" the
>charset problem -- i.e. the 2.0 RFC says fairly carefully what happens

I would prefer "postpones" or "places a bandaid on"...

>I'd like to start a discussion on what folks think the problem with
>character sets on the web is, exactly, and what are their requirements

I'd like to refer everyone back to the paper
"http://www.acl.lanl.gov/HTML_WG/html-wg-94q4.messages/0576.html". In
fact, I'll post an updated version early next week, containing a more
complete solution, with deployment strategies.

> Russian-capable clients recognize 8859-n, and select appropriate fonts.
> ** What happens with existing clients?
> (Mosaic 2.4 will fail to recognize it as text/html at all,
> and put to "save to file" dialog. This is reasonable/reliable
> behavior, in my book)

Yes, though other answers (transcription, transliteration, machine
translation) are also possible (not perfect though...).

>** What should future clients do?
> I can't read russian. Most of the planet can't read russian.
> I won't pay extra for a client that deals with russian.
> It's not cost-effective to require the whole world to support
> russian character sets.

I tend to agree, though there are two ways of solving this problem:
not supporting Russian at all, or supporting a single character set in
which most current languages can be handled. The latter is not a
great deal harder than supporting ASCII.

>II. Slightly trickier case: Japanese, Korean, and other character sets
>that won't fit in 8 bits. Same story: just use ISO-2022-JP, or
>Unicode-1-1-UTF-8.

Same story. Don't support them at all, or support a single character
set that handles most current languages in an admirable manner.

>As an implementor, by now, I'm getting tired of supporting 147
>different fonts and character encodings. I'm starting to believe in
>Unicode. So I make the internals of my client unicode-based, and I
>distribute unicode fonts with my browser somehow (this is extra-tricky
>on X platforms).

X fonts are not all that hard. I have built my own more than once, and
used them as a base for a Unicode-based windowing system I'm writing
(as a little hack on the side).

There is an inexorable chain of logic that will lead you to Unicode if
you wish to have multilingual support in an SGML-based system. Indeed,
for almost *any* system, supporting Unicode is *far* cheaper than
supporting a potentially unlimited number of character sets and
encodings. Why burden the client with this support? Place the burden
on the suppliers of information to supply it in a manner fit for
consumption (though using my little diagrams one could easily
implement a browser capable of supporting *any* character set and
encoding).

>*** Could somebody post some sort of survey on the state of the art
>in Unicode font support? What would it take, for example, for NCSA
>Mosaic or Netscape to support Unicode fonts on X, Mac, and Windows
>platforms?

I would say I could do an X11 version of Mosaic that supported Unicode
in 2 weeks (if I could work on it full time). I'm not sure about
Windows and Mac, but I believe they should also not be overly
difficult (I'm not qualified to talk about that...).

>I'd sure like it if folks would quit sending ISO-2022-JP, big5, and
>all these crazy encodings and just use Unicode.

Go read my paper. There is a need to support native encodings, but
there is a more pressing need to have a lingua franca.

>** Should a client be able to influence the charset that the server
>uses to represent a document?
> My opinion: maybe. Some sort of Accept-Charset: parameter
> might be used to express preferences, but it shouldn't
> be a violation to send some charset that the client didn't
> ask for, as long as you label it correctly.

I disagree. Given the ease of table-driven conversion to Unicode (the
tables are publicly available, waiting for inventive minds to put them
to use), I cannot see why a server cannot be required to send non-ISO
8859-1 data in Unicode. Given the ease with which Unicode support can
be added to WWW browsers, I cannot see a reason for not supporting
it. With those two, we eliminate the case where the server and client
*cannot* exchange data.
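
A minimal sketch of what such table-driven conversion could look like on
the server side, in Python. The table below covers only ASCII and the
main Cyrillic block of ISO 8859-5 (an illustrative subset, not the full
published table), and the names TABLE, to_unicode and wire_charset are
hypothetical. The wire_charset policy is one reading of the proposal
above: send ISO 8859-1 when the text fits, otherwise Unicode, labeled
here with the charset name used in the quoted message.

```python
# Partial ISO 8859-5 -> Unicode table (illustrative subset only).
TABLE = {b: b for b in range(0x80)}                        # ASCII maps to itself
TABLE.update({0xB0 + i: 0x0410 + i for i in range(0x40)})  # Cyrillic A..ya


def to_unicode(data: bytes, table=TABLE) -> str:
    """Convert legacy-encoded bytes to Unicode text by table lookup.

    Bytes missing from the table become U+FFFD (replacement character).
    """
    return "".join(chr(table.get(b, 0xFFFD)) for b in data)


def wire_charset(text: str) -> str:
    """Hypothetical server policy: Latin-1 when it fits, else Unicode."""
    try:
        text.encode("iso-8859-1")
        return "ISO-8859-1"
    except UnicodeEncodeError:
        return "UNICODE-1-1-UTF-8"
```

With a complete table per source charset, the same lookup loop serves
every 8-bit encoding; only the table changes.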

>As UTF-8 is sort of "ASCII-compatible," lazy clients can just show
>some sort of "this might look funky" dialog and display the stuff
>assuming ISO-646 or ISO-8859-1.

UTF-8 also has a bad side: it requires 3 bytes (2 in UCS-2) for each
Kanji. That is unacceptable. We *must* be able to use UCS-2 as well, and
the HTTP people have agreed to loosen the MIME requirements to allow
this.
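
A quick way to check the size trade-off, using Python's built-in codecs
as a stand-in (an assumption: they implement today's UTF-8 and UTF-16,
where a Kanji in the Basic Multilingual Plane takes 3 bytes in UTF-8;
the 6-byte forms of the original UTF-8 design were reserved for much
larger code values). The helper name sizes is hypothetical.

```python
def sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UCS-2 byte count) for BMP-only text.

    UTF-16 big-endian without a byte-order mark is byte-for-byte the
    same as UCS-2 as long as every character is in the BMP.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


print(sizes("漢字"))     # (6, 4): Kanji cost 3 bytes each in UTF-8, 2 in UCS-2
print(sizes("charset"))  # (7, 14): ASCII text doubles in size under UCS-2
```

The trade-off runs in both directions, which is an argument for
allowing both encodings rather than mandating one.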

>IV. Getting Hairy: Hebrew or Arabic. Can the client infer the writing
>direction from the charset? Do we need HTML markup to represent
>language, writing direction, and/or diacritics? Does Unicode solve
>this problem somehow? I'm afraid I'm well beyond my area of expertise
>and understanding at this point.

In most cases, it can be understood using Unicode, but it is desirable
to have high-level tags in HTML (and I personally believe an encoding
needs to be defined which includes language hints, because we should be
thinking not only of HTML here, but also SGML). I have already
proposed one such encoding, which can be decoded at the lower level or
parsed within SGML.
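
As a rough illustration of how writing direction can be inferred from
Unicode alone, the sketch below takes the direction of the first
strongly directional character in a string, using the character
properties shipped in Python's unicodedata module. The function name
base_direction is hypothetical, and this first-strong-character rule is
only a simplification; it does not resolve mixed-direction runs, which
is where higher-level markup would help.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew-style or Arabic-style letter
            return "rtl"
        if bidi == "L":           # left-to-right letter
            return "ltr"
    return "ltr"                  # no strong character: default to left-to-right


print(base_direction("שלום"))       # rtl
print(base_direction("hello"))      # ltr
print(base_direction("123 مرحبا"))  # rtl (digits and spaces are not strong)
```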

>V. Really Messy: Hebrew, Arabic, Japanese, and Chinese in the same
>document, using different writing directions in the same paragraph.
>
>How many applications really require this sort of thing? At this level
>of complexity, aren't we perhaps better off doing the typesetting on
>the server side and sending something like TeX dvi information (or
>perhaps Adobe PDF) across the wire?

While this will be a minority case, it will certainly be necessary at
some time. Imagine a comparison of writing systems, or a course
teaching Arabic to Chinese students.

>As far as I can tell, the charset= parameter monkey-business allowed
>by Larry M's edits enables applications through III (rules for
>constructing an appropriate SGML declaration and document entity for
>parsing could be clarified in the spec, but I think it works). I'm not
>sure IV and V can be expressed in HTML as specified.

The biggest problem with this is, again, that it requires browser
writers to get it right. This is *much* easier to do with a single
character set. Also, it does not support multilingual documents (other
than things like JIS).
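
To give a feel for what "getting it right" involves on the client side,
here is a hedged Python sketch of dispatching on the charset= parameter.
The function name decode_body is hypothetical, the ISO 8859-1 default
for unlabeled documents is an assumption, and returning None stands in
for the "save to file" fallback mentioned earlier. Every charset a
server may send needs a decoder behind the lookup, which is the burden
a single character set avoids.

```python
import codecs
from email.message import Message


def decode_body(content_type: str, body: bytes):
    """Decode a response body according to its Content-Type header.

    Returns the decoded text, or None when the charset is unknown
    (the caller could then offer to save the raw bytes to a file).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # Assumed default when no charset parameter is present.
    charset = msg.get_content_charset() or "iso-8859-1"
    try:
        codecs.lookup(charset)
    except LookupError:
        return None
    return body.decode(charset, errors="replace")
```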

>So I don't see any "character set problem" left in the HTML 2.0 spec.

Except that it doesn't supply a *standard* way of processing non-ISO
8859-1 documents (and III isn't handled well either), and has a
roman-centric view of the world.

One other nit: browsers are supposed to ignore tags they don't
recognise. The current HTML spec doesn't allow people to use anything
other than ASCII for *NMCHAR, which seems somewhat arbitrary.