Re: Charsets: Problem statement/requirements?

Terry Allen (terry@ora.com)
Wed, 8 Feb 95 19:52:10 EST

Thanks for stirring the pot, Dan.

| Russian-capable clients recognize 8859-n, and select appropriate fonts.
| ** What happens with existing clients?
| (Mosaic 2.4 will fail to recognize it as text/html at all,
| and put up a "save to file" dialog. This is reasonable/reliable
| behavior, in my book)

Yes, although it may not be the only suitable behavior.

| ** What should future clients do?
| I can't read russian. Most of the planet can't read russian.
| I won't pay extra for a client that deals with russian.
| It's not cost-effective to require the whole world to support
| russian character sets.
| If I were an engineer at NetScape, for example, I'd cringe at
| the thought of required support for Russian fonts.

It's a simple alphabet, and although I don't read much Russian
myself, I believe there aren't any ligatures to mess you up. It
reads from left to right. So is support for Cyrillic much harder
than switching font tables (or whatever the relevant piece is)?

I can't say the same for all charsets, though; there will be some
that are not trivial to support. However, I think your argument
fails on another point: it *is* cost-effective to some degree to
construct a browser that can eat any ISO charset, so as to be able
to sell it to a wider market, without customization.

| The minimum acceptable behaviour, IMHO is the punt-and-save-to-file.
| This is true for _any_ character set. I don't think it's fair
| to say "you must support Latin1" though in practice, I think
| everybody will (or at least everyone will support US-ASCII,
| i.e. ISO-646).

Yes, you must support Latin 1. We just said so in HTML 2.0. It's
supposed to be current practice.

| A nice browser might check for ISO-8859-n fonts on your system,
| and use them if available; else, it might warn you "this will
| look funky" and perhaps offer to save to a file.

I'd like that behavior much better. Also, "will" should be "may", because
you can write ASCII in any of these charsets.
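
A minimal sketch of the "may" case, in Python. It assumes an ISO 8859-n body, where every byte at or below 0x7F means the same thing as in ASCII; 7-bit encodings such as ISO-2022-JP would need a different check. The name needs_font_warning is hypothetical, not anything from the spec.

```python
def needs_font_warning(body: bytes) -> bool:
    """Return True if the body contains any byte outside 7-bit ASCII.

    Assumption: the body is labeled with an ISO 8859-n charset, so a
    body with no high bytes renders correctly with any Latin font.
    """
    return any(b > 0x7F for b in body)


# The same markup, labeled ISO-8859-5, with and without Cyrillic text.
ascii_only = "<p>Hello</p>".encode("iso8859_5")
cyrillic = "<p>\u041f\u0440\u0438\u0432\u0435\u0442</p>".encode("iso8859_5")

print(needs_font_warning(ascii_only))
print(needs_font_warning(cyrillic))
```

A browser that lacked the right font could skip the warning entirely for the first body and show it only for the second.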

| II. Slightly trickier case: Japanese, Korean, and other character sets
| that won't fit in 8 bits. Same story: just use ISO-2022-JP, or
| Unicode-1-1-UTF-8.

Any other charset in which the characters in the RCS (used for markup)
are in the same places as the ISO-8859 charsets could work just
as outlined above, couldn't it?
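
One way to test that condition, sketched in Python. MARKUP_CHARS is a hypothetical sample of markup characters, not the full reference concrete syntax, and the check is necessary rather than sufficient: an encoding with escape sequences can pass it and still reuse those byte values for other characters.

```python
import string

# Hypothetical sample of characters used for markup (an assumption,
# not the complete reference concrete syntax).
MARKUP_CHARS = "<>&;/=\"'!-#" + string.ascii_letters + string.digits


def markup_compatible(charset: str) -> bool:
    """Return True if the charset encodes the sample markup characters
    to the same single bytes as ASCII does."""
    try:
        return MARKUP_CHARS.encode(charset) == MARKUP_CHARS.encode("ascii")
    except (UnicodeEncodeError, LookupError):
        # Character not representable, or charset unknown to Python.
        return False


for name in ("iso8859_5", "utf_8", "cp037", "utf_16"):
    print(name, markup_compatible(name))
```

Here cp037 (an EBCDIC code page) and utf_16 stand in for charsets that put the markup characters somewhere else.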

However, I agree that as a practical matter Unicode may be a
reasonable short-term solution, as you outline below. (Gavin,
here's some public support.)

| As an implementor, by now, I'm getting tired of supporting 147
| different fonts and character encodings. I'm starting to believe in

Best leave fonts out of this (well constructed) argument at this point.
The main issue is charsets; distributing fonts is going to be
necessary, but for other reasons too, and it's a side issue here.

| Unicode. So I make the internals of my client unicode-based, and I
| distribute unicode fonts with my browser somehow (this is extra-tricky
| on X platforms).
|
| *** Could somebody post some sort of survey on the state
| of the art in Unicode font support?
| What would it take, for example, for NCSA Mosaic or Netscape
| to support Unicode fonts on X, Mac, and Windows platforms?
| ***
|
| I'd sure like it if folks would quit sending ISO-2022-JP, big5, and
| all these crazy encodings and just use Unicode.

And if they won't, and insist on making browsers that handle them
AND the ISO Latin 1 you use yourself? You might give up and buy
a Chinese or Korean browser (or browser technology).

| ** Should a client be able to influence the charset that the server
| uses to represent a document?
| My opinion: maybe. Some sort of Accept-Charset: parameter
| might be used to express preferences, but it shouldn't
| be a violation to send some charset that the client didn't
| ask for, as long as you label it correctly.

Agreed.
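
A rough sketch of that negotiation, in Python. Accept-Charset is only a proposal in the quoted text, so the header syntax assumed here (a comma-separated list in preference order, with any ";" parameters ignored) and the function names are hypothetical.

```python
def choose_charset(accept_charset: str, available: list, default: str) -> str:
    """Pick the first client-preferred charset the server can supply.

    If nothing matches, fall back to the server's default: the client's
    list is a preference, not a requirement.
    """
    wanted = [
        item.split(";")[0].strip().lower()
        for item in accept_charset.split(",")
        if item.strip()
    ]
    have = {name.lower(): name for name in available}
    for name in wanted:
        if name in have:
            return have[name]
    return default


def content_type(charset: str) -> str:
    """Label the body with the charset that was actually sent."""
    return f"text/html; charset={charset}"


server_has = ["ISO-8859-1", "UNICODE-1-1-UTF-8"]
picked = choose_charset("ISO-8859-5, unicode-1-1-utf-8", server_has, "ISO-8859-1")
print(content_type(picked))
```

The point is the last step: whatever charset goes out, asked for or not, the label has to match it.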

| III. A little messier: French, Russian, and Japanese in the same
| document.
|
| Easy, by now: just use Unicode-1-1-UTF-8, and hope the clients grok.

Or cheat. I'm not recommending this, but it would work: use a LANG
att to shift among ISO 8859 charsets in the SGML markup. For this
limited set, no harm would befall; it's not a general solution, though.

| As UTF-8 is sort of "ASCII-compatible," lazy clients can just show
| some sort of "this might look funky" dialog and display the stuff
| assuming ISO-646 or ISO-8859-1.
|
| IV. Getting Hairy: Hebrew or Arabic. Can the client infer the writing
| direction from the charset?

Yes, for the Arabic part of the Arabic ISO 8859 charset. If the user of
that charset wants to include some words in Latin characters, hmmm,
I think the browser could still infer the correct direction. Any
practitioners know the answer?
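
For what it's worth, here is a sketch of how a browser might do the inferring, in Python. It leans on Unicode's per-character direction classes as a stand-in, which is my assumption and not something the charset label itself carries; the function name is hypothetical.

```python
import unicodedata


def directions(body: bytes, charset: str) -> set:
    """Return the set of writing directions found in the decoded body.

    Characters with bidirectional class "R" or "AL" count as
    right-to-left, class "L" as left-to-right; digits, spaces, and
    punctuation are ignored.
    """
    found = set()
    for ch in body.decode(charset):
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            found.add("rtl")
        elif bidi == "L":
            found.add("ltr")
    return found


arabic = "\u0633\u0644\u0627\u0645".encode("iso8859_6")
mixed = "\u0633\u0644\u0627\u0645 HTML".encode("iso8859_6")

print(directions(arabic, "iso8859_6"))
print(directions(mixed, "iso8859_6"))
```

So the charset tells you which characters are possible, and the characters themselves tell you which runs go which way, Latin words included.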

| Do we need HTML markup to represent
| language, writing direction, and/or diacritics? Does Unicode solve
| this problem somehow? I'm afraid I'm well beyond my area of expertise
| and understanding at this point.

*Especially* with Unicode, you need to know the language. Again,
the LANG att will do the job. Unless you want to do vertical,
instead of L-R and R-L, I suspect that direction is indicated sufficiently
by the charset. Can't think of a diacritics problem except for
people such as linguists (or myself, as a transliterator) who need
arbitrary diacritics. Do you want to tackle that as part of this
problem?

| V. Really Messy: Hebrew, Arabic, Japanese, and Chinese in the same
| document, using different writing directions in the same paragraph.

But only two directions.

| How many applications really require this sort of thing? At this level
| of complexity, aren't we perhaps better off doing the typesetting on
| the server side and sending something like TeX dvi information (or
| perhaps Adobe PDF) across the wire?

No. SGML is much more compact. If we support all those languages,
we have already done the work to support them intermixed. And
there is a lot of mixed-direction text around.

| As far as I can tell, the charset= parameter monkey-business allowed
| by Larry M's edits enables applications through III (rules for

Did I miss something? (maybe I did) How is III supported by the spec now?

| constructing an appropriate SGML declaration and document entity for
| parsing could be clarified in the spec, but I think it works). I'm not
| sure IV and V can be expressed in HTML as specified.

Nope. You have to go back to using the SGML decl supplied by the
originator (which can be used to construct the relevant MIME
info).

| So I don't see any "character set problem" left in the HTML 2.0 spec.
| There's the question of deployment, conformance, support and all that,
| (and maybe some language-lawyering) but the spec lays out a groundwork
| for interoperability, until you get into writing system directions and
| such. For that sort of complexity, I'd expect to use some other SGML
| application (read: some other DTD) or some other format altogether,
| like Adobe PDF or TeX dvi.

I'd expect to use SGML per 8879. The DTD is irrelevant. In this
story of "when two worlds collide" HTML 2.0 throws out SGML's way
of indicating charsets in favor of MIME's. But haven't you
established (in that very useful post reposted earlier today)
that for my SGML app to be able to handle an SGML entity,
MIME really needs to tell me only what the charset of the SGML
decl is?

-- 
Terry Allen  (terry@ora.com)   O'Reilly & Associates, Inc.
Editor, Digital Media Group    101 Morris St.
			       Sebastopol, Calif., 95472
monthly column at:  http://gnn.com/meta/imedia/webworks/allen/

A Davenport Group sponsor. For information on the Davenport Group see ftp://ftp.ora.com/pub/davenport/README.html or http://www.ora.com/davenport/README.html