Re: Charsets: Problem statement/requirements?

Anoosh Hosseini (anoosh@gorgan.mti.sgi.com)
Wed, 8 Feb 95 21:07:12 EST

On Feb 8, 6:44pm, Daniel W. Connolly wrote:
> Subject: Charsets: Problem statement/requirements?
>

(deleted some stuff to save bandwidth)

> I'd like to start a discussion on what folks think the problem with
> character sets on the web is, exactly, and what are their requirements
> for a solution. I believe that the recent draft is consistent with a
> comprehensive solution, but I don't believe it is complete (for
> example, there are some HTTP issues, I expect.)
>
>
> Content-Type: text/html; charset=ISO-8859-n
>
> ** How does the server configuration work to get this on the wire?
>
> Russian-capable clients recognize 8859-n, and select appropriate fonts.
>
> ** What happens with existing clients?
> (Mosaic 2.4 will fail to recognize it as text/html at all,
> and put up a "save to file" dialog. This is reasonable/reliable
> behavior, in my book)
>
> ** What should future clients do?
> I can't read russian. Most of the planet can't read russian.
> I won't pay extra for a client that deals with russian.
> It's not cost-effective to require the whole world to support
> russian character sets.
>

Agreed 100%, this is how the world will really be. So what we are
talking about is a generic mechanism to negotiate what is sent down
by the server. The client may be out of date, but if the server
supports, say, CERN's .multi feature, then I as the server would know
that the client does not support language XX and would send down a
Latin1 doc explaining why the user cannot read this document. This is
the proper response, rather than saving unreadable material to disk.

> If I were an engineer at NetScape, for example, I'd cringe at
> the thought of required support for Russian fonts.
>
> The minimum acceptable behaviour, IMHO is the punt-and-save-to-file.
> This is true for _any_ character set. I don't think it's fair
> to say "you must support Latin1" though in practice, I think
> everybody will (or at least everyone will support US-ASCII,
> i.e. ISO-646).
>
> A nice browser might check for ISO-8859-n fonts on your system,
> and use them if available; else, it might warn you "this will
> look funky" and perhaps offer to save to a file.
>
>
> II. Slightly trickier case: Japanese, Korean, and other character sets
> that won't fit in 8 bits. Same story: just use ISO-2022-JP, or
> Unicode-1-1-UTF-8.
>
> As an implementor, by now, I'm getting tired of supporting 147
> different fonts and character encodings. I'm starting to believe in
> Unicode. So I make the internals of my client unicode-based, and I
> distribute unicode fonts with my browser somehow (this is extra-tricky
> on X platforms).
>

I think treating Unicode as a cure-all is an oversimplification.
Unicode is not a 16-bit ASCII where, by providing a 16-bit font, we are
"done". That might be true for many languages (1->1 mapping between
character encoding and glyph), but not for Persian/Arabic.

Second, we may agree to talk Unicode, but that does not require me
to store 16 bits. For example, if I only support Russian in my client,
I will know that you have sent me Russian Unicode, and will map that on
the fly to my 8-bit internal representation and use 8-bit indexed fonts
(don't forget PCs). I as a client should not be forced to support anything
more than Latin1, and I will inform the server of anything I support
beyond that.
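
A minimal sketch of that on-the-fly mapping, in Python. It assumes the
document arrives as UTF-8 and that the client's internal 8-bit
representation is ISO-8859-5; both choices are only for illustration
(KOI8-R would work the same way), and the name to_internal_8bit is
hypothetical.

```python
def to_internal_8bit(wire_bytes: bytes, internal: str = "iso8859_5") -> bytes:
    # Decode the Unicode data received from the server (assumed UTF-8 here).
    text = wire_bytes.decode("utf-8")
    # Map each character to its single-byte code. Characters outside the
    # 8-bit repertoire become "?" instead of aborting the whole document.
    return text.encode(internal, errors="replace")


russian = "Привет".encode("utf-8")
internal = to_internal_8bit(russian)
print(len(russian), len(internal))
```

The client never has to hold 16-bit data beyond the decoding step, and
its existing 8-bit indexed fonts can be used unchanged.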

>
>
> ** Should a client be able to influence the charset that the server
> uses to represent a document?
> My opinion: maybe. Some sort of Accept-Charset: parameter
> might be used to express preferences, but it shouldn't
> be a violation to send some charset that the client didn't
> ask for, as long as you label it correctly.
>

I would hope that future servers would not send down encodings
which the client has not indicated support for. Errors should be
communicated in Latin1 as mentioned before.
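
A rough sketch of the server-side choice I have in mind, in Python. It
assumes the client lists its charsets in something like the
Accept-Charset parameter discussed above, as a plain comma-separated
list. The function choose_response, the variants table and the wording
of the error text are all hypothetical, and any preference weighting is
ignored.

```python
def choose_response(accept_charset: str, variants: dict, doc_name: str):
    # Charsets the client says it supports, lower-cased for comparison.
    # Anything after ";" (parameters) is ignored in this sketch.
    accepted = {
        item.split(";")[0].strip().lower()
        for item in accept_charset.split(",")
        if item.strip()
    }
    # Send the first stored variant whose charset the client announced.
    for charset, body in variants.items():
        if charset.lower() in accepted:
            return charset, body
    # No match: explain the problem in Latin1 instead of sending bytes
    # the client cannot render.
    message = f"{doc_name} is only available in: {', '.join(variants)}."
    return "ISO-8859-1", message.encode("latin-1", errors="replace")


variants = {"ISO-8859-5": b"\xbf\xe0\xd8\xd2\xd5\xe2"}
print(choose_response("iso-8859-1", variants, "index.html"))
```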

>
> IV. Getting Hairy: Hebrew or Arabic. Can the client infer the writing
> direction from the charset? Do we need HTML markup to represent
> language, writing direction, and/or diacritics? Does Unicode solve
> this problem somehow? I'm afraid I'm well beyond my area of expertise
> and understanding at this point.

No current 8-bit standard for Hebrew/Arabic specifies a direction.
Applications often have an "align left" or "align right" mode, which
then determines how the encoding will be rendered. If you travel to the
Middle East, most screens will be right-aligned, because people mainly
type R->L text and only once in a while use an English term (L->R).
Unicode has introduced direction codes to give a hint to the rendering
engine. And in fact this is my point: to do Arabic/Persian/English,
meaning bidirectional text, you need a rendering engine which does
both directional analysis and context analysis for glyph selection.
A Unicode font does not make the client "Arabized". There are tricks
one can play, such as using an align-right tag combined with sending
glyph indexes (rather than character encodings), but this enters the
world of hacking.
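
To make "context analysis for glyph selection" concrete, here is a
deliberately simplified sketch in Python. It is not a real shaping
engine: it ignores diacritics and ligatures such as lam-alef, and it
assumes the glyphs are the contextual forms that Unicode lists in its
Arabic presentation-forms block (U+FE70 onward). The table FORMS and
the function shape are hypothetical names.

```python
import unicodedata

# Build: base letter -> {"isolated": glyph, "final": glyph, ...}
# from the compatibility decompositions of the presentation forms.
FORMS = {}
for cp in range(0xFE70, 0xFEFD):
    parts = unicodedata.decomposition(chr(cp)).split()
    # Keep only single-letter forms; skip ligatures and unassigned codes.
    if len(parts) == 2 and parts[0] in ("<isolated>", "<final>", "<initial>", "<medial>"):
        base = chr(int(parts[1], 16))
        FORMS.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)


def joins_next(ch: str) -> bool:
    # A letter with an initial form can connect to the letter after it.
    return "initial" in FORMS.get(ch, {})


def joins_prev(ch: str) -> bool:
    # A letter with a final form can connect to the letter before it.
    return "final" in FORMS.get(ch, {})


def shape(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        forms = FORMS.get(ch)
        if not forms:
            out.append(ch)  # not an Arabic letter we know: pass through
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        to_prev = joins_next(prev) and joins_prev(ch)
        to_next = joins_next(ch) and joins_prev(nxt)
        if to_prev and to_next:
            key = "medial"
        elif to_prev:
            key = "final"
        elif to_next:
            key = "initial"
        else:
            key = "isolated"
        out.append(forms.get(key, ch))
    return "".join(out)


# BEH ALEF BEH: the same character BEH gets two different glyphs.
print([hex(ord(c)) for c in shape("\u0628\u0627\u0628")])
```

The same character code maps to a different glyph depending on its
neighbours, which is why a font alone is not enough.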

As for bidirectionality, I currently use an algorithm which determines
direction heuristically (for 8-bit encodings with no direction hint).
In the HTML world this would mean evaluating structures as one unit
(such as a group of numbered lists).
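
One possible heuristic of this kind, sketched in Python. This is not
the algorithm I actually use, just an illustration: it assumes the text
is ISO-8859-6 and simply takes a majority vote between right-to-left
and left-to-right letters over the whole unit. The name guess_direction
is hypothetical.

```python
import unicodedata


def guess_direction(raw: bytes, encoding: str = "iso8859_6") -> str:
    # Undefined bytes in the 8-bit set are replaced, not treated as errors.
    text = raw.decode(encoding, errors="replace")
    rtl = ltr = 0
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            rtl += 1
        elif cls == "L":         # Latin and other left-to-right letters
            ltr += 1
    # Digits, spaces and punctuation do not vote; ties fall back to L->R.
    return "rtl" if rtl > ltr else "ltr"


unit = "\u0645\u0631\u062d\u0628\u0627 web".encode("iso8859_6")
print(guess_direction(unit))
```

Passing a whole structure (for example all items of a list) as one unit
keeps its parts from being aligned differently.
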
>
>
> V. Really Messy: Hebrew, Arabic, Japanese, and Chinese in the same
> document, using different writing directions in the same paragraph.
>
> How many applications really require this sort of thing? At this level
> of complexity, aren't we perhaps better off doing the typesetting on
> the server side and sending something like TeX dvi information (or
> perhaps Adobe PDF) across the wire?

This is mainly a rendering problem rather than a communication problem,
which brings up the issue of multi-local (correct term?) versus
multi-lingual clients.

-anoosh
(speaking for myself)