Re: ISO charsets; Unicode

Richard L. Goerwitz (goer@midway.uchicago.edu)
Wed, 28 Sep 1994 05:02:05 +0100

Next message: Nathaniel Borenstein: "Re: Forms support in clients"
Previous message: Dave Raggett: "Re: Forms support in clients"
Maybe in reply to: Richard L. Goerwitz: "ISO charsets; Unicode"
Next in thread: Stavros Macrakis: "Re: ISO charsets; Unicode"

> The LANG attribute is essential for handling text which reads right
> to left rather than left to right....
>
>Actually, all that is needed is a unique identification of each
>presentation character as right-to-left or left-to-right. If a viewer
>encounters the logical sequence of letters Arabic-J Arabic-M Arabic-L,
>is presenting them using Arabic script, and has an Arabic font to
>present it in, it should display the glyphs...

I have to agree with Dave here, though you are persuasive enough,
I'll admit. The basic point is that various coding schemes overlap.
You can't assume that everyone will jump on the Unicode bandwagon
right away. In some contexts, 8-bit characters will always be with
us. So we are left with only one option - a LANG attribute, plus
some other attribute designating the encoding scheme used.
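
To make the overlap concrete, here is a minimal Python sketch. It is an
illustration only, not part of any proposal: the byte value 0xE4 and the
three ISO 8859 parts are arbitrary choices. It shows that one and the
same 8-bit value names a different character under each coding scheme,
so the bytes alone cannot tell a client which script is meant.

```python
import unicodedata

# The same 8-bit value, decoded under three ISO 8859 parts.
# The byte 0xE4 is an arbitrary example chosen for illustration.
RAW = bytes([0xE4])

def decode_under(charsets, raw=RAW):
    """Return {charset: Unicode character name} for the same raw bytes."""
    return {cs: unicodedata.name(raw.decode(cs)) for cs in charsets}

names = decode_under(["iso8859_1", "iso8859_6", "iso8859_8"])
for charset, name in names.items():
    print(charset, "->", name)
```

Without a label for the encoding scheme, a client has no way to choose
among these readings.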

Just in theoretical terms, it's more pleasing to talk about languages
than characters, anyway. After all, I can write a glyph any way I
want. What determines how the glyph relates to other glyphs is the
script system it belongs to.

> Final-Arabic-L Median-Arabic-M Initial-Arabic-J

Arabic is quite beautiful, isn't it?
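
The glyph sequence quoted above can be derived mechanically from the
logical letters. The following Python sketch is a rough illustration
under a loud simplifying assumption: every letter in the word joins on
both sides, which holds for jeem, meem and lam but not for Arabic
letters in general. The function name shape_word is hypothetical, and
the Unicode "presentation form" characters are used here only as a
convenient way to name the glyphs.

```python
import unicodedata

def shape_word(letters):
    """Pick a contextual form for each letter of one word (logical order).

    Simplifying assumption: every letter joins on both sides. That is
    true of JEEM, MEEM and LAM, but not of all Arabic letters.
    """
    last = len(letters) - 1
    shaped = []
    for i, letter in enumerate(letters):
        if last == 0:
            form = "ISOLATED"
        elif i == 0:
            form = "INITIAL"
        elif i == last:
            form = "FINAL"
        else:
            form = "MEDIAL"
        # Look the glyph up by its Unicode character name.
        shaped.append(unicodedata.lookup(f"ARABIC LETTER {letter} {form} FORM"))
    return shaped

logical = shape_word(["JEEM", "MEEM", "LAM"])  # order in which the word is read
visual = list(reversed(logical))               # left-to-right order on the page
visual_names = [unicodedata.name(ch) for ch in visual]
for name in visual_names:
    print(name)
```

The document stores only the three logical letters; both the choice of
form and the reversal happen at display time.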

>Note that this requires two kinds of information: first, that Arabic
>uses distinct glyphs for letters depending on adjacent letters, and
>secondly, that Arabic script is written right-to-left.
>
>If however the viewer cannot display Arabic script, or if the user
>prefers Latin script (perhaps s/he doesn't even read Arabic script,
>but is consulting the etymology of a word that comes from Arabic in a
>dictionary), it may well choose to present it in transliteration as
>"jml" in that order.

This is a really thoughtful point, and frankly it had not occurred
to me before. There are, indeed, international standards for
transliterating Arabic, as for many other languages. Your idea,
though, is not practical because there aren't always one-to-one
correspondences. Take, for example, the classical Hebrew shwa. How
do we render it in English? First of all, it is a diacritic.
Secondly, it is pronounced differently in different contexts -
sometimes as nothing at all. It's a bit like rendering English
"wine" in a foreign script. Do we transliterate the final -e?

Unfortunately, transliteration requires more than a simple mapping
of one charset to another. Knowledge of the underlying language is
required. So I vote that we stick with a LANG attribute. If a
client runs into Arabic, and can't display Arabic, then it's out of
luck. I don't think that automatic transliteration into Latin
script is practical for enough cases to warrant building it into
the clients along with everything else we're proposing.
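
A small Python sketch shows where a plain character table breaks down.
Everything in it is hypothetical: the table, the function name
candidates, and the Latin values chosen do not follow any particular
transliteration standard. The table lists every plausible rendering of
each character; the Hebrew sheva gets two, a vowel or nothing at all.

```python
from itertools import product

# Hypothetical table: each character maps to its possible Latin renderings.
TABLE = {
    "\u062c": ["j"],      # ARABIC LETTER JEEM
    "\u0645": ["m"],      # ARABIC LETTER MEEM
    "\u0644": ["l"],      # ARABIC LETTER LAM
    "\u05dc": ["l"],      # HEBREW LETTER LAMED
    "\u05b0": ["e", ""],  # HEBREW POINT SHEVA: voiced, or silent
}

def candidates(text):
    """Return every transliteration the table allows for text."""
    options = [TABLE[ch] for ch in text]
    return ["".join(choice) for choice in product(*options)]

print(candidates("\u062c\u0645\u0644"))  # one candidate: the table suffices
print(candidates("\u05dc\u05b0"))        # two candidates: the table cannot decide
```

Choosing among the candidates takes knowledge of the language, which is
exactly what a charset-to-charset mapping lacks.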

>The algorithm for displaying mixed left-to-right and right-to-left
>glyphs is pretty straightforward and is presented in the Unicode
>documents. There is NO EXCUSE for using presentation order in HTML
>documents.

Hear, hear!
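
For a sense of how text stored in logical order gets displayed, here is
a deliberately simplified Python sketch. It is not the algorithm from
the Unicode documents: it assumes a left-to-right base direction, lets
neutral characters (spaces, punctuation, digits) inherit the direction
of the preceding strong character, and ignores nesting, mirroring and
number handling. The names directions and visual_order are hypothetical.

```python
import unicodedata
from itertools import groupby

RTL_CLASSES = {"R", "AL"}  # Unicode bidi classes for right-to-left letters

def directions(text):
    """Tag each character 'L' or 'R'; neutrals inherit the previous tag."""
    current = "L"  # assumed base direction: left to right
    tags = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in RTL_CLASSES:
            current = "R"
        elif cls == "L":
            current = "L"
        tags.append(current)
    return tags

def visual_order(text):
    """Reverse each right-to-left run, leaving left-to-right runs alone."""
    tagged = zip(directions(text), text)
    pieces = []
    for direction, group in groupby(tagged, key=lambda pair: pair[0]):
        run = "".join(ch for _, ch in group)
        pieces.append(run[::-1] if direction == "R" else run)
    return "".join(pieces)

stored = "ab \u062c\u0645\u0644"   # Latin text, then jeem meem lam
displayed = visual_order(stored)   # Arabic run comes out as lam meem jeem
```

The stored document never changes; only the display step reorders.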

I'm really amazed at how much people here know about so many
different things.

Richard Goerwitz
goer@midway.uchicago.edu
