Re: Charsets: Problem statement/requirements?

Luke ~{B7?M~} (ylu@ccwf.cc.utexas.edu)
Fri, 10 Feb 95 16:00:21 EST

On Fri, 10 Feb 1995, Gavin Nicol wrote:

> >I think this is better: <lang enc="iso-8859-1">....</lang> and <lang
> >enc="iso-whatever">...</lang> etc.
>
> I notice a suspicious lack of anything other than iso in the above...

Unintentional and irrelevant.

> >You can even include unicode which is iso-? this way.
>
> Incorrect. You can include an iso-* compatible set of byte values that
> can be mapped to Unicode characters.

How is that incorrect? Once you define an escape sequence for a
charset/encoding scheme you can call up a specific module to handle that
specific charset once you detect the sequence, be it Unicode or whatnot.
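To make the dispatch idea concrete, here is a minimal sketch in Python. The DECODERS table, the decode_tagged name, and the exact tag syntax are assumptions made for illustration only, not part of any proposal in this thread. It also assumes the payload bytes never happen to contain the literal bytes "</lang>", which a real design would have to deal with.

```python
import re

# Hypothetical registry: encoding label -> function that turns raw bytes into text.
DECODERS = {
    "iso-8859-1": lambda raw: raw.decode("iso-8859-1"),
    "iso-8859-7": lambda raw: raw.decode("iso-8859-7"),
    "utf-16-be": lambda raw: raw.decode("utf-16-be"),
}

# Matches <lang enc="...">payload</lang>; the payload is treated as opaque bytes.
LANG_TAG = re.compile(rb'<lang enc="([^"]+)">(.*?)</lang>', re.DOTALL)


def decode_tagged(data: bytes, default: str = "iso-8859-1") -> str:
    """Decode untagged bytes with the default charset and each tagged
    span with the decoder named in its enc attribute."""
    out = []
    pos = 0
    for match in LANG_TAG.finditer(data):
        # Bytes before the tag belong to the default charset.
        out.append(DECODERS[default](data[pos:match.start()]))
        enc = match.group(1).decode("ascii").lower()
        # Hand the payload to the module registered for this charset.
        out.append(DECODERS[enc](match.group(2)))
        pos = match.end()
    out.append(DECODERS[default](data[pos:]))
    return "".join(out)
```

An unknown enc value raises KeyError here; what a browser should do in that case is a separate question.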

> >I think it's not necessary to differentiate languages using the same
> >encoding scheme (e.g. French and German). One reason to
> >differentiate particular languages is to facilitate automatic
> >translation. But I think if a translator can't figure out which
> >language by looking at the raw bytes of a known encoding scheme, it's
> >pretty much useless.
>
> And how is the translator supposed to do this if the encoding, or the
> character set don't provide the information? You seem to have
> contradicted yourself in a single paragraph...

As a vanilla/plain/ordinary mortal I can tell German from French just by
looking at text in the same charset, and yet I can't really read these
languages, let alone be a translator. I mean, if an automatic translator
can't figure that out using some fuzzy logic, it just ain't up to snuff
:). Anyway, nothing prevents you from adding something like
<lang enc=".." lc="fr"> to give enough of a hint to someone/something that
really needs it.
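For what it's worth, even a crude guesser gets surprisingly far. The following is only a sketch under loud assumptions: the two word lists are tiny hand-picked samples, and guess_language is a made-up name. A real translator front end would use something far more robust.

```python
# Small, hand-picked samples of very common words (an assumption, not a
# complete or authoritative list).
FRENCH = {"le", "la", "les", "et", "est", "un", "une", "des", "que", "dans"}
GERMAN = {"der", "die", "das", "und", "ist", "ein", "eine", "nicht", "mit", "den"}


def guess_language(text: str) -> str:
    """Return "fr", "de", or "unknown" by counting common words."""
    words = text.lower().split()
    fr = sum(word in FRENCH for word in words)
    de = sum(word in GERMAN for word in words)
    if fr == de:
        return "unknown"  # no evidence either way
    return "fr" if fr > de else "de"
```

When the guess comes back "unknown", an explicit hint such as lc="fr" is what would settle it.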

> >Hate to say me too, but I happen to like this tagged idea.
>
> Tags are fine. They have a place, but not to allow <emph>multiple
> character sets</>.

Multiple charsets are not necessarily a bad thing. Usually only one major
charset is used in one environment. Multilingual people can use multiple
charsets. I don't see a problem.

> >The more I think about Unicode (since 1991) the more I think it
> >sucks. It's not an extensible scheme (fixed 2-byte scheme) and
> >requires too many changes to the current software.
>
> Bollocks. The conversion to Unicode is usually less expensive, and
> less difficult than supporting a myriad character sets and encodings
> <emph>because the framework is the same, except that in Unicode, you
> only have a single character set to worry about.</> Unicode has a
> finite set of codes, that is true. So does any character set. Unicode
> covers a very large portion of the languages currently used in an
> admirable manner though.

Again, multiple charsets are not necessarily a bad thing, although it makes
sense to use the same charset/encoding scheme for _similar_ languages.
Depending on what framework you are using, it's not _that_ difficult to
develop a system to handle multiple charsets in an efficient and modular
manner. Every plug-in module handles one charset. If you only understand
English, it makes no sense to include megabytes of Chinese, Japanese, and
Tibetan etc. fonts in your system whether you use Unicode or whatnot. If
you happen to encounter a language you _don't_ understand, it makes
_absolutely no difference_ whether the browser correctly maps it to a
specific native font, or says "hey, man, I can't display this in <some
language>, since <some language> module is not installed" or "hey, man, I
can't find the font for <some_language>", or just displays some rubbish on
the screen...
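A rough sketch of the plug-in arrangement, in Python. The install and render names and the wording of the fallback message are hypothetical; the point is only that a charset with no installed module degrades to a message instead of requiring every font and table up front.

```python
from typing import Callable, Dict

# Hypothetical plug-in table: one handler per charset, installed on demand.
_modules: Dict[str, Callable[[bytes], str]] = {}


def install(charset: str, handler: Callable[[bytes], str]) -> None:
    """Register the module that handles one charset."""
    _modules[charset.lower()] = handler


def render(charset: str, raw: bytes) -> str:
    """Use the installed module if there is one, else say so."""
    handler = _modules.get(charset.lower())
    if handler is None:
        return f"[cannot display this text: {charset} module is not installed]"
    return handler(raw)


# An English-only user installs just the one module they need.
install("iso-8859-1", lambda raw: raw.decode("iso-8859-1"))
```

With only that module installed, render("big5", ...) returns the "not installed" message and nothing else in the system has to know about Big5.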

> >Converting a language with an originally 1-byte charset to 2-byte
> >Unicode is a waste of resources. I like the tagged idea, i.e. use a
> >sequence of escape codes to indicate the change of language settings.
> >This approach is very extensible; it even works with Klingon, Minbarian,
> >etc. languages which might as well use an m.n-byte charset. ISO would
> >then stand for Interstellar Standard Org. :-)
>
> The extensibility is the very thing that kills it. Read ISO 2022, and
> weep.
>
> >Seriously, HZ code, which is the most widely used 7-bit Chinese encoding
> >scheme on USENET, uses this idea, i.e. it uses ~{ and ~} as escape
> >sequences. It mixes very well with the iso-8859-1 character set. This
> >can't be said of other Chinese encoding schemes, e.g. Big5, that
> >don't use this approach.
>
> Sure. So do EUC, ISO-2022-JP, and a myriad other encodings of
> various character sets. The "myriad" and "various" in the last
> sentence should be taken very seriously...

If you don't read Japanese, these should not bother you. The "myriad" and
"various" justify a common standard for _similar_ languages to reduce
_unnecessary_ charsets. But they do _not_ justify Unicode, since
different charsets are _necessary_ for totally different languages. See
below.
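For readers who have not seen HZ, here is a minimal sketch of how the ~{ and ~} escapes quoted above work. The hz_to_gb2312 function is written for illustration only and ignores the line-continuation escape. It assumes well-formed input; Python's own "hz" codec is the thing to use in practice.

```python
def hz_to_gb2312(data: bytes) -> bytes:
    """Turn 7-bit HZ text into 8-bit GB2312 bytes (simplified sketch)."""
    out = bytearray()
    i = 0
    in_gb = False  # True between ~{ and ~}
    while i < len(data):
        pair = data[i:i + 2]
        if pair == b"~{":            # switch to two-byte GB mode
            in_gb = True
            i += 2
        elif pair == b"~}":          # switch back to ASCII mode
            in_gb = False
            i += 2
        elif not in_gb and pair == b"~~":
            out.append(0x7E)         # escaped literal tilde
            i += 2
        elif in_gb:
            # Each character is two 7-bit bytes; set the high bit on both.
            out += bytes(b | 0x80 for b in pair)
            i += 2
        else:
            out.append(data[i])      # plain ASCII passes through
            i += 1
    return bytes(out)
```

For example, the bytes "Hello ~{VPND~} world" come out as ASCII text with one two-character Chinese word in the middle, and the whole message stays 7-bit clean on the wire.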

> >I see the problem, i.e. you can't mix languages nicely within a UR*, if
> >we use the <lang..> tags. Maybe we should try a different escape sequence
> >or use &lt;lang..&gt; as an equivalent?
>
> I see another problem: name a single character set that allows multiple
> languages....

Again, multiple charsets are not necessarily a bad thing. Another bad thing
about Unicode -- I'll beat it to death -- or any attempt to use a single
charset to cover all kinds of languages: it's totally ignorant of the
internal, possibly dynamic, structure of a particular charset. It just maps
one of the 65536 numbers to a specific char, which is pretty dumb. It
assumes a language is static in terms of a charset, which might be true for
alphabetic languages, but it's certainly _false_ for a glyphic language
like Chinese.

People create new Chinese characters and deprecate old characters all the
time, according to certain rules. That is, every single Chinese character
may consist of several sub-characters (pian1pang2 and bu4shou3). Some
contribute to the form of a character, some to the meaning and some to the
sound of the entire character, depending on the _spatial positions_ and
combination of these sub-characters. In a sense, an alphabetic language is
one-dimensional, while Chinese is 2-D. A single Chinese _character_ can be
a _word_ with a meaning of its own.

To illustrate this for English speakers: when you read "Alice in
Wonderland", there is one chapter (maybe several) where the author made up
quite a few nonexistent words in those cute little verses, and you can't
help but understand and smile. There are similar techniques in Chinese
literature that make up nonexistent but meaningful characters; some of
them stay in the mainstream and become part of the frequently used
charset. I've read one of the better Chinese translations of "Alice in
Wonderland" and was amazed at the translator's ability to catch the
essence of the verse using the above character-making techniques. You
can't possibly enjoy it with an encoding scheme like Unicode. This is a
somewhat extreme example, but you get the idea.

A good encoding scheme for Chinese would reflect the finer structure of a
character and facilitate creation/rendering of new characters. Such
schemes _do exist_, though they are far from perfect. Much research is
needed in this area.
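To give a feel for what "reflecting the finer structure" could mean, here is a purely hypothetical sketch in Python. The Composite type, the layout labels, and the components function are all invented for illustration and do not correspond to any existing scheme. A character is described by the spatial arrangement of its sub-characters, so a newly coined character needs no new code number.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Composite:
    """A character described by layout plus sub-characters (hypothetical)."""
    layout: str  # invented labels such as "left-right" or "top-bottom"
    parts: Tuple[Union[str, "Composite"], ...]


def components(ch: Union[str, Composite]) -> List[str]:
    """List the basic sub-characters of a description, in reading order."""
    if isinstance(ch, str):
        return [ch]
    found: List[str] = []
    for part in ch.parts:
        found.extend(components(part))
    return found


# Two existing characters described by their structure.
hao = Composite("left-right", ("女", "子"))   # 好
ming = Composite("left-right", ("日", "月"))  # 明

# A made-up character is just another description built from known parts.
coined = Composite("top-bottom", (ming, "子"))
```

A renderer for such a scheme would have to lay out the parts itself, which is where most of the open research problems mentioned above would sit.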

Unicode is a dead-end, IMHO; simply rushing to Unicode is _not_ wise, period.

ACA, TIA

__Luke

--
Luke Y. Lu
mailto:ylu@mail.utexas.edu/
http://www.utexas.edu/~lyl/