Re: ISO charsets; Unicode

Judith E. Grass (jgrass@CNRI.Reston.VA.US)
Fri, 30 Sep 1994 00:32:06 +0100

Next message: Nathaniel Borenstein: "Re: Languages (was Re: Forms support in clients)"
Previous message: Daniel W. Connolly: "Re: Languages (was Re: Forms support in clients)"
Maybe in reply to: Richard L. Goerwitz: "ISO charsets; Unicode"
Next in thread: Richard L. Goerwitz: "Re: ISO charsets; Unicode"

Re: Transliteration

Some languages are relatively easy to transliterate, e.g. Russian.

Some are really, really difficult: Japanese written in Kanji... where
there is no simple one to one correspondence between a single character
and the transliteration. To transliterate you need context, and you might
even need to actually understand what is written in order to disambiguate
some cases... and maybe even to put any word breaks into
the transcription, since Japanese doesn't usually have any (and doesn't
need any). Japanese is readable in English transcription... and for those
gaijin who don't know a whole lot of kanji, it can be easier. Once your
kanji vocabulary is big enough, the kanji are actually easier to deal with.

BUT: transcription systems even for something like Russian are not simple.
For English speakers there are at least three different systems in
common use, and the choice depends pretty much on the purpose:
    Library of Congress system
    Slavic linguists' system (lots of hacheks and diacritics)
    Third system whose name escapes me (no diacritics, less precise than
    the ones above)

A fourth system has also appeared, based on an encoding in common use
in the former USSR (I think this is KOI-8, but it may be one of its
variants). It pretty much transliterates each Cyrillic character by chopping
off the 8th bit, which yields a rough ASCII transliteration that can be read
directly off the screen and is perfectly understandable and useful. More than
a little bit of email gets transmitted this way, and some of it never gets
transliterated back into Cyrillic.
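
The bit-chopping trick can be sketched in a few lines of Python. This is
only an illustration under assumptions: it takes KOI8-R as the variant in
question (the message above is not sure which one is meant), and the
function name strip_high_bit is made up for the example. One side effect
worth noting is that letter case comes out inverted, because KOI8-R places
lowercase Cyrillic at the byte values that land on uppercase Latin letters
once the top bit is cleared.

```python
def strip_high_bit(text: str, encoding: str = "koi8_r") -> str:
    """Encode text in KOI8-R (assumed variant) and clear bit 8 of every byte.

    The result is 7-bit ASCII. Latin letters, digits and punctuation pass
    through unchanged; Cyrillic letters turn into rough Latin lookalikes
    with their case flipped.
    """
    raw = text.encode(encoding)
    # Masking with 0x7F drops the 8th bit, so every byte is valid ASCII.
    return bytes(b & 0x7F for b in raw).decode("ascii")


print(strip_high_bit("Привет"))             # pRIWET
print(strip_high_bit("Привет").swapcase())  # Priwet
```

A few letters (for example the hard sign, or "sh" and "ch") land on
punctuation rather than on Latin letters, which is part of why the result
is only a rough transliteration.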

A second thing to understand about transcription, beyond the fact that there
may be multiple systems within a language, is that the transcription system
is frequently different for different languages. The reason you listen to
ballets with music by "Tchaikovsky" is that his name came to us via the
French. If it had come via some English-speaking Slavic scholar, it might
have been "Chajkovskij" (maybe "C-hachek" rather than "Ch", though).
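
To make the "several systems, several spellings" point concrete, here is a
small Python sketch that runs the same Russian surname through two
transliteration tables. The tables are deliberately partial: they cover
only the letters in this one name, and the names ALA_LC, SCHOLARLY and
transliterate are invented for the example. The mappings follow my
understanding of the Library of Congress and scholarly conventions and
should be checked against the published tables before being relied on.

```python
# Partial, illustrative tables: only the letters needed for one surname.
ALA_LC = {"ч": "ch", "а": "a", "й": "ĭ", "к": "k",
          "о": "o", "в": "v", "с": "s", "и": "i"}
SCHOLARLY = {"ч": "č", "а": "a", "й": "j", "к": "k",
             "о": "o", "в": "v", "с": "s", "и": "i"}


def transliterate(word: str, table: dict) -> str:
    """Map each Cyrillic letter through the table, keeping an initial capital."""
    out = "".join(table[ch] for ch in word.lower())
    return out.capitalize() if word[:1].isupper() else out


name = "Чайковский"
print(transliterate(name, ALA_LC))     # Library of Congress style
print(transliterate(name, SCHOLARLY))  # scholarly style, with the hachek
```

Neither result matches the spelling on a concert program, which is the
point: the familiar form came through yet another language's conventions.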

For a real good time, look at Korean. A fascinating writing system
that I suspect would give a machine transliteration system a run for
its money, although this is way out of the range of languages that I
have studied.

An additional related point: the English-Arabic dictionary is one thing,
but how about Armenian-Russian or Arabic-Chinese? One objection I have heard
from the Russians and Balts that I have spoken to is that even the attempts
to standardize on expanded character sets have tended to ignore THESE kinds
of mixtures, showing a kind of Western Europe fixation that does not solve
THEIR problems.

-- Judy Grass, CNRI resident ex-slavicist
