ISO charsets; Unicode

Richard L. Goerwitz (goer@midway.uchicago.edu)
Mon, 26 Sep 1994 15:54:18 +0100

Has a formal mechanism been considered for specifying various popular
coding standards, such as ISO 8859-7, ISO 8859-8, etc., and (perhaps
off in the future) Unicode?

It might be possible to use SGML entities for every conceivable
character in every conceivable language, but as a practical solution
to a current problem, this seems difficult at best.
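
To give a rough sense of why the entity route gets awkward, here is a
small Python sketch. It assumes Python's standard html.entities table
(the HTML 4 entity names, which postdate this discussion) as a stand-in
for "the available named entities"; the helper to_entity is purely
illustrative and not part of any proposal.

```python
from html.entities import codepoint2name  # HTML 4 names (an assumption, used only for illustration)


def to_entity(ch: str) -> str:
    """Return a named entity reference if the table has one, else a numeric reference."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"


# One Latin, one Greek, and one Hebrew letter.
for ch in ["\u00e1", "\u03b1", "\u05d1"]:
    print(ch, to_entity(ch))
```

Under that assumption the accented Latin letter and the Greek letter
get readable names, while the Hebrew letter falls back to a bare
number, and every single character of running text would have to be
spelled out this way.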

The motivation for this question is essentially this: Several really
exciting developments are being stymied by the Web's largely ASCII/
English-only focus. As I discussed privately with several readers of
this forum, there is, for example, a project afoot (nearly complete)
to create a full lexicon and concordance of the Dead Sea Scrolls. I
imagine a system where users can look up words, and view the original
scrolls as inlined images. The problem is that the DSS are written
in Greek, Aramaic, and Hebrew. Specially hacked clients that can handle
Japanese and a few other languages are only just arriving. No general
solution exists. And (perhaps most importantly) there is nothing in the
HTML(+) descriptions that allows one to specify where text in one
language ends and text in another begins, or to specify what encoding
system is being used for either. The few hacked clients I've seen are
also not really geared for display of arbitrary languages.
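
The encoding half of that problem can be sketched in a few lines of
Python. The codec names below are Python's own labels for ISO 8859-1,
-7, and -8, and decode_all is a hypothetical helper; nothing here is
drawn from an HTML specification. The point is only that an unlabeled
byte is ambiguous.

```python
# One and the same byte, as it might arrive in an unlabeled document.
RAW = bytes([0xE1])

# Python codec names for ISO 8859-1 (Latin), -7 (Greek), and -8 (Hebrew).
CHARSETS = ["iso8859_1", "iso8859_7", "iso8859_8"]


def decode_all(raw: bytes, charsets=CHARSETS) -> dict:
    """Decode the same bytes under each candidate charset."""
    return {name: raw.decode(name) for name in charsets}


for name, text in decode_all(RAW).items():
    print(f"{name}: {text!r} (U+{ord(text):04X})")
```

The same byte comes out as an "a" with an acute accent, a Greek alpha,
or a Hebrew bet depending on which charset the client assumes, so
without some way to declare the encoding a client can only guess.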

The DSS project isn't the only one that appears stymied. There is a
Cushitic etymological database (say that with a mouth full) at the U
of Chicago that's machine readable, and comes replete with a standard
interface. The project head would be happy to plug it into the Web,
but again the Web only knows ASCII.

Another project afoot is a comprehensive Aramaic dictionary. Aramaic
is the language of parts of the biblical books of Daniel and Ezra, and
of a stray verse in Jeremiah. There is a huge corpus of early Christian
literature written in it, as well as several fundamental Jewish
documents like the Talmud.

Then, of course, there's the giant database project called ARTFL, which
essentially attempts to make the entire French literary corpus
available online. It's already here, and tied to the Web. But they have
no standard specs for how to allow users to input things as simple as
an acute accent over an "a". They have an extremely competent staff to
work on such problems - but I wonder: Should this _be_ a problem?

I suppose I shouldn't bend anyone's ear any longer. Suffice it to say
that there are many, many projects being worked on, and many people
working on them. A lot of them simply won't be enhancing the Web in the
near future because the Web isn't (yet) really world-wide (in a cultural
or linguistic sense). Always wanting to bring disciplines together, I'm
led to ask, then:

What ideas have been floated along the lines of making the Web more all-
encompassing, linguistically speaking? Are there any practical solutions
the folks mentioned above could be working on now? Where should I direct
people who have questions about internationalization/multilingualism and
the Web? Can Humanities people help the process along, even if many of
them are not technically oriented?

-> Richard L. Goerwitz
-> goer@mithra-orinst.uchicago.edu