Re: Charsets: Problem statement/requirements?

Albert Lunde (Albert-Lunde@nwu.edu)
Thu, 9 Feb 95 13:05:07 EST

> OK... so Larry M's edits go into the 2.0 RFC, which "solves" the
> charset problem -- i.e. the 2.0 RFC says fairly carefully what happens
> if everybody uses ISO-8859-1, while not completely disallowing other
> character sets and encodings.

First, I'd like to say we should get 2.0 out the door (regardless of
whether the way it addresses the charset problem is ideal).

My impression (without re-reading Larry's prior posts or the new
draft) is that the effect of these edits will be to make the
web "safe" ;) for documents in a single character code which
is not too much unlike US-ASCII.

I'd like us in 2.x to address more general multi-lingual documents.

* One issue which I think is part of what Gavin is proposing is the
character code for the SGML DTD and such. We are currently defining
this in terms of ISO-8859-1 with all the important markup characters
being in US-ASCII. What we seem to be doing is assuming that
implementers will infer appropriate SGML definitions for the
actual charset by mapping into that character set. This works for
the markup characters if the code is a superset of US-ASCII, but
leaves some issues, like the classes of characters, undefined in
some areas.

It seems possible that we could define the DTD in terms of Unicode
and map/project this onto other character sets as required *even
if Unicode is not used as the transport charset*. This may be what
Gavin is suggesting with ERCS; I'm not sure, but I think it is
worth thinking about. This could accommodate use of many character
codes for transport, though codes that did not contain all the
markup characters might have to be converted to Unicode.
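
To make the "superset of US-ASCII" condition concrete, here is a rough
Python sketch that tests whether a transport charset encodes the markup
characters as the same single bytes US-ASCII uses. The character list and
the function name are my own assumptions for illustration, not anything
taken from the draft or from Gavin's proposal.

```python
# Assumed set of characters that HTML/SGML markup depends on (illustrative only).
MARKUP_CHARS = "<>&;/=\"'#!-"


def markup_is_ascii_compatible(codec: str) -> bool:
    """Return True if every markup character encodes to its US-ASCII byte."""
    for ch in MARKUP_CHARS:
        try:
            encoded = ch.encode(codec)
        except UnicodeEncodeError:
            # The charset has no code for this markup character at all.
            return False
        if encoded != ch.encode("ascii"):
            # The character exists but at a different code or width.
            return False
    return True


if __name__ == "__main__":
    for name in ("latin-1", "koi8_r", "utf-8", "cp037", "utf_16_be"):
        print(name, markup_is_ascii_compatible(name))
```

Charsets that fail this test (an EBCDIC code page, say) are the ones that
would need to be converted before the markup could be parsed.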

* I think we should define some form of SGML markup to be used
to indicate changes in language. A low-level mechanism for hinting
at language changes (like the use of Unicode private codes) might
use less bandwidth but be harder to implement across character
sets. I think it is useful to be able to mark up language changes
even in ISO-8859-1 text.

There may be several ways to do this.

- The TEI approach seems to be to add a LANG attribute which can
be attached to containers, e.g. <p lang=EN>. We could do this,
but I think we'd need to define a new container of arbitrary
length and unrelated to paragraph structure to mark up words
and other chunks of text.

- We could define a new tag to indicate changes in the current
language (and/or writing system?), e.g. <lang lang="en">
(Is this too redundant?) or <ws lang="en">

(I'm leaning toward this idea.)

- We could even combine the two and allow container attributes to override
the current default language.
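
A rough Python sketch of how the combined approach might resolve the
language of each run of text. The <lang> tag is the hypothetical one
suggested above; the class name, the "en" default, and the rule that a
container's LANG attribute wins until that container closes are my own
assumptions. It also assumes containers are explicitly closed, which real
HTML does not guarantee.

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Collect (language, text) runs from markup using both mechanisms."""

    def __init__(self, default="en"):
        super().__init__()
        self.current = default  # changed by the hypothetical <lang> tag
        self.open_containers = []  # stack of (tag, lang-or-None)
        self.runs = []

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if tag == "lang":
            # Non-container switch: changes the current default language.
            if lang:
                self.current = lang
        else:
            # Container: remember its LANG attribute (if any) until it closes.
            self.open_containers.append((tag, lang))

    def handle_endtag(self, tag):
        # Close the innermost matching container and anything nested in it.
        for i in range(len(self.open_containers) - 1, -1, -1):
            if self.open_containers[i][0] == tag:
                del self.open_containers[i:]
                break

    def effective_language(self):
        # The innermost container with a LANG attribute overrides the default.
        for _tag, lang in reversed(self.open_containers):
            if lang:
                return lang
        return self.current

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.effective_language(), text))


if __name__ == "__main__":
    tracker = LangTracker()
    tracker.feed('<p>Hello <lang lang="fr">bonjour '
                 '<em lang="de">hallo</em> encore</p>')
    tracker.close()
    print(tracker.runs)
```

Whether a <lang> switch inside an overriding container should take effect
immediately or only after the container closes is exactly the kind of
question a real proposal would have to settle; the sketch picks the latter.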

In any case, I think the language attribute should have the same
allowed values as the language/dialect in the HTTP Accept-Language and
Content-Language headers.

If we define a new tag we might consider if there are other attributes
that could be used to further specify the writing system. (I was looking
at the stuff in the Text Encoding Initiative and thinking it would
be nice to be able to put in an HREF to one of their writing system defs
but it doesn't look like they can be decoded to a usable form by
a program. So we might swipe some ideas but not their whole scheme.)

If we define a new tag we might consider allowing it to occur in
the <head> to indicate the primary/default language (or even a list
of included languages).