Yeah... what he said.
These are some of the issues behind my resistance to making
ISO10646 _the_ document character set for HTML 2.0. In some ways, it
boils down to: we just don't have experience with this beast.
Gavin Nicol (and others) have experience with this beast, and he's
trying to share it with us in his paper:
    The Multilingual World Wide Web
    Thu Apr 27 03:11:08 1995
    http://www.ebt.com:8080/docs/multilingual-www.html
I'm encouraged by the broad range of issues discussed in the paper,
but I am unable to evaluate the specifics of his proposed solution,
because his terminology seems inconsistent to me:
For example:
    Character Code
        An integer which uniquely identifies a character within a
        character repertoire.
    Character Repertoire
        A set of characters used together. Meanings are defined for
        each character, and possibly for control sequences.
How does an integer (a character code) identify an element of a set (a
character repertoire)? Sets have no "natural" order or index. To
assign a number to each element of a set is to make a function, which
I refer to as a "coded character set."
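To make the distinction concrete, here is a minimal Python sketch. The
names `repertoire`, `ascii_like`, `shifted` and `repertoire_of` are made
up for illustration; they come neither from Gavin's paper nor from any
standard. The only point is that a set carries no numbering until some
function supplies one.

```python
# A character repertoire: just a set, with no order and no index.
repertoire = frozenset({"a", "b", "c"})

# A coded character set: a function (here a dict) from integers to
# characters. The numbers below are illustrative assumptions.
ascii_like = {0x61: "a", 0x62: "b", 0x63: "c"}
shifted = {1: "a", 2: "b", 3: "c"}


def repertoire_of(coded_character_set):
    """Return the set of characters that a coded character set covers."""
    return frozenset(coded_character_set.values())


# Two different coded character sets can cover the same repertoire,
# so an integer identifies a character only relative to one of them.
print(repertoire_of(ascii_like) == repertoire)
print(repertoire_of(shifted) == repertoire)
print(ascii_like == shifted)
```

Under this reading, a "character code" is an element of the domain of
one particular function, not something that can point into the bare set.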
I have had feedback that the recent (significantly revised) version of
my "'Character Set' considered Harmful" draft is an effective
discussion of the issues:
    "Character Set" Considered Harmful
    Tue May 2 06:06:12 1995
    http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html
    aka:
    ftp://ds.internic.net/internet-drafts/draft-ietf-html-charset-harmful-00.txt
    May 2, 1995
Regarding ISO-2022-JP...
> Encoding is one thing, glyphs are another. I need glyphs to render.
> If I can encode Hindi in the document charset 10646 using iso-2022-jp,
> which I have been led to believe I can do,
                    ^^^
                    misled.
> or will any encoding of any 10646 content
> using iso-2022-jp be limited somehow to the Japanese portion
> (if there is such a concept) of 10646?
That's more like it, at least the way I understand it. The ISO-2022-JP
character encoding scheme "spans" a certain character repertoire:
something like the 96 ASCII chars + a bunch of Japanese characters.
Each of those characters also has a home in ISO10646; i.e. the
repertoire of ISO-2022-JP is a subset of the repertoire of ISO10646.
Apparently, there are even some declarations about subsets of
ISO10646. I'm sure Glenn can fill in the details.
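The subset relation can be checked by experiment. The sketch below
assumes a modern Python whose standard codecs include `iso2022_jp`, and
it relies on Python strings being sequences of Unicode code points (which
line up with ISO10646). The helper name `in_repertoire` and the sample
strings are mine, chosen only for illustration.

```python
def in_repertoire(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample strings, written as escapes so the source stays plain ASCII.
japanese = "\u65e5\u672c\u8a9e"                  # "Japanese" in kanji
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"   # "Hindi" in Devanagari

# ASCII and Japanese text fall inside the ISO-2022-JP repertoire.
print(in_repertoire("HTML", "iso2022_jp"))
print(in_repertoire(japanese, "iso2022_jp"))

# Devanagari does not, although an encoding that spans the whole
# repertoire (UTF-8 here) can carry it.
print(in_repertoire(hindi, "iso2022_jp"))
print(in_repertoire(hindi, "utf-8"))
```

So the answer to the Hindi question is the same as above: the characters
exist in the document character set, but this particular encoding cannot
reach them.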
How this works with HTTP and real world clients is something I'd like
to see. I'm not convinced Accept-Charset is sufficiently powerful _or_
sufficiently simple.
Dan