Re: format nego in HTML/10646?

Dan Connolly (connolly@w3.org)
Sat, 6 May 95 19:44:26 EDT

Terry Allen writes:
> Charsets smaller than Unicode have, mostly, natural relations to
> languages and to fonts. For those charsets one could infer from the
> charset parameter what fonts might be needed.
>
> Unicode is a different story. If the
> document charset of HTML is to be Unicode, then anyone can hand
> me a valid, conforming HTML doc that has characters in it I won't
> be able to render unless I have a full set of glyphs for all
> 65,500+ characters. Most of us won't. How do we manage that
> practically? How do I determine, without parsing the doc, what
> range of 10646 it uses? or do I have to live with not being
> able to do that? (I'm just exploring this issue, not taking a
> side.)

Yeah... what he said.

These are some of the issues behind my resistance to making
ISO10646 _the_ document character set for HTML 2.0. In some ways, it
boils down to: we just don't have experience with this beast.

Gavin Nicol (and others) have experience with this beast. And he's
trying to share it with us in his paper:

The Multilingual World Wide Web
Thu Apr 27 03:11:08 1995
http://www.ebt.com:8080/docs/multilingual-www.html

I'm encouraged by the broad range of issues discussed in the paper,
but I am unable to evaluate the specifics of his proposed solution,
because his terminology seems inconsistent to me:

For example:

Character Code
An integer which uniquely identifies a character within a character
repertoire.
Character Repertoire
A set of characters used together. Meanings are defined for each
character, and possibly for control sequences

How does an integer (a character code) identify an element of a set (a
character repertoire)? Sets have no "natural" order or index. To
assign a number to each element of a set is to make a function, which
I refer to as a "coded character set."
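
To make the distinction concrete, here is a rough sketch in Python. The
names (repertoire, coded_set_x, coded_set_y, is_coded_character_set)
are mine, made up for illustration; they come from neither paper. The
point is only that the same repertoire can carry more than one
assignment of integers, so a "character code" means nothing until you
say which function you are using.

```python
# A character repertoire: an unordered set of abstract characters.
repertoire = {"A", "B", "C"}

# Two different coded character sets over the same repertoire.
# Each is a function (here a dict) from character to integer code.
coded_set_x = {"A": 65, "B": 66, "C": 67}     # the ASCII assignment
coded_set_y = {"A": 193, "B": 194, "C": 195}  # the EBCDIC assignment


def is_coded_character_set(mapping, repertoire):
    """True if mapping gives every character in repertoire a distinct integer."""
    codes = list(mapping.values())
    covers_repertoire = set(mapping) == repertoire
    codes_are_unique = len(set(codes)) == len(codes)
    return covers_repertoire and codes_are_unique


if __name__ == "__main__":
    for name, cs in (("x", coded_set_x), ("y", coded_set_y)):
        print(name, is_coded_character_set(cs, repertoire), cs["A"])
```

Both mappings pass the check, yet the code for "A" differs between
them: the integer identifies a character only relative to the mapping,
not relative to the set.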

I have had feedback that the recent (significantly revised) version of
my "'Character Set' considered Harmful" draft is an effective
discussion of the issues:

"Character Set" Considered Harmful
Tue May 2 06:06:12 1995
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html

aka:
ftp://ds.internic.net/internet-drafts/draft-ietf-html-charset-harmful-00.txt
May 2, 1995

Regarding ISO-2022-JP...

> Encoding is one thing, glyphs are another. I need glyphs to render.
> If I can encode Hindi in the document charset 10646 using iso-2022-jp,
> which I have been led to believe I can do,
                    ^^^
                    misled.

> or will any encoding of any 10646 content
> using iso-2022-jp be limited somehow to the Japanese portion
> (if there is such a concept) of 10646?

That's more like it, at least the way I understand it. The ISO-2022-JP
character encoding scheme "spans" a certain character repertoire:
something like the 96 ASCII chars + a bunch of Japanese characters.
Each of those characters also has a home in ISO10646; i.e. the
repertoire of ISO-2022-JP is a subset of the repertoire of ISO10646.
Apparently, there are even some declarations about subsets of
ISO10646. I'm sure Glenn can fill in the details.
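
The subset relation can be checked mechanically. The sketch below
assumes a Python interpreter, whose strings are sequences of Unicode
code points (standing in for ISO10646 here) and whose standard library
ships an "iso2022_jp" codec. The helper name in_repertoire is
hypothetical.

```python
def in_repertoire(text, encoding="iso2022_jp"):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


japanese = "\u65e5\u672c\u8a9e"                  # "nihongo" in kanji
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"   # "hindi" in Devanagari

if __name__ == "__main__":
    print(in_repertoire("Hello"))   # ASCII part of the repertoire
    print(in_repertoire(japanese))  # Japanese part of the repertoire
    print(in_repertoire(hindi))     # Devanagari is outside the repertoire
    # Round trip: each ISO-2022-JP character has a home code point.
    decoded = japanese.encode("iso2022_jp").decode("iso2022_jp")
    print([hex(ord(c)) for c in decoded])
```

The ASCII and Japanese strings encode and round-trip back to the same
code points; the Hindi string is rejected, which is the sense in which
ISO-2022-JP reaches only a subset of the larger repertoire.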

How this works with HTTP and real world clients is something I'd like
to see. I'm not convinced Accept-Charset is sufficiently powerful _or_
sufficiently simple.
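
For what it's worth, here is roughly what a server would have to do
with Accept-Charset, sketched in Python. This assumes a comma-separated
header with optional ";q=" weights; the function names
(parse_accept_charset, choose_charset) and the sample header are
hypothetical, and this is not taken from any real server.

```python
def parse_accept_charset(header):
    """Parse 'a, b;q=0.8' into [(charset, q)], most preferred first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # assumed default weight when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs.append((parts[0].lower(), q))
    # sorted() is stable, so equal weights keep their header order.
    return sorted(prefs, key=lambda pref: -pref[1])


def choose_charset(header, text):
    """Pick the most preferred charset that can encode text, else None."""
    for charset, q in parse_accept_charset(header):
        if q <= 0:
            continue
        try:
            text.encode(charset)
            return charset
        except (UnicodeEncodeError, LookupError):
            continue  # cannot represent the text, or unknown charset name
    return None


if __name__ == "__main__":
    header = "iso-8859-1, iso-2022-jp;q=0.8"
    print(choose_charset(header, "\u65e5\u672c\u8a9e"))
    print(choose_charset(header, "\u0939\u093f\u0928\u094d\u0926\u0940"))
```

Note that the server still has to look at the document's actual
characters to decide, and the client has no way to say "I have glyphs
for this range but not that one"; the header names encodings, not
fonts.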

Dan