Re: format nego in HTML/10646?

Terry Allen (terry@ora.com)
Sun, 7 May 95 22:44:23 EDT

Gavin re Dan:
| >Regarding ISO-2022-JP...
| >> Encoding is one thing, glyphs are another. I need glyphs to render.
| >> If I can encode Hindi in the document charset 10646 using iso-2022-jp,
| >> which I have been led to believe I can do,
| > ^^^
| >misled.
|
| I'm not sure where Terry got this from, but it *is* possible to
| include Hindi in ISO-2022-JP using numeric character references. Not
| that you'd *want* to do so, except in very rare cases.

Yes, that's what I had in mind. That's just what some Japanese-oriented
converter might do when fed the Hindi doc. It was valid HTML
before and is valid HTML now.

So you wouldn't want to commit this translation, and as a doc
producer you would avoid it (even using a special SGML decl to
vet your output) but as a consumer or middleperson you might
receive the result of it. And if the charset parameter describes
only the encoding of 10646, it's a possibly weak indicator of the
code ranges used.

Consider the case of some mythical Tokyo Center
for Linguistic Research. It might well have an archive of docs
in many charsets which it has effortlessly translated to Unicode
and encoded in iso-2022-jp using a nifty tool that does numeric
charrefs automatically. In what charset encoding should it
serve out these 10646-encoded-in-iso-2022-jp files?
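
For what it's worth, such a nifty tool could amount to very little code.
Here's a sketch (in Python; the function name is mine, not anybody's
proposal) that encodes arbitrary 10646 text as iso-2022-jp, falling back
to SGML numeric charrefs for anything outside the encoding's repertoire:

```python
def to_iso2022jp_with_charrefs(text):
    """Encode Unicode text as ISO-2022-JP bytes, replacing any character
    outside the ISO-2022-JP repertoire with an SGML numeric character
    reference (&#n;) giving its ISO 10646 code position."""
    return text.encode("iso2022_jp", errors="xmlcharrefreplace")

# The Japanese stays within the encoding's own repertoire (as escape
# sequences); the Devanagari (Hindi) characters come out as charrefs.
mixed = "\u65e5\u672c\u8a9e and \u0939\u093f\u0928\u094d\u0926\u0940"
print(to_iso2022jp_with_charrefs(mixed))
```

The result is perfectly valid HTML, and nothing in the charset parameter
hints that Devanagari is inside.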

In other words, how is any information communicated that allows format
negotiation over the question "does the client have fonts to render this
document"? or if not "fonts", then "reasonable means"? or do we
have to accept that the world has 65,500 characters that we may
be called upon to render at a moment's notice? or do we make
some practical compromises (such as allowing as how a Hindi
doc in 10646-encoded-in-iso-2022-jp might take forever to render)?

| >> or will any encoding of any 10646 content
| >> using iso-2022-jp be limited somehow to the Japanese portion
| >> (if there is such a concept) of 10646?
| >That's more like it, at least the way I understand it. The ISO-2022-JP
| >character encoding scheme "spans" a certain character repertoire:
| >something like the 96 ASCII chars + a bunch of japanese characters.
| >Each of those characters also has a home in ISO10646; i.e. the
| >repertoire of ISO-2022-JP is a subset of the repertoire of ISO10646.
| Except when numeric character references are used, the range of
| characters from ISO 10646 a document may access will be limited by the
| range of characters the encoding can provide access to.

I agree. But as James has pointed out, we can't define numeric
charrefs out of the language, and indeed they might be useful. Using iso-2022-jp
as an encoding of Unicode for SGML is not as restrictive as using
it as a system charset, and that is where suggested shortcuts
are coming up short.

| >How this works with HTTP and real world clients is something I'd like
| >to see. I'm not convinced Accept-Charset is sufficiently powerful _or_
| >sufficiently simple.

I think we're missing at least one piece. Glenn suggested specifying
the System Declaration, and that might be a good thing.

| In general, it will be very simple:
|
| 1) The client sends a request containing the encodings it can handle
| (via Accept-Charset, though I tend to agree with Larry that
| Accept-Parameter might be better).
| 2) The server sends a document in one of the client's desired
| encodings.
| 3) The client decodes the document into integers representing the
| characters of the document. Here, there will generally be 2
| options:
| a) Use Unicode internally, in which case it's basically
| table-driven conversion. This is probably the optimal case.
| b) Use various different character sets internally. This is
| generally done by restricting the client to accepting
| US-ASCII supersets (in terms of code points, and
| characters), and simply treating all non-markup characters
| as data.
| 4) Numeric character references are resolved in terms of ISO 10646 and
| then mapped to some system representation.

and then *the whole thing* is mapped to some system representation,
but yes, I agree.
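
The consumer's side of steps 3a and 4 can be sketched in a few lines
(Python again; the names are mine, and I've stuck to decimal charrefs,
which is all HTML has). Note the order: decode the transfer encoding
first, then resolve charrefs against 10646 code positions:

```python
import re

def decode_document(raw_bytes, charset):
    """Step 3a: decode the charset-parameter encoding into Unicode."""
    return raw_bytes.decode(charset)

def resolve_charrefs(text):
    """Step 4: resolve decimal numeric character references (&#n;)
    against ISO 10646 code positions."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)

raw = "Hindi letter A: &#2309;".encode("iso2022_jp")
doc = resolve_charrefs(decode_document(raw, "iso2022_jp"))
# doc now contains U+0905, a character the encoding itself cannot carry
```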

| In (3b), which is probably the most common at the moment, the range of
| characters a document can directly access is limited by the range of
| characters in the encoding. The only time this will be violated is
| when (4) accesses a character not within the allowed range of
| characters in the encoding.

Right. And if 10646 is to become the doc charset, those will be
valid charrefs. Fine. I just want to know how I can tell from
the charset parameter whether I have fonts for all the characters
in the doc. Certainly if the value is iso-2022-jp that's a strong
indication that the doc could be rendered with fonts for Japanese,
but then again anything might be in there via numerical charrefs.
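
The only reliable check I can see is after the fact: decode the doc,
resolve the charrefs, and see whether every character lands in a range
you have fonts for. A sketch (the ranges here are made up to stand in
for a Japanese-oriented client's real font coverage):

```python
def unrenderable(text, font_ranges):
    """Return the set of characters no font range claims to cover.
    font_ranges: list of (lo, hi) inclusive ISO 10646 code ranges."""
    missing = set()
    for ch in text:
        cp = ord(ch)
        if not any(lo <= cp <= hi for lo, hi in font_ranges):
            missing.add(ch)
    return missing

# Hypothetical client: Latin-1 plus kana and CJK ideographs only.
japanese_fonts = [(0x0000, 0x00FF), (0x3000, 0x30FF), (0x4E00, 0x9FFF)]
doc = "\u65e5\u672c\u8a9e and \u0905"  # a &#2309; charref, resolved
print(unrenderable(doc, japanese_fonts))  # the Devanagari A falls out
```

But of course by then you've already fetched the document, which is
exactly what format negotiation was supposed to let you avoid.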

-- 
Terry Allen  (terry@ora.com)   O'Reilly & Associates, Inc.
Editor, Digital Media Group    101 Morris St.
			       Sebastopol, Calif., 95472
occasional column at:  http://gnn.com/meta/imedia/webworks/allen/

A Davenport Group sponsor. For information on the Davenport Group see ftp://ftp.ora.com/pub/davenport/README.html or http://www.ora.com/davenport/README.html