Re: format nego in HTML/10646?

Gavin Nicol (gtn@ebt.com)
Sun, 7 May 95 20:43:07 EDT

>I'm encouraged by the broad range of issues discussed in the paper,
>but I am unable to evaluate the specifics of his proposed solution,
>because his terminology seems inconsistent to me:

Well, that was written in the few spare moments I have every day,
rather than as a concentrated effort, so inconsistency is not
surprising. Hopefully, I'll be able to correct many of these problems
as time goes by.

>Regarding ISO-2022-JP...
>
>> Encoding is one thing, glyphs are another. I need glyphs to render.
>> If I can encode Hindi in the document charset 10646 using iso-2022-jp,
>> which I have been led to believe I can do,
> ^^^
>misled.

I'm not sure where Terry got this from, but it *is* possible to
include Hindi in ISO-2022-JP using numeric character references. Not
that you'd *want* to do so, except in very rare cases.
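
If I have the tables right, DEVANAGARI LETTER KA sits at position
2325 in ISO 10646, so a document transmitted as ISO-2022-JP could
still say:

   <P>The first letter of the alphabet is &#2325;.

The bytes on the wire never leave the ISO-2022-JP repertoire; the
reference is resolved against the document character set, which is
ISO 10646.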

>> or will any encoding of any 10646 content
>> using iso-2022-jp be limited somehow to the Japanese portion
>> (if there is such a concept) of 10646?
>
>That's more like it, at least the way I understand it. The ISO-2022-JP
>character encoding scheme "spans" a certain character repertoire:
>something like the 96 ASCII chars + a bunch of japanese characters.
>Each of those characters also has a home in ISO10646; i.e. the
>repertoire of ISO-2022-JP is a subset of the repertoire of ISO10646.

Except where numeric character references are used, the range of ISO
10646 characters a document may access is limited to the repertoire
that the encoding itself can represent.

>How this works with HTTP and real world clients is something I'd like
>to see. I'm not convinced Accept-Charset is sufficiently powerful _or_
>sufficiently simple.

In general, it will be very simple:

1) The client sends a request listing the encodings it can handle
   (via Accept-Charset, though I tend to agree with Larry that
   Accept-Parameter might be better).
2) The server sends back a document in one of the client's desired
   encodings (a sketch of this exchange follows the list).
3) The client decodes the document into integers representing the
   characters of the document. Here, there will generally be two
   options:
   a) Use Unicode internally, in which case it's basically
      table-driven conversion. This is probably the optimal case.
   b) Use various different character sets internally. This is
      generally done by restricting the client to accepting
      US-ASCII supersets (in terms of code points and
      characters), and simply treating all non-markup characters
      as data.
4) Numeric character references are resolved in terms of ISO 10646
   and then mapped to some system representation.
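
To make (1) and (2) concrete, the exchange might look something like
this (the exact header names and syntax are still being debated, so
treat this as a sketch, not a specification):

   Client request:

      GET /doc.html HTTP/1.0
      Accept-Charset: iso-2022-jp, iso-8859-1

   Server response:

      HTTP/1.0 200 OK
      Content-Type: text/html; charset=iso-2022-jp

      ...document encoded in ISO-2022-JP...

The server picks one of the offered encodings and labels the response
accordingly; if it supports none of them, it can fall back to some
default.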

In (3b), which is probably the most common case at the moment, the
range of characters a document can directly access is limited to the
repertoire of the encoding. The only time this limit is exceeded is
when a numeric character reference resolved in (4) names a character
outside that repertoire.
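
For the curious, (3a) amounts to something like the following C
sketch for a simple single-byte encoding (the names and table are
mine, purely for illustration; ISO 8859-1 happens to occupy the first
256 positions of ISO 10646, so its table is the identity, and a
stateful encoding like ISO-2022-JP would need an escape-sequence
state machine on top of this):

   #include <stddef.h>

   /* One 256-entry table per single-byte encoding, mapping each
      byte value to its ISO 10646 code point.  For ISO 8859-1 the
      mapping is the identity; for, say, ISO 8859-7 the upper half
      would point into the Greek block instead. */
   static unsigned long latin1_to_10646[256];

   static void init_latin1_table(void)
   {
       int i;
       for (i = 0; i < 256; i++)
           latin1_to_10646[i] = (unsigned long) i;
   }

   /* Decode n bytes into ISO 10646 code points. */
   static void decode(const unsigned char *bytes, size_t n,
                      unsigned long *out)
   {
       size_t i;
       for (i = 0; i < n; i++)
           out[i] = latin1_to_10646[bytes[i]];
   }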

Once the characters have been decoded, the document is parsed. In the
abstract, the document is being parsed by an SGML parser working with
characters from ISO 10646. In reality, the parser is probably working
with codes, and more often than not, just looking for markup by
treating all non-US-ASCII codes as data. If a numeric character
reference is resolved to a character outside the range representable
by the system, the application can do as it pleases, including
discarding the character, or mapping it to some other character.
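
To illustrate, resolution of a reference like &#2325; might look
roughly like this in C (MAX_SYSTEM_CHAR and FALLBACK_CHAR are my
inventions; the point is only that the number in the reference is an
ISO 10646 code position, and that out-of-range results get
substituted rather than passed through):

   #include <stdlib.h>

   #define MAX_SYSTEM_CHAR 0xFFFFUL   /* largest code we can render */
   #define FALLBACK_CHAR   0x3F       /* '?' substituted otherwise  */

   /* Resolve "&#NNN;" to an ISO 10646 code point, then map it to
      something the system can actually represent.  s points at
      the leading '&'. */
   static unsigned long resolve_ncr(const char *s)
   {
       unsigned long code;

       if (s[0] != '&' || s[1] != '#')
           return FALLBACK_CHAR;          /* not a numeric reference */
       code = strtoul(s + 2, NULL, 10);   /* number is an ISO 10646
                                             code position */
       if (code > MAX_SYSTEM_CHAR)
           return FALLBACK_CHAR;          /* beyond what we can show */
       return code;
   }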

I think Amanda could probably give you a good example of (3a), and
Mosaic L10N represents a good example of (3b). For Mosaic L10N to be
optimal under the above scenario, numeric character reference
resolution should be changed.