Re: Charset parameter [Was: Tentative Agenda for IETF meeting ]

Daniel W. Connolly (connolly@hal.com)
Fri, 2 Dec 94 18:19:53 EST

In message <199412022249.OAA28939@rock>, Terry Allen writes:
>
>That's what I meant:
>
><p charset=ISO8859-6>imagine this is really Arabic
></p>
><p charset=ISO8859-1>back to Latin 1.
></p>
>
>Does that imply a MIME charset parameter of
> charset=ISO8859-6, ISO8859-1
>?

The above is not representable in MIME terms, and represents a really
nasty interaction between levels of an SGML system. I suspect that
it conflicts with the SGML standard somehow, though I'm not certain.

Recall the model:

MIME body(sequence of bytes) --entity manager-->
SGML document entity (sequence of chars) --SGML parser-->
ESIS etc. --user agent-->
glyphs on screen

What you've written above implies that the user agent, on seeing a
value of "ISO8859-6" for the charset parameter, tells the entity
manager to change encodings. Blech.

There are techniques where the switch between character encoding modes
is encoded in the byte stream, not in any SGML markup.

It looks something like:

ESC-$-!(imagine this is arabic)ESC-$-)back to latin-1

so the bytes->chars interactions are all within the entity manager.
But it means that the "sequence of characters" data that passes from
the entity manager to the SGML parser has to be able to represent the
union of the Arabic character set and the Latin-1 character set.
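
As a rough sketch of that arrangement (in Python, purely illustrative):
the entity manager below handles the in-band switches itself and hands
the parser a single string of Unicode characters, which is one way to
represent the union of the two character sets. The escape sequences are
the made-up placeholders from the example above, not real ISO 2022
designations, and the name entity_manager is hypothetical.

```python
import re

# Hypothetical switch sequences, copied from the placeholder example in
# this message. They are NOT real ISO 2022 designations.
SWITCHES = {
    b"\x1b$!": "iso8859_6",  # switch to the Arabic encoding
    b"\x1b$)": "latin_1",    # switch back to Latin-1
}

_SWITCH_RE = re.compile(b"|".join(re.escape(seq) for seq in SWITCHES))


def entity_manager(body: bytes, initial: str = "latin_1") -> str:
    """Turn a byte stream with in-band switches into one character sequence."""
    pieces = []
    codec = initial
    pos = 0
    for match in _SWITCH_RE.finditer(body):
        # Decode everything up to the switch with the encoding in force.
        pieces.append(body[pos:match.start()].decode(codec))
        codec = SWITCHES[match.group()]
        pos = match.end()
    # Decode the tail after the last switch.
    pieces.append(body[pos:].decode(codec))
    return "".join(pieces)
```

The switch bytes never reach the parser; it sees only characters, and
no markup attribute is involved in choosing an encoding.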

For example, James Clark's SP parser supports a 16-bit-wide Unicode
pipe between the entity manager and the parser. The entity manager
is equipped to handle a variety of character encodings, but they're
all normalized to Unicode for parsing.

I think.
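
To illustrate the normalization idea (a sketch only, in Python; this is
not SP's actual interface, and both function names are invented for the
example): whatever the external encoding, the entity-manager step emits
16-bit Unicode values, and the parser side only ever compares those
values.

```python
from array import array


def to_unicode_pipe(body: bytes, charset: str) -> array:
    """Hypothetical entity-manager step: bytes in a named charset -> 16-bit units."""
    units = array("H")  # unsigned integers of at least 16 bits
    for ch in body.decode(charset):
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # A plain 16-bit pipe cannot carry this value; out of scope here.
            raise ValueError(f"U+{code_point:X} does not fit in 16 bits")
        units.append(code_point)
    return units


def count_start_delimiters(units: array) -> int:
    """Stand-in for the parser: it looks at Unicode values, never at bytes."""
    return sum(1 for u in units if u == ord("<"))
```

Feeding the same text in as Latin-1 or as UTF-8 should give identical
units, so the parser-side code needs no knowledge of the encoding.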

Dan