Re: charset parameter (long)

Daniel W. Connolly (connolly@hal.com)
Mon, 16 Jan 95 12:21:13 EST

In message <9501161151.AA03123@dragget.hpl.hp.com>, "Dave Raggett" writes:
>Larry writes:
>
>> For the discussion on character sets, think of documents being
>> represented at three levels:
>
>> entity: A stream of entities,
>> represented in SGML by
>> character: a stream of characters,
>> represented by a character encoding (charset) by
>> byte: a stream of bytes.
>
>> Don't try to say things about changes in the byte->character level by
>> declarations in the character->entity stream.
>
>I am not quite sure I agree with this, as I want to include in the byte
>stream, information about direction, changes to character set, language
>and so on.

[...]

>The bottom line is: let's leave info about directionality and multiple
>character sets out of the SGML markup, and instead put it where it
>belongs, in discussions about the character transfer stream.

Hmmm... sounds to me like you very much agree with Larry, as do I.

Gory details:

> SGML assumes fixed width characters according to the
>definitions in the <!SGML> declaration.

I don't find any basis in fact for this. The <!SGML> declaration
specifies a _character_set_, that is, a mapping from integers to
characters. (_not_ from bytes to characters!).

As an example, it is entirely possible that a document might specify
the ISO-646IRV character set in its <!SGML> declaration, and yet it
might be _encoded_ in EBCDIC on the disk. In such a document, the
character sequence:

&#65;

would be encoded as the EBCDIC bytes for '&', '#', '6', '5', ';', but
to find the meaning of the markup, you'd consult the <!SGML>
declaration, which specifies ISO-646, so that this markup stands for an
'A' character.

So the encoding of the characters is independent of their position
in the character set.
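
A minimal sketch of that example in Python, under two assumptions that
are not part of the original discussion: the cp037 codec stands in for
"EBCDIC" (one of several EBCDIC code pages), and the ascii codec stands
in for ISO-646IRV. The helper resolve_char_refs is hypothetical and only
handles decimal references of the form &#N;.

```python
import re

# Assumption: cp037 is used as a representative EBCDIC code page.
stored = "&#65;".encode("cp037")        # bytes as they sit on the disk
ascii_bytes = "&#65;".encode("ascii")   # same characters, different bytes

def resolve_char_refs(text, doc_charset="ascii"):
    """Replace each &#N; with character number N of the document
    character set (hypothetical helper, decimal references only)."""
    def repl(match):
        number = int(match.group(1))
        return bytes([number]).decode(doc_charset)
    return re.sub(r"&#(\d+);", repl, text)

# Step 1 (entity manager): bytes -> characters, using the storage encoding.
chars = stored.decode("cp037")

# Step 2 (parser): character number -> character, using the document
# character set, which is independent of the storage encoding.
resolved = resolve_char_refs(chars)

print(stored != ascii_bytes)  # the byte streams differ
print(resolved)               # the markup still stands for 'A'
```

The number 65 is looked up in the document character set only; the
storage encoding is used once, to get from bytes to characters, and
plays no part after that.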

Encoding techniques such as ISO2022, Unicode-UTF-7, Unicode-UTF-8,
etc. all seem consistent with the above interpretation, and in these
encodings, there isn't necessarily a 1-1 correspondence between bytes
(of any size) and characters.
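
A small Python illustration of that last point, assuming Python's
built-in utf-8, utf-7 and iso2022_jp codecs as stand-ins for the
encoding techniques named above (iso2022_jp is just one ISO 2022
profile). The two-character sample string is arbitrary.

```python
# Two characters: LATIN CAPITAL LETTER A and HIRAGANA LETTER A.
text = "A\u3042"

sizes = {}
for enc in ("utf-8", "utf-7", "iso2022_jp"):
    data = text.encode(enc)
    # Decoding recovers the same two characters in every case.
    assert data.decode(enc) == text
    sizes[enc] = len(data)

# Byte counts vary by encoding; the character count stays at 2.
for enc, n in sizes.items():
    print(enc, "bytes:", n, "characters:", len(text))
```

In each case the same character sequence comes back after decoding,
even though the number of bytes differs from the number of characters.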

>It doesn't support multiple
>character sets as such.

True. An SGML document has exactly one document character set.

> James Clark has a proposal for how to use the
>entity manager to handle these though. A simple approach is to use
>Unicode within the sgml parser, mapping to it from other character sets.

The technique used in Clark's SP parser is only in the entity manager
-- it is used to deal with various character _encodings_ in the entity
manager, not multiple character _sets_ in the parser.

>Another idea, I prefer, is to define the internal character set dynamically
>according to the needs of the external character stream, i.e. the internal
>character set grows to incorporate all the characters needed for that
>document. This approach hides the display direction and other parameters
>from the sgml parser, leaving it up to the formatting code to make use of.

This is pretty much the same as James Clark's technique: the
so-called "internal character set" is just the document character set --
the set of characters presented to the parser. The character set
declaration in the <!SGML> declaration tells the parser which
characters might signal markup, and which ones are just data
characters. It's reasonable to declare characters that mean
nothing to the parser, but mean "change directions" or whatever
to the application. The syntax of such declarations is not something
I'm intimately familiar with, but I'm confident that such things
are expressible, theoretically.

But there are practical considerations: how does an author put one of
these "direction change" characters into a document? I suppose the
issues are already addressed in existing multilingual composition
interfaces, and we just need to find a reasonable representation of
the idioms.

But how many of the existing multilingual composition interfaces are
built on top of ASCII (e.g. TeX)? I suppose it's possible to come up
with some hack where the string "\left-to-right" is consumed by the
entity manager and changed into one character for the purposes
of the parser and formatting application, but I doubt that's a good
idea. Hmmm....
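
For concreteness, a rough Python sketch of what such a hack might look
like. Everything here is an assumption made for illustration: the
function name entity_manager_filter is hypothetical, and U+200E
(LEFT-TO-RIGHT MARK) is just one possible choice for the single
character handed to the parser.

```python
# Assumption: U+200E LEFT-TO-RIGHT MARK is the one character the
# parser and the formatting application get to see.
LRM = "\u200e"

def entity_manager_filter(source):
    """Hypothetical pre-parse step: collapse the ASCII string
    \\left-to-right into a single direction character."""
    return source.replace("\\left-to-right", LRM)

sample = "abc \\left-to-right def"
out = entity_manager_filter(sample)

# The 14-character ASCII string has become one character.
print(len(sample), len(out))
```

One visible cost in this sketch: an author can no longer write the
literal string "\left-to-right" as data without some further escape
convention.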

Dan