Re: charset parameter (long)

Dave Raggett
Mon, 16 Jan 95 06:54:14 EST

Larry writes:

> For the discussion on character sets, think of documents being
> represented at three levels:

> entity: A stream of entities,
> represented in SGML by
> character: a stream of characters,
> represented by a character encoding (charset) by
> byte: a stream of bytes.

> Don't try to say things about changes in the byte->character level by
> declarations in the character->entity stream.

I am not quite sure I agree with this, as I want to include in the byte
stream information about direction, changes of character set, language
and so on. SGML assumes fixed-width characters according to the
definitions in the <!SGML> declaration. It doesn't support multiple
character sets as such. James Clark has a proposal for how to use the
entity manager to handle these, though. A simple approach is to use
Unicode within the SGML parser, mapping to it from other character sets.
Another idea, which I prefer, is to define the internal character set
dynamically according to the needs of the external character stream,
i.e. the internal character set grows to incorporate all the characters
needed for that document. This approach hides the display direction and
other parameters from the SGML parser, leaving it up to the formatting
code to make use of them.
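
To make the dynamic idea concrete, here is a rough sketch in Python. It
is only an illustration under my own assumptions, not a description of
any existing parser: the names DynamicCharset and to_internal are
hypothetical, the input is assumed to be already split into
(charset, direction, bytes) segments, and Python's codecs are used
merely as a convenient way to cut bytes into characters.

```python
class DynamicCharset:
    """Hypothetical internal character set that grows on demand."""

    def __init__(self):
        self._code_of = {}  # (charset, direction, char) -> internal code
        self._entries = []  # internal code -> (charset, direction, char)

    def intern(self, charset, direction, char):
        """Return the internal code for a character, adding it if new."""
        key = (charset, direction, char)
        code = self._code_of.get(key)
        if code is None:
            code = len(self._entries)
            self._code_of[key] = code
            self._entries.append(key)
        return code

    def describe(self, code):
        """Used by formatting code to recover charset and direction."""
        return self._entries[code]


def to_internal(segments, table):
    """Turn (charset, direction, raw_bytes) segments into internal codes.

    The parser would see only the returned integers; direction and the
    originating charset stay in the table for the formatter.
    """
    codes = []
    for charset, direction, raw in segments:
        for char in raw.decode(charset):
            codes.append(table.intern(charset, direction, char))
    return codes


table = DynamicCharset()
segments = [
    ("latin-1", "ltr", b"ab"),
    ("iso8859_8", "rtl", b"\xe0"),  # Hebrew alef in ISO 8859-8
    ("latin-1", "ltr", b"a"),
]
codes = to_internal(segments, table)
```

The parser works on a stream of fixed-width integer codes, while
describe() lets the formatting code look up the direction and source
character set of any code later on.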

The bottom line is: let's leave info about directionality and multiple
character sets out of the SGML markup, and instead put it where it
belongs, in discussions about the character transfer stream.

-- Dave Raggett tel: +44 272 228046 fax: +44 272 228003
Hewlett Packard Laboratories, Filton Road, Bristol BS12 6QZ, United Kingdom