Re: charset parameter (long)

Gavin Nicol (
Mon, 16 Jan 95 17:21:38 EST

Dave Raggett writes:

>I am not quite sure I agree with this, as I want to include in the byte
>stream, information about direction, changes to character set, language
>and so on.

That would be ideal, but somewhat hard to do. What do those extra
bytes mean to the parser?

>SGML assumes fixed width characters according to the
>definitions in the <!SGML> declaration. It doesn't support multiple
>character sets as such. James Clark has a proposal for how to use the
>entity manager to handle these though. A simple approach is to use
>Unicode within the sgml parser, mapping to it from other character

The latter is essentially what ERCS proposes. James Clark's model is
extant within his sp parser, but it seems to require quite complicated

>Another idea, I prefer, is to define the internal character set dynamically
>according to the needs of the external character stream, i.e. the internal
>character set grows to incorporate all the characters needed for that
>document. This approach hides the display direction and other parameters
>from the sgml parser, leaving it up to the formatting code to make use of.

One could implement this idea, but I doubt it would be easy or
efficient. It is not simply a matter of mapping characters to this
dynamic internal table: you must also change the contents of the
tables used for character class mapping. In many cases, you will need
a large dynamic internal table (16 bits) to handle all the languages.
In addition, in the display subsystem, you will either need 16 bits,
or multiple mapping tables. Which is simpler?
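
A minimal sketch of the bookkeeping this implies, in Python. The names
(DynamicCharset, CharClass, intern, bits_needed) are hypothetical and come
from no parser mentioned here, and the sketch assumes the caller already
knows the class of each incoming character. It only illustrates that the
class table has to grow in step with the code table, and that the code
width is set by how many distinct characters the document uses.

```python
from enum import Enum


class CharClass(Enum):
    """Hypothetical, much-reduced set of parser character classes."""
    NAME_START = "name-start"
    DIGIT = "digit"
    SEPARATOR = "separator"
    OTHER = "other"


class DynamicCharset:
    """Internal character set that grows as the document needs it."""

    def __init__(self):
        self._codes = {}    # (external charset, external code) -> internal code
        self._classes = []  # internal code -> CharClass, grown in step

    def intern(self, charset, code, char_class):
        """Return the internal code for an external character, adding it if new."""
        key = (charset, code)
        if key not in self._codes:
            self._codes[key] = len(self._classes)
            # The class table must be extended as well, otherwise the
            # parser cannot classify the newly added character.
            self._classes.append(char_class)
        return self._codes[key]

    def char_class(self, internal_code):
        """Look up the parser class recorded for an internal code."""
        return self._classes[internal_code]

    def bits_needed(self):
        """Number of bits needed to hold every internal code assigned so far."""
        return max(1, (len(self._classes) - 1).bit_length())


table = DynamicCharset()
latin_a = table.intern("ISO-8859-1", 0x41, CharClass.NAME_START)
kanji = table.intern("JIS X 0208", 0x3021, CharClass.NAME_START)
```

Under these assumptions a document that touches more than 256 distinct
characters already needs more than 8 bits per internal code, and every
lookup pays for two tables instead of one.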

>The bottom line is: let's leave info about directionality and multiple
>character sets out of the SGML markup, and instead put it where it
>belongs, in discussions about the character transfer stream.

I tend to agree with this. One reason I like the idea of using two
codes from the Private Use Area is that we can get both the high
level and the low level for little extra cost.