Re: charset parameter (long)

Gavin Nicol (
Mon, 16 Jan 95 17:54:05 EST

>> SGML assumes fixed width characters according to the
>>definitions in the <!SGML> declaration.
>I don't find any basis in fact for this. The <!SGML> declaration
>specifies a _character_set_, that is, a mapping from integers to
>characters. (_not_ from bytes to characters!).

In chapter 13 of Goldfarb (I forget the exact section, but around 13.3
or so) he essentially says that a character set is a mapping from
codes of a fixed number of bits to characters, and that there must be
a single document character set. There is a mechanism defined for
handling multibyte encodings, but it doesn't really work.

>As an example, it is entirely possible that a document might specify
>the ISO-646IRV character set in its <!SGML> declaration, and yet it
>might be _encoded_ in EBCDIC on the disk. In such a document, the
>So the encoding of the characters is independent of their position
>in the character set.

Quite correct. The "normalisation" process I refer to is conversion
from a particular encoding into the document character set (and
Unicode defines a canonical ordering as well). For example, if a
document is in SJIS, and the document character set (from the SGML
declaration) is Unicode, some conversion is required.
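
A minimal sketch of this normalisation step, in Python. The function
name `normalise` is an assumption made for illustration, and Python's
built-in codecs stand in for whatever an entity manager would really
use. It shows that the storage encoding (ASCII, EBCDIC, SJIS) is
independent of the code numbers in the document character set:

```python
# Sketch only: "normalise" is a hypothetical name, not from any SGML tool.
# It converts bytes in some storage encoding into code numbers of the
# document character set, taken here to be Unicode.

def normalise(raw: bytes, encoding: str) -> list:
    """Decode raw bytes and return document-character-set code numbers."""
    return [ord(ch) for ch in raw.decode(encoding)]

# The letter "A" stored in two different encodings yields the same code.
ascii_codes = normalise(b"A", "ascii")
ebcdic_codes = normalise(b"\xc1", "cp037")  # cp037 is one EBCDIC code page

# Two Japanese characters stored as SJIS take four bytes on disk,
# but normalise to two codes in the document character set.
sjis_bytes = "\u65e5\u672c".encode("shift_jis")
sjis_codes = normalise(sjis_bytes, "shift_jis")
```

After this step the parser only ever sees code numbers, whatever the
bytes on disk looked like.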

>The technique used in Clark's SP parser is only in the entity manager
>-- it is used to deal with various character _encodings_ in the entity
>manager, not multiple character _sets_ in the parser.

Yes, and this is a *very* important distinction to make. I have said
before that I think Unicode should be the lingua franca, and one of
the primary reasons is that we would then get parsers optimised for
Unicode parsing (i.e. the document character set assumed would be
Unicode), and we could convert into that at either the client, or the
server side. I pointed out that if we do it on the server side, it
will simplify the clients by not requiring them to understand, or
potentially need to understand, a *huge* number of possible encodings.

>>Another idea, I prefer, is to define the internal character set dynamically
>>according to the needs of the external character stream, i.e. the internal
>>character set grows to incorporate all the characters needed for that
>>document. This approach hides the display direction and other parameters
>>from the sgml parser, leaving it up to the formatting code to make use of.
>This is pretty much the same as James Clark's technique: the so
>called "internal character set" is just the document character set --

It depends on whether the codes remain the same as in their native
character set, or even if they remain in the same position.
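
To make the distinction concrete, here is a rough Python sketch of one
way a dynamically growing internal character set could behave. The
class `DynamicCharset`, its method names, and the sample codes are all
hypothetical. In this variant the codes do *not* keep their native
values or positions: each new character simply gets the next free
internal code.

```python
# Sketch only: a hypothetical internal character set that grows on demand.

class DynamicCharset:
    def __init__(self):
        self.code_of = {}  # (source charset name, native code) -> internal code
        self.origin = []   # internal code -> (source charset name, native code)

    def intern(self, charset, native_code):
        """Return the internal code for a character, allocating one if new."""
        key = (charset, native_code)
        if key not in self.code_of:
            self.code_of[key] = len(self.origin)
            self.origin.append(key)
        return self.code_of[key]

cs = DynamicCharset()
first = cs.intern("JIS X 0208", 0x467C)   # illustrative native code
second = cs.intern("ISO 8859-1", 0x41)
again = cs.intern("JIS X 0208", 0x467C)   # already known, same internal code
```

Here the native code 0x467C ends up as internal code 0, so anything
downstream of the parser has to go through the `origin` table to
recover what the character originally was.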

>It's reasonable to declare characters that mean nothing to the
>parser, but mean "change directions" or whatever to the application.
>The syntax of such declarations is not something I'm intimately
>familiar with, but I'm confident that such things are expressible,

Sure, this is possible. In fact, the ERCS does this by simply saying
that all Private Use Area codes can only possibly be data. I will
admit that I think it would be simpler to define half a dozen
characters that represent such things, and to use them. However, we
cannot use these codes in an exchange format (or at least, it is
considered bad practise to do so). This is why I think my idea of
using 2 codes in the STAGO and STAGC classes would be good: they could
be handled in 4 possible ways:

1) Have the parser understand them, and the tags they imply.
2) Remove them from the stream altogether.
3) Remove them from the stream and reunite them with the data.
4) Replace them with an application-specific code.

Note that all of these are *implementation* issues. If one uses
Unicode, directionality codes are not required.

>But there are practical considerations: how does an author put one of
>these "direction change" characters into a document? I suppose the
>issues are already addressed in existing multilingual composition
>interfaces, and we just need to find a reasonable representation of
>the idioms.

There is no need for a "direction" code: this can be understood from
the characters themselves. If one uses Unicode, you *do* need a way to
disambiguate some glyph images though.
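
As a rough illustration of reading direction off the characters
themselves, the Python sketch below splits text into left-to-right and
right-to-left runs using only each character's bidirectional class from
the standard `unicodedata` module. The function name `direction_runs`
and the rule that neutral characters join the preceding run are
simplifying assumptions; this is not the full Unicode bidirectional
algorithm.

```python
import unicodedata

def direction_runs(text):
    """Group text into (direction, substring) runs from character properties."""
    runs = []
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):      # Hebrew, Arabic and similar letters
            direction = "rtl"
        elif bidi == "L":            # Latin and other left-to-right letters
            direction = "ltr"
        else:                        # neutrals (spaces etc.) join the previous run
            direction = runs[-1][0] if runs else "ltr"
        if runs and runs[-1][0] == direction:
            runs[-1] = (direction, runs[-1][1] + ch)
        else:
            runs.append((direction, ch))
    return runs

# Latin text, three Hebrew letters, then Latin text again.
runs = direction_runs("abc \u05d0\u05d1\u05d2 def")
```

No explicit "change direction" code appears in the input; the three
runs fall out of the character properties alone.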

I am not besotted with Unicode, but it is a good 95% solution to
multilingual issues, whether at the parsing, display, or transmittal
level, *especially* for SGML. Let's use it.