Re: Charset parameter

Gavin Nicol (gtn@ebt.com)
Mon, 12 Dec 94 20:34:42 EST

>One approach I am investigating involves defining a pseudo character set
>in the SGML declaration, and assigning these character codes on the fly.
>These codes are used as offsets into an array of character info which holds

There is a proposal going to SGML Open from a fellow in Australia that
might be of interest to you. The proposal outlines an Extended
Concrete Syntax that defines a 16-bit CHARSET.

The core concept is that at the lowest level in the parser, you have a
"normalizer" which converts from the data storage format into the
document character set. This is roughly akin to my proposal, but
generalizes it so that it *should* be possible to mix encodings and
character sets, and let the normalizer take care of all the nasty
details.
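
To make the normalizer idea concrete, here is a minimal sketch in
Python. It is not taken from the SGML Open proposal: the function names
normalize and normalize_mixed, the (bytes, encoding) segment layout, and
the choice of 16-bit code units as the document character set are all
assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def normalize(data: bytes, storage_encoding: str) -> List[int]:
    """Convert stored bytes into a list of 16-bit code units.

    Hypothetical helper: decodes from the storage encoding, then
    re-encodes as big-endian UTF-16 (no byte order mark) and splits
    the result into 16-bit integers.
    """
    text = data.decode(storage_encoding)
    utf16 = text.encode("utf-16-be")
    # Characters outside the 16-bit range would come out as two units
    # (a surrogate pair); this sketch does not treat them specially.
    return [int.from_bytes(utf16[i:i + 2], "big")
            for i in range(0, len(utf16), 2)]


def normalize_mixed(segments: Iterable[Tuple[bytes, str]]) -> List[int]:
    """Normalize a sequence of (bytes, encoding) segments into one stream.

    Assumes the storage layer already knows where each encoding starts
    and ends; how that is signalled is outside this sketch.
    """
    units: List[int] = []
    for data, encoding in segments:
        units.extend(normalize(data, encoding))
    return units


# Three segments stored in three different encodings.
segments = [
    ("abc".encode("latin-1"), "latin-1"),
    ("日本".encode("shift_jis"), "shift_jis"),
    ("é".encode("utf-8"), "utf-8"),
]
units = normalize_mixed(segments)
print([hex(u) for u in units])
```

The parser above this layer would only ever see the list of integers,
so it never needs to know which storage encodings were involved.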

Your proposal is roughly equivalent to mine: I suggested using UTF-8
(or UTF-7) automatically generated from the native encoding. The
generated data would include automatically generated language and
presentational hints. The parser would see 16-bit data only. From what
I can see, SGML is targeted toward fixed-width characters, and
encodings like ISO 2022. It is probably not possible to handle certain
encodings at all using the SGML declaration. I should take some time
and write up my proposal in far clearer terms...
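
As a rough illustration of the fixed-width point, the Python sketch
below transcodes a native encoding to UTF-8 and compares how many bytes
each character occupies. The helper names transcode_to_utf8 and
bytes_per_character are hypothetical, Shift_JIS is just one example of
a variable-width native encoding, and the language and presentational
hints mentioned above are left out because their format is not
specified here.

```python
from typing import List


def transcode_to_utf8(native: bytes, native_encoding: str) -> bytes:
    """Regenerate the data as UTF-8 from its native encoding."""
    return native.decode(native_encoding).encode("utf-8")


def bytes_per_character(text: str, encoding: str) -> List[int]:
    """Return the encoded size in bytes of each character of text."""
    return [len(ch.encode(encoding)) for ch in text]


sample = "A日"
native = sample.encode("shift_jis")

# The UTF-8 form is what would be stored or exchanged.
utf8 = transcode_to_utf8(native, "shift_jis")
print(utf8 == sample.encode("utf-8"))

# Widths vary per character in Shift_JIS and UTF-8, but are constant
# for these characters once expanded to 16-bit units.
for encoding in ("shift_jis", "utf-8", "utf-16-be"):
    print(encoding, bytes_per_character(sample, encoding))
```

Only the last form has the same width for every character, which is the
shape a parser built around fixed-width characters expects.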

James Clark's SP is also very good. It takes a similar approach, which
I think is the one recommended by most SGML groups.