Re: New DTD (final version?)

Gavin Nicol (gtn@ebt.com)
Wed, 8 Feb 95 11:44:10 EST

Dan writes:
>This character set stuff is unbelievably obscure. I think there are
>about three people on the planet that understand it (Charles Goldfarb,
>Erik Naggum, and James Clark), and I don't think their understandings
>agree with each other.

Ahh, it's not all that bad. Rick Jelliffe is another expert, and
somewhat more attuned to Asian processing issues.

> 2. Construct an SGML declaration by taking the html.decl
..
> Anyone who knows enough about SGML character sets
> is welcome to step forward and give counterparts

BASESET "ISO 10646//CHARSET UCS-2//EN"
DESCSET 0 65536 0

> to Unicode-UCS-2, Unicode-UTF-8, Unicode-UTF08

These are just different encodings of the same character set...
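For a concrete example: LATIN SMALL LETTER E WITH ACUTE is code
position 233 (0xE9) under all of them; UCS-2 carries it as the two
octets 00 E9, and UTF-8 as C3 A9.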

> 3. Convert the octets to characters via the specified charset.
> For each line of text (delimited by CRLF, that is octets 13,10),
> put an SGML record start character (as per the SGML
> declaration from step 2) at the beginning, and an SGML record
> end character at the end.

You cannot arbitrarily say 13,10. As you later note, step 2 above
might generate something else... if there is one thing which is very
obscure about SGML, it is RS+RE...
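To make the point concrete, here is a minimal sketch (in C) of what
step 3 might look like once the RS and RE code values are taken from
the SGML declaration rather than hardwired; the helper is hypothetical,
not taken from any existing parser:

  #include <stdio.h>

  /* Emit one line of text as an SGML record, using whatever code
   * values the SGML declaration from step 2 assigns to the RS and RE
   * functions (10 and 13 only in the reference concrete syntax, so
   * they must not be hardwired). */
  static void emit_record(FILE *out, const char *line, size_t len,
                          int rs_code, int re_code)
  {
      fputc(rs_code, out);          /* record start */
      fwrite(line, 1, len, out);    /* the line, CRLF already stripped */
      fputc(re_code, out);          /* record end */
  }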

> 5. Now you're done. You have a complete SGML document entity:
> an SGML declaration, a prologue, and an instance. Parse
> as per ISO-8879.

Would that it were so easy! You cannot parse the entity if all you
do is replace the BASESET! What about LCNMSTRT, SHUNCHAR, SEPCHAR, etc.?
You must also generate the character classes for the characters
encountered or you cannot parse the document, and you must be able to
do this for every character set supported. That implies multiple large
tables... or, if you use ERCS with the above BASESET and DESCSET,
and use Unicode for character identification purposes, you can
specify the character class for all characters once (the raw table
an application needs to classify a Unicode character is 94k or so, and
it could be compressed at the expense of a range test). I should note
that DSSSL requires that applications be able to identify characters
according to ISO 10646 or ISO 6429 (section 3.1.3).
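To illustrate the range-test compression just mentioned, here is a
minimal sketch in C; the classes and ranges below are purely
illustrative, not the actual ERCS tables:

  #include <stddef.h>

  /* A few of the distinctions an SGML parser needs (name start
   * characters, name characters, separators, shunned characters,
   * plain data). */
  enum char_class { CC_NMSTRT, CC_NMCHAR, CC_SEP, CC_SHUN, CC_DATA };

  struct class_range {
      unsigned lo, hi;         /* inclusive range of code positions */
      enum char_class cls;     /* class assigned to the whole range */
  };

  /* Tiny illustrative table, sorted by code position; a real one
   * would cover all of ISO 10646. */
  static const struct class_range ranges[] = {
      { 0x0030, 0x0039, CC_NMCHAR },   /* digits 0-9     */
      { 0x0041, 0x005A, CC_NMSTRT },   /* A-Z            */
      { 0x0061, 0x007A, CC_NMSTRT },   /* a-z            */
      { 0x3041, 0x309E, CC_NMSTRT },   /* hiragana       */
      { 0x4E00, 0x9FFF, CC_NMSTRT },   /* CJK ideographs */
  };

  /* Binary search over the ranges: the extra range test is the price
   * paid for compressing the 94k flat table. */
  static enum char_class classify(unsigned ch)
  {
      size_t lo = 0, hi = sizeof(ranges) / sizeof(ranges[0]);
      while (lo < hi) {
          size_t mid = (lo + hi) / 2;
          if (ch < ranges[mid].lo)
              hi = mid;
          else if (ch > ranges[mid].hi)
              lo = mid + 1;
          else
              return ranges[mid].cls;
      }
      return CC_DATA;                  /* everything else is data */
  }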

This does *not* imply that Unicode data must be used, only that the
parser parse the document using the ERCS for character identification
and categorisation purposes. For example, one could use DESCSET to
supply a mapping (either to 10646 or from it) and SYNTAX to refer to
the ERCS (using its public identifier).

In practice, it is probably easier to use ERCS to define the
character class tables, and optimise a parser for it. One could then
use table-driven conversion for things like EUC, SJIS, etc. For
example, in current clients, we have roughly:

-data--->[decoder]-->[categorisation]-->[parser]-->[display]

and if we want to support a new character set, we have to modify the
decoder, the categorisation subsystem, and quite probably the parser
and display subsystem too.

If one uses ERCS though, one gets:

-data--->[decoder]-->[normaliser]-->[categorisation]-->[parser]-->[display]

where the [normaliser] is in fact a table lookup (the tables for SJIS
and whatnot can be compressed down to 30k or so, and in fact, like
DESCSET, the tables could go either way). Now if we want to add support
for a new character set to the above, we get:

-sjis--->[decoder]-->[normaliser]-->[categorisation]-->[parser]-->[display]
-big5--->[decoder]-->[normaliser]/
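To make the [normaliser] concrete, here is a minimal sketch of the
table lookup involved; the three entries are only an illustrative
fragment, since a real SJIS (or Big5, EUC...) table would be
machine-generated:

  #include <stddef.h>

  /* Map code values produced by the decoder to ISO 10646 code
   * positions, which everything downstream (categorisation, parser)
   * then uses for identification. */
  static const struct { unsigned short native, ucs; } sjis_map[] = {
      { 0x8140, 0x3000 },   /* ideographic space          */
      { 0x82A0, 0x3042 },   /* HIRAGANA LETTER A          */
      { 0x889F, 0x4E9C },   /* first kanji of JIS level 1 */
  };

  static unsigned short normalise(unsigned short native)
  {
      size_t i;
      for (i = 0; i < sizeof(sjis_map) / sizeof(sjis_map[0]); i++)
          if (sjis_map[i].native == native)
              return sjis_map[i].ucs;
      return native;   /* fallback for this sketch; a full table has no gaps */
  }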

I won't go into excruciating amounts of detail, but rather, let the
figures speak for themselves, and let inventive minds think of the
optimisations one could perform on the above.

In short: if we want to send out 2.0 and not address character set
issues, fine. Let's leave it ISO 8859-1 centric, and have an SGML
declaration and text in the standard that reflect the fact.

If we do want to define a single SGML declaration that is locale
independent, and which offers a clean and efficient implementation
strategy, let's just use ERCS. I have not seen, and very much doubt I
will ever see, anything superior (sp's model is *much* more
complicated). As noted before, this does *not* mean we are forcing
Unicode use either.

I should note that this fits in very nicely with my earlier proposal
of having Unicode be the lingua franca of the WWW. I have *never*
stated that Unicode is the only character set we should support. Using
the above architecture, it should be easy for browser implementors to
support multiple character sets, of which Unicode would be but one
(and before people moan about how hard it is to implement Unicode, and
about the lack of fonts, I'd like to point out that actually, all the
stuff *is* out there if you look hard and are willing to spend time
integrating it).

I still feel that Unicode is a perfect lingua franca for the WWW. Just
as ISO 8859-1 spurred WWW development by not isolating the European
community, so Unicode can bring together the World at large.

To that end, I stand behind my earlier proposal that all browsers
should at least be able to parse UCS-2 and UTF-8 (dropping UTF-7),
and that all servers should be able to convert to them as required
(and again, the mapping is trivial).
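And to back up the claim that the mapping is trivial, a minimal
UCS-2 to UTF-8 sketch in C (ignoring byte-order detection and anything
outside the 16-bit range):

  #include <stddef.h>

  /* Convert a buffer of UCS-2 code values to UTF-8 octets.  Returns
   * the number of octets written; `out' must hold up to 3 octets per
   * input character. */
  static size_t ucs2_to_utf8(const unsigned short *in, size_t n,
                             unsigned char *out)
  {
      size_t i, o = 0;
      for (i = 0; i < n; i++) {
          unsigned c = in[i];
          if (c < 0x80) {                          /* 1 octet  */
              out[o++] = (unsigned char)c;
          } else if (c < 0x800) {                  /* 2 octets */
              out[o++] = (unsigned char)(0xC0 | (c >> 6));
              out[o++] = (unsigned char)(0x80 | (c & 0x3F));
          } else {                                 /* 3 octets */
              out[o++] = (unsigned char)(0xE0 | (c >> 12));
              out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
              out[o++] = (unsigned char)(0x80 | (c & 0x3F));
          }
      }
      return o;
  }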

Let's forget it for 2.0, but we *must* solve this problem in 2.1, and
any handwaving at all will be both inexcusable and irresponsible.
The two things we should put into place now are the charset=xxxxx
parameter on the content type specifier, and the Accept-Charset:
parameter in HTTP.
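Concretely, that means exchanges along the lines of

  Content-Type: text/html; charset=ISO-8859-1
  Accept-Charset: ISO-8859-1, UTF-8

where the charset tokens themselves are just placeholders for whatever
names the IETF registry actually defines.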

It might also be a good idea to take the list of IETF-registered
character sets, describe what would be required to allow HTML data
to be parsed in an SGML-conformant manner, and make it available
separately from the actual spec. Perhaps someone could volunteer to
coordinate the effort (I don't have time...).

--
Gavin "Do I get gadfly status yet?" Nicol
NOT speaking for EBT!