Re: charset parameter (long)

Gavin Nicol (gtn@ebt.com)
Sun, 15 Jan 95 13:18:41 EST

>>Trying to handle character-encoding at the SGML level is a mistake,
>>because the SGML tags themselves are represented in the overall charset.
>
>Actually, only in the ASCII subset of the overall charset, which buys you
>some freedom.

In *HTML* this is true. In *SGML* (in which HTML is defined), this is
not. One reason is to allow people to mark up text using their native
language. Imagine if I forced you to mark up all documents using Kanji
tags....

>If ASCII remains ASCII (same bit pattern) across charset
>switches, you're safe, your parser will still recognize your tags.
>Otherwise, I agree that there is a problem, that needs to be addressed at
>*some* level.

Multiple character sets must be handled *before* the parser even sees
the characters.
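
A minimal sketch of what I mean, in Python, assuming the charset is
already known from the transport (e.g. a charset parameter). The helper
name find_tags and the regex standing in for a "parser" are purely
illustrative, not part of any proposal. The point is only the ordering:
bytes are decoded to characters first, and tag recognition happens on
characters. In a stateful encoding such as ISO-2022-JP, a byte equal to
ASCII "<" can occur inside a two-byte character, so scanning the raw
bytes for tags would not be safe.

```python
import re

# Illustrative stand-in for a parser: it only recognises simple <name> tags.
TAG_RE = re.compile(r"<(/?[A-Za-z][A-Za-z0-9]*)>")

def find_tags(raw, charset):
    """Decode the byte stream first, then scan characters (not bytes)."""
    text = raw.decode(charset)   # all charset handling happens here
    return TAG_RE.findall(text)  # the "parser" only ever sees characters

raw = "<p>日本語</p>".encode("iso2022_jp")
print(find_tags(raw, "iso2022_jp"))  # ['p', '/p']
```

The same function works unchanged for any charset the decoder knows,
which is the attraction of doing the conversion up front.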

BTW, I've been thinking about how best to handle presentation hint
data in my proposal, and one interesting idea I had was to use two codes
from the Private Use Area, and say that they belong to the STAGO and
STAGC classes (like < and >). In SGML, these then *could* be tags in a
DTD, or as is current practise on the WWW, they could be ignored (or
perhaps removed in other systems). As such, we could have the
equivalent of <japanese></japanese>, but we would also have the
ability to unambiguously recognise the tags, and perhaps remove them,
before the parser even saw them. In fact, there was an RFC recently
that defined tags used for languages. Perhaps they could be used for
GIs?
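
To make the idea concrete, here is a rough Python sketch. The two code
points (U+E000 and U+E001) are my assumption for illustration; nothing
above fixes which Private Use Area codes would be chosen, and the
function names are hypothetical. It shows both options: removing the
hint tags before the parser sees them, or rewriting them as ordinary
tags so that a DTD could declare them.

```python
import re

# Assumed code points; any two Private Use Area codes would do.
HINT_OPEN = "\uE000"   # plays the role of STAGO ("<") for hint tags
HINT_CLOSE = "\uE001"  # plays the role of STAGC (">") for hint tags

# A hint tag is everything between the two private-use delimiters.
HINT_RE = re.compile(HINT_OPEN + "([^" + HINT_CLOSE + "]*)" + HINT_CLOSE)

def strip_hints(text):
    """Drop hint tags entirely, as a system that ignores them might."""
    return HINT_RE.sub("", text)

def hints_to_tags(text):
    """Rewrite hint tags as ordinary tags for a DTD-aware parser."""
    return HINT_RE.sub(lambda m: "<" + m.group(1) + ">", text)

doc = ("<p>" + HINT_OPEN + "japanese" + HINT_CLOSE + "日本語"
       + HINT_OPEN + "/japanese" + HINT_CLOSE + "</p>")

print(strip_hints(doc))    # <p>日本語</p>
print(hints_to_tags(doc))  # <p><japanese>日本語</japanese></p>
```

Because the delimiters cannot collide with "<" and ">" in ordinary
text, the pre-pass needs no knowledge of the DTD at all.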

This assumes we're using Unicode for multilingual documents, and
something like the ERCS for the parser, of course...