Re: charset parameter (long)

Gavin Nicol (gtn@ebt.com)
Sun, 15 Jan 95 13:00:54 EST

>>Also, do we really want to get into the business of multi-charsets w/in 1
>>document??
>
>Emphatically yes!

Well, even if we wanted to, we cannot. SGML does not have any way of
defining that a given bit combination belongs to more than one
character class. In other words, documents containing multiple
character sets must be "normalised" *before* the parser sees the
data.
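
To make the point concrete, here is a minimal sketch in Python of what
"normalising" might look like. The helper name normalise() and the
(bytes, charset label) segment format are my own assumptions for
illustration, not part of SGML or any proposal. It shows that the same
bytes decode to different characters under EUC-JP and Shift_JIS, and
that labelled segments can be mapped into one coded character set
(Unicode) before anything is handed to a parser.

```python
# Sketch only: normalise() and the (bytes, label) segment format are
# hypothetical, chosen for illustration.

def normalise(segments):
    """Decode each labelled byte segment and join the results into one
    Unicode string, so the parser sees a single character set."""
    return "".join(raw.decode(charset) for raw, charset in segments)

# The same two bytes mean different things in different encodings,
# which is why the parser cannot be handed a mixture.
ambiguous = b"\xa4\xa2"
as_euc = ambiguous.decode("euc_jp")
as_sjis = ambiguous.decode("shift_jis")
print(as_euc != as_sjis)

# One piece of text carried in three encodings, each labelled out of band.
text = "\u65e5\u672c\u8a9e"
segments = [
    (text.encode("euc_jp"), "euc_jp"),
    (text.encode("shift_jis"), "shift_jis"),
    (text.encode("iso2022_jp"), "iso2022_jp"),
]
unified = normalise(segments)
print(unified == text * 3)
```

Note that the labels have to come from somewhere outside the data
itself; once the segments are decoded, the parser only ever deals with
one character set.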

In my earlier paper I pointed this out, and it is one reason for using
Unicode. As Larry noted, multilingual documents can be written using a
coded character set that includes codes for all the desired languages,
and in no other way.

We emphatically *do* want multilingual capabilities, so we must not
restrict ourselves to US-ASCII or ISO-8859-1, but we most certainly do
not want multiple character sets per document: that path is a long
road leading to madness.

>>I hope not otherwise all the discussion on a header line with the desired
>>charset for negotiating on a preferred format is for
>>nothing. (I ask for a document in EUC but it has JIS or SJIS
>>intermixed; how could I grok those parts?)
>First thing, the different charsets have to be identifiable, and that means
>tagging.

No. As I said before, SGML has no (working) way of handling this. The
data *must* be normalised. Dan has spent a long time making HTML a
conforming application of SGML, and this would invalidate all that
effort (as well as making it *very* difficult to write generic SGML
viewers that could also handle HTML).

Say "yes" to Accept-Charset:
Say "NO" to multiple character sets.

---
Gavin "Not speaking for EBT" Nicol