Re: charset parameter (long)

Albert Lunde
Sun, 15 Jan 95 16:32:56 EST

(note cross-posting)

At 1:00 PM 1/15/95, Gavin Nicol wrote on html-wg:
>>>Also, do we really want to get into the business of multi-charsets w/in 1
>>Emphatically yes!
>Well, even if we wanted to, we cannot. SGML does not have any way of
>defining that a given bit combination belongs to more than one
>character class. In other words, documents containing multiple
>character sets must be "normalised" *before* the parser sees the
>In my earlier paper I pointed this out, and it is one reason for using
>Unicode. As Larry noted, multilingual documents can be written using a
>coded character set that includes codes for the desired language, and
>in no other way.
>We emphatically *do* want multilingual capabilities, so we must not
>restrict ourselves to US-ASCII or ISO-8859-1, but we most certainly do
>not want multiple character sets per document: that path is a long
>road leading to madness.
>>>I hope not otherwise all the discussion on a header line with the desired
>>>charset for negotiating on a preferred format is for
>>>nothing. (I ask for a document in EUC but it has JIS or SJIS
>>>intermixed; how could I grok those parts?)
>>First thing, the different charsets have to be identifiable, and that means
>No. As I said before. SGML has no (working) way of handling this. The
>data *must* be normalised. Dan has spent a long time making HTML a
>conforming application of SGML, and this would invalidate all that
>effort (as well as making it *very* difficult to write generic SGML
>viewers that could also handle HTML).
>Say "yes" to Accept-Charset:
>Say "NO" to multiple character sets.

I think allowing a document to be in a single character set chosen from
ISO-8859-X, for the same values of "X" allowed in MIME, is a fairly
non-controversial extension to HTML/HTTP. (Not for HTML 2.0, but for HTML 2.x.)
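
To make the single-charset case concrete, here is a minimal sketch of a client decoding a body according to a MIME-style charset parameter. The function name decode_body and the ISO-8859-1 fallback are assumptions for illustration only, not anything HTML or HTTP currently specifies.

```python
from email.message import Message

# Assumed fallback when the Content-Type carries no charset parameter.
DEFAULT_CHARSET = "iso-8859-1"


def decode_body(content_type: str, body: bytes) -> str:
    """Decode `body` using the charset named in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None.
    charset = msg.get_content_charset() or DEFAULT_CHARSET
    return body.decode(charset)


# One document, one declared character set:
text = decode_body("text/html; charset=ISO-8859-2", b"<p>\xb3</p>")
```

The whole body is decoded with one table, so the parser downstream never sees a bit combination with two meanings.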

Can we cite some outside source for additional character set names, one that
includes Unicode and a reasonable assortment of other national
character encodings not covered by ISO-8859-X, such as ISO-2022-JP? Then we
don't have to act as the body that picks the allowed character sets and wind up
with yet another WWW-specific variation.

It's more important to pick a well-defined name space than to have all
browsers support everything.

I'm not totally convinced that transferring a whole document in a single
encoding, a la Unicode, is the _only_ way to handle multi-lingual
documents, though I'm not an SGML expert and could use some discussion on
this. At least the characters used in tagging need to be mapped into a
single character set before parsing. (This would seem easier in codes that
have US-ASCII as a proper subset.)
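
As a rough way to test the "US-ASCII as a proper subset" property, the sketch below checks whether a codec decodes bytes 0 to 127 exactly as US-ASCII does. The helper name is_ascii_superset is hypothetical. The check relies on Python's built-in codec tables, and stateful encodings such as ISO-2022-JP are not well characterised by a byte-by-byte test like this.

```python
def is_ascii_superset(charset: str) -> bool:
    """Return True if bytes 0..127 decode to the same characters as US-ASCII."""
    ascii_bytes = bytes(range(128))
    try:
        return ascii_bytes.decode(charset) == ascii_bytes.decode("ascii")
    except UnicodeDecodeError:
        # Some byte in the 0..127 range is not valid on its own in this codec.
        return False


# Markup such as "<p>" can be recognised before the text is converted
# only when this holds for the document's encoding.
for name in ("iso-8859-1", "iso-8859-7", "utf-16-le", "cp037"):
    print(name, is_ascii_superset(name))
```

For an encoding where this returns False (UTF-16 or EBCDIC, say), the tag characters themselves have to be converted before any parsing can start.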

Another possibility would be to define a meta-encoding for multiple
character sets, where the escape codes to shift character sets would not be
represented in _any_ of the character sets. It would then be up to a
multi-lingual HTML implementer to provide a pre-processor to get this
information into a form an SGML parser could deal with (maybe by
normalizing to a combined character set, maybe by adding extra markup).
This does sound less elegant than Unicode, but I'd like to hear more about
why it won't work before ruling it out.
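
A rough sketch of the pre-processor idea, under heavy assumptions: the shift information is modelled out-of-band as a list of (charset, bytes) segments, which is a hypothetical stand-in for whatever escape mechanism a real meta-encoding would define, and the "combined character set" is taken to be Unicode. The name normalize is likewise made up for illustration.

```python
from typing import Iterable, Tuple


def normalize(segments: Iterable[Tuple[str, bytes]]) -> str:
    """Decode each segment with its own charset and join the results.

    The output is a single Unicode string, so a parser reading it sees
    exactly one meaning for every code.
    """
    return "".join(data.decode(charset) for charset, data in segments)


# Hypothetical mixed document: ASCII markup, Latin-1 text, Greek text.
document = [
    ("us-ascii", b"<p>"),
    ("iso-8859-1", b"caf\xe9 "),
    ("iso-8859-7", b"\xe1\xe2"),
    ("us-ascii", b"</p>"),
]
print(normalize(document))  # prints "<p>café αβ</p>"
```

Note that the byte 0xE1 means one thing in ISO-8859-1 and another in ISO-8859-7, so the conversion has to happen before the parser runs, which is the normalisation step Gavin describes.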

    Albert Lunde