Re: charset parameter (long)

Albert Lunde (Albert-Lunde@nwu.edu)
Sun, 15 Jan 95 16:32:56 EST

(note cross-posting)

At 1:00 PM 1/15/95, Gavin Nicol wrote on html-wg:
>>>Also, do we really want to get into the business of multi-charsets w/in 1
>>>document??
>>
>>Emphatically yes!
>
>Well, even if we wanted to, we cannot. SGML does not have any way of
>defining that a given bit combination belongs to more than one
>character class. In other words, documents containing multiple
>character sets must be "normalised" *before* the parser sees the
>data.
>
>In my earlier paper I pointed this out, and it is one reason for using
>Unicode. As Larry noted, multilingual documents can be written using a
>coded character set that includes codes for the desired language, and
>in no other way.
>
>We emphatically *do* want multilingual capabilities, so we must not
>restrict ourselves to US-ASCII or ISO-8859-1, but we most certainly do
>not want multiple character sets per document: that path is a long
>road leading to madness.
>
>>>I hope not otherwise all the discussion on a header line with the desired
>>>charset for negotiating on a preferred format is for
>>>nothing. (I ask for a document in EUC but it has JIS or SJIS
>>>intermixed; how could I grok those parts?)
>>First thing, the different charsets have to be identifiable, and that means
>>tagging.
>
>No. As I said before. SGML has no (working) way of handling this. The
>data *must* be normalised. Dan has spent a long time making HTML a
>conforming application of SGML, and this would invalidate all that
>effort (as well as making it *very* difficult to write generic SGML
>viewers that could also handle HTML).
>
>Say "yes" to Accept-Charset:
>Say "NO" to multiple character sets.

I think allowing a document to be in a single character set, drawn from
ISO-8859-X for the same values of "X" that MIME allows, is a fairly
non-controversial extension to HTML/HTTP. (Not for HTML 2.0, but for HTML 2.x.)

Can we cite some outside source for additional character set names, one
that includes Unicode and a reasonable assortment of other national
character encodings not covered by ISO-8859-X (ISO-2022-JP, for example),
so that we don't have to act as the body that picks the allowed character
sets and wind up with yet another WWW-specific variation?

It's more important to pick a well-defined name space than to have all
browsers support everything.

I'm not totally convinced that transferring a whole document in a single
encoding, a la Unicode, is the _only_ way to handle multi-lingual
documents, though I'm not an SGML expert and could use some discussion on
this. At least the characters used in tagging need to be mapped into a
single character set before parsing. (This would seem easier in codes that
have US-ASCII as a proper subset.)
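
To make the point about US-ASCII subsets concrete, here is a rough sketch
(not a definitive test) that checks whether the characters used in markup
keep their US-ASCII byte values in a given encoding. The function name
ascii_transparent and the MARKUP sample are assumptions made up for this
illustration, the codec names are the ones Python's standard library
happens to use, and UTF-16 and EBCDIC appear only as counterexamples.

```python
# Hypothetical sample of characters an SGML/HTML parser must recognise as markup.
MARKUP = '<>&;/="'


def ascii_transparent(charset: str) -> bool:
    """Return True if the markup characters encode to their US-ASCII bytes.

    When this holds, a parser can find tags in the raw byte stream without
    first converting the whole document to another character set.
    """
    try:
        return MARKUP.encode(charset) == MARKUP.encode("ascii")
    except (LookupError, UnicodeEncodeError):
        # Unknown codec, or the codec cannot represent these characters at all.
        return False


if __name__ == "__main__":
    for name in ("iso-8859-1", "euc_jp", "iso2022_jp", "utf-16-be", "cp037"):
        print(name, ascii_transparent(name))
```

For encodings where the check fails, the tagging characters would have to
be remapped before parsing, which is the harder case mentioned above.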

Another possibility would be to define a meta-encoding for multiple
character sets, where the escape codes to shift character sets would not be
represented in _any_ of the character sets. It would then be up to a
multi-lingual HTML implementer to provide a pre-processor to get this
information into a form an SGML parser could deal with (maybe by
normalizing to a combined character set, maybe by adding extra markup).
This does sound less elegant than Unicode, but I'd like to hear more about
why it won't work before ruling it out.
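
As a minimal sketch of the pre-processor idea, under stated assumptions:
the shift marker here is a NUL byte, assumed never to occur inside the
text of any segment; a shift is written as NUL, a charset label in ASCII,
NUL; and the "combined character set" is taken to be Unicode. The names
SHIFT, shift() and normalize() are hypothetical, and no such meta-encoding
has actually been defined anywhere.

```python
SHIFT = b"\x00"  # hypothetical marker, assumed absent from all segment text


def shift(label: str) -> bytes:
    """Build a shift sequence announcing that `label` applies from here on."""
    return SHIFT + label.encode("ascii") + SHIFT


def normalize(stream: bytes, default: str = "us-ascii") -> str:
    """Decode a multi-charset byte stream into one Unicode string.

    The stream is: text in `default`, then any number of
    (shift sequence, text in the announced charset) pairs.
    """
    parts = stream.split(SHIFT)
    # Well-formed input splits into text, label, text, label, ..., text.
    if len(parts) % 2 == 0:
        raise ValueError("unterminated shift sequence")
    out = [parts[0].decode(default)]
    for i in range(1, len(parts), 2):
        label = parts[i].decode("ascii")
        out.append(parts[i + 1].decode(label))
    return "".join(out)


if __name__ == "__main__":
    doc = (
        b"<p>"
        + shift("iso-8859-1") + "caf\u00e9 ".encode("iso-8859-1")
        + shift("euc_jp") + "\u65e5\u672c\u8a9e".encode("euc_jp")
        + shift("us-ascii") + b"</p>"
    )
    print(normalize(doc))
```

The output is a single-charset document that could then be handed to an
SGML parser. This is the "normalize to a combined character set" variant;
the "extra markup" variant would instead emit a tag at each shift.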

---
    Albert Lunde                      Albert-Lunde@nwu.edu