Re: partial draft: "Character Set" Considered Harmful

Glenn Adams (glenn@stonehand.com)
Wed, 12 Apr 95 11:26:07 EDT

Date: Wed, 12 Apr 95 08:30:33 EDT
From: connolly@w3.org (Dan Connolly)

I'll infer that we have a document D1, whose document character
set is the coded character set ISO-8859-1, and whose document
entity DE1 is represented as a sequence of octets OS1 using
the ISO-8859-1 character encoding.

I think by DE1 above, you really mean S1, where S1 is the storage object
which represents the entity D1. The entity D1 is already in the document
character set in the sense that when its representation (as S1) is parsed,
characters denoted by coded representations in S1 are to be interpreted
as their equivalent characters in D1.

While D1 is an abstraction expressed in terms of the document
character set which applies to D1, S1 is a concrete instantiation of D1
and is expressed in terms of the system (coded) character set being
applied to S1. On systems or in applications which support only one
coded character set, there is only one possible system coded character
set; otherwise, S1 could conceivably be expressed using multiple coded
character sets. Furthermore, S1 could in fact be expressed using a
character encoding scheme, and not simply a single coded character set.
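
To make the entity/storage-object distinction concrete, here is a rough
sketch in Python (my choice of language and names, purely for illustration;
nothing in it comes from the SGML standard). It assumes a small sample
string standing in for D1 and serializes that one abstract character
sequence as several storage objects, the last using ISO-2022-JP, a character
encoding scheme that switches between coded character sets by means of
escape sequences.

```python
# Illustrative only: entity_text stands in for the abstract entity D1.
entity_text = "caf\u00e9"  # four characters; the last is e-acute

# Two storage objects for the same entity.
s1_latin1 = entity_text.encode("iso-8859-1")  # one octet per character
s1_utf8 = entity_text.encode("utf-8")         # e-acute becomes two octets

# The octets differ, but parsing each storage object with the matching
# encoding recovers the same sequence of characters.
assert s1_latin1 != s1_utf8
assert s1_latin1.decode("iso-8859-1") == s1_utf8.decode("utf-8") == entity_text

# A character encoding scheme can mix coded character sets within one
# storage object: ISO-2022-JP switches between ASCII and JIS X 0208
# using escape sequences (introduced by octet 0x1B).
mixed = "A\u65e5\u672c"
s2 = mixed.encode("iso2022_jp")
assert b"\x1b" in s2
assert s2.decode("iso2022_jp") == mixed
```

In every case the characters of the entity are the same; only the concrete
octets of the storage object change.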

By the way, I think it is more clear if we use the phrase "character
encoding scheme" than simply "character encoding". The problem with the
latter is that it is easy to confuse "character encoding" with "coded
representation". The addition of "scheme" adds the crucial information;
namely, that we are talking about an algorithmic transformation method.

I have suggested that we adopt a convention that the charset=X
parameter, i.e. the character encoding, determines the SGML
declaration (or at least the document character set part of it.)

I think this is problematic in 3 ways:

1. it implies the possibility of multiple document character sets, which
means that numeric character references may not be interpreted globally
and that the embedded character numbers must be translated between these
multiple document character sets.

2. it implies that markup could be expressed in multiple ways, e.g., it
would open the door to using non-ASCII (and non-8859-1) characters in markup
(e.g., for element names).

3. it makes it necessary to maintain and interchange N SGML declarations
(or at least the character set descriptions contained therein).
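
Problem 1 can be sketched as follows. This is a hypothetical Python helper
of my own (resolve_char_refs is not taken from any SGML tool), and it
assumes single-octet coded character sets so that a character number maps
directly to one octet value. The same numeric character reference then
yields a different character depending on which document character set is
in force.

```python
import re

def resolve_char_refs(text, doc_charset):
    """Resolve &#N; references against doc_charset.

    Hypothetical helper: treats N as a position in a single-octet
    coded character set, so it only handles N in the range 0..255.
    """
    def repl(match):
        number = int(match.group(1))
        return bytes([number]).decode(doc_charset)
    return re.sub(r"&#(\d+);", repl, text)

markup = "caf&#233;"

# Same markup, two different document character sets.
as_latin1 = resolve_char_refs(markup, "iso-8859-1")
as_cyrillic = resolve_char_refs(markup, "iso-8859-5")

# Character number 233 is e-acute in ISO-8859-1 but a Cyrillic
# letter (shcha) in ISO-8859-5, so the reference is not global.
assert as_latin1 == "caf\u00e9"
assert as_cyrillic == "caf\u0449"
```

A recipient would therefore have to know the document character set, and
translate the embedded character numbers, before such a reference could be
moved from one document to another.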

I think it is important to recognize that the data being transmitted by
an HTTP "entity-body" is not an entity in the SGML sense as such but a storage
object which represents an entity. Therefore, the CHARSET parameter should
refer to the character encoding scheme (or coded character set, if the scheme
is identical to the principal coded representation form of some coded
character set) which applies to the storage object being interchanged.
The character encoding scheme used in the storage object should be
independent of the document character set, which should be universal if
we can make it so.
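
A minimal sketch of this separation, in Python and under two assumptions of
mine: that the universal document character set is ISO 10646 (which
Python's chr() follows, via Unicode code points), and that
parse_entity_body is a hypothetical name, not an existing API. The CHARSET
parameter is used only to decode the storage object; numeric character
references are then resolved against the one document character set,
whatever the encoding was.

```python
import re

def parse_entity_body(octets, charset):
    """Hypothetical two-step reading of an HTTP entity-body."""
    # Step 1: CHARSET describes the storage object only.
    text = octets.decode(charset)
    # Step 2: character numbers refer to the single, universal
    # document character set (assumed here to be ISO 10646).
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

source = "na\u00efve caf&#233;"

# The same entity shipped as two differently encoded storage objects.
from_latin1 = parse_entity_body(source.encode("iso-8859-1"), "iso-8859-1")
from_utf8 = parse_entity_body(source.encode("utf-8"), "utf-8")

# Both yield the same characters; &#233; means e-acute in each case.
assert from_latin1 == from_utf8 == "na\u00efve caf\u00e9"

# Even in a Cyrillic encoding, &#233; still denotes e-acute.
from_koi8 = parse_entity_body("&#233; \u0449".encode("koi8-r"), "koi8-r")
assert from_koi8 == "\u00e9 \u0449"
```

With this arrangement only one SGML declaration is needed, and character
numbers never have to be translated between document character sets.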

Glenn