Re: partial draft: "Character Set" Considered Harmful

Dan Connolly (
Sun, 9 Apr 95 08:52:20 EDT

Gavin Nicol writes:
> >for the purpose of this thread at least, can we use the terminology
> >set forth in the draft?
> Fine. I understand that the draft is aimed at establishing a
> generally useful set of definitions. The set of definitions is not
> overly useful unless they can be used to guide practise however.
> I understand the difference between a coded character set and a
> character set quite well. The 3 points I wish to make are:
> 1) That an encoding maps octets to characters in a character set.

You persist in using the ill-defined term "character set." Until you
supply a rigorous definition of this term, I am unable to evaluate
your conjectures.

> >Please define "escape sequences."
> A control sequence where the first character is ESC (Goldfarb
> 267:18). A control sequence is "a sequence of characters beginning
> with a control character that controls the interpretation,
> presentation, or other processing of characters that follow
> it". (Goldfarb 259:20).
> >It makes no sense to speak of replacing the characters in an entity
> >with escape sequences. That would be splicing octets into a sequence
> >of characters.
> Except that numeric character references do much the same thing. If
> SGML used ESC [ NNNN ; instead of &#NNNN; how do they differ? The
> SGML parser never sees these things.

Whoa... first, I was speaking of an escape sequence as a sequence of
octets in the input of an encoding. You're using it to mean a sequence
of characters, the first of which is an escape character. Fine. I'll
adopt your definition for now. But in this case, the SGML parser does
see ESC [ NNN, just as it sees &#NNN;. These are both forms of
markup. (The &#NNN; is anyway...) The input to the parser is a
sequence of characters, no?
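
A minimal sketch of the two layers, in Python. The function names are made up for illustration, and a real SGML parser does far more than this. The encoding turns octets into characters first. The parser then scans characters, so both &#233; and a character-level ESC [ 233 ; are in its input:

```python
import re
from typing import List


def decode_entity(octets: bytes, charset: str) -> str:
    """Layer 1 (hypothetical helper): the encoding maps octets to characters."""
    return octets.decode(charset)


def find_char_refs(chars: str) -> List[str]:
    """Layer 2 (hypothetical helper): scan the *characters* for &#NNN; markup."""
    return re.findall(r"&#[0-9]+;", chars)


# One entity containing a numeric character reference and, after it,
# the characters ESC [ 2 3 3 ; (an "escape sequence" in Goldfarb's sense).
octets = b"caf&#233; \x1b[233;"
chars = decode_entity(octets, "iso-8859-1")
refs = find_char_refs(chars)

print(refs)                  # the reference is recognized as markup
print("\x1b[233;" in chars)  # the ESC sequence is also in the parser's input
```

Whether the parser treats the ESC sequence as markup or as data is a separate question. Either way it arrives as characters, not as octets.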

> >> As I noted in my paper, if we assume that documents encoded using
> >> ISO 8859-1 also have a document character set of ISO 8859-1, then
> >> numeric character references will be forever broken.
> >
> >Broken in what sense? The answer to "how do I put unicode in a
> >charset=ISO-8859-1 message entity?" is: "Don't do that, at least in
> >the HTML world." If an HTML user agent is capable of dealing with
> >Unicode, then it must support an encoding such as UCS-2 or UTF-8.
> Broken in the sense that there will be no mechanism at all that allows
> the character codes used in the numeric character references to be
> invariant across encodings *unless* the document character set is also
> specified. I do not think this behaviour is desirable

Ok... so we've changed from "broken" to "undesirable." I cannot
dispute that you find this behaviour undesirable. I don't think it's
desirable to require all SGML documents (not just HTML) sent over the
net from this day forward to use ISO 10646 exclusively. Consider the
cost of re-writing numeric character references in existing SGML
documents, for example.
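
To make the dependence concrete, here is a minimal sketch in Python. The name resolve_char_ref is made up for illustration, and the sketch assumes a single-octet coded character set, so that a code position fits in one octet. The same reference &#233; names a different character depending on which document character set is in force:

```python
def resolve_char_ref(number: int, document_charset: str) -> str:
    """Return the character at code position `number` in the given
    document character set.

    Assumption: the set is a single-octet coded character set known to
    Python's codec registry, so one octet holds the code position.
    """
    return bytes([number]).decode(document_charset)


# &#233; under two different document character sets
latin1 = resolve_char_ref(233, "iso-8859-1")    # LATIN SMALL LETTER E WITH ACUTE
cyrillic = resolve_char_ref(233, "iso-8859-5")  # CYRILLIC SMALL LETTER SHCHA
print(latin1, cyrillic)
```

Moving an existing document to a different document character set would mean renumbering every such reference in it.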

> (actually, I
> only object to the cases where there is no definition of the document
> character set).

In what case is there no definition of the document character set? Any
representation of SGML in MIME must specify a mechanism for
determining the document character set. My proposal for HTML is to map
charset=ISO-8859-1 and charset=US-ASCII (implicit or explicit) to the
ISO-8859-1 document character set, and to flesh out this mapping
convention over time.
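
As a rough sketch of that mapping convention, in Python: the table and function names are hypothetical, only the two entries named above are filled in, and treating a missing charset parameter as US-ASCII is an assumption of the sketch.

```python
# Hypothetical table: MIME charset parameter -> document character set.
# Only the two entries named in the proposal are filled in; the idea is
# that more would be added as the convention is fleshed out.
DOCUMENT_CHARSET = {
    "iso-8859-1": "ISO-8859-1",
    "us-ascii": "ISO-8859-1",
}


def document_charset(charset_param=None):
    """Pick the document character set for a message entity.

    Assumption: a missing charset parameter is treated as US-ASCII.
    """
    name = (charset_param or "us-ascii").strip().lower()
    if name not in DOCUMENT_CHARSET:
        raise LookupError("no document character set defined for charset=" + name)
    return DOCUMENT_CHARSET[name]


print(document_charset())              # implicit charset
print(document_charset("US-ASCII"))    # explicit charset
print(document_charset("ISO-8859-1"))
```

A charset with no entry raises an error here rather than guessing, which matches the point that every representation has to define the document character set somehow.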

> >>To overcome this, my proposal is centered around making ISO 10646 the
> >>document character set for HTML
> >
> >This doesn't sound like a good idea at all, to me. It's too draconian.
> >I suggest we let folks use what they like, and see if Unicode comes
> >out as the most popular coded character set or character encoding by
> >its own merits.
> Dan, I gather that you had not read my paper at this point. I don't
> think it appropriate to comment on it without reading it.

I read the paper. I object to two issues in your paper:

* your proposal requires all HTML documents to have a document
character set of ISO 10646. I believe this is gross overspecification.

* you use the same lack of precision in your discussion of characters
and their encodings as all the other ISO documents that got us
into the current mess.

> By using ISO 10646 as the document character set, we are providing an
> abstract processing model that parsers should respect. That is
> all. Implementations are free to interpret the model, or restrict
> themselves, so long as the rules of SGML are not broken, and SGML
> allows a wide range of accepted behaviour.

My proposal also provides an abstract processing model that parsers
should respect. It achieves the same goals without eliminating the
possibility of using, e.g. EBCDIC or ISO-8859-5 as a document
character set.

> This discussion should probably be taking place in html-wg.

I've copied html-wg on this.