Comments on: "Character Set" Considered Harmful

Dan Connolly (connolly@w3.org)
Sun, 16 Apr 95 23:44:11 EDT

James Clark writes:
> Dan has the following definitions:
>
> coded character set
> A function whose domain is a subset of the integers, and whose range
> is a set of characters.
>
> [Perhaps "non-negative integer" instead of "integer"?]

I could make this distinction, but I don't believe I make use of it,
so it seems unnecessary. I could also say "finite". I'll have to
look this stuff over and see if there are hidden assumptions that
numbers are non-negative or sets are finite...

> I would also propose:
>
> transformation format
> a function whose domain is the set of sequences of octets, and whose
> range is the set of sequences of non-negative integers

Sounds good. It does complete some puzzles. In the interest of
"sufficiently expressive," I'll try to work it into the draft.
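
As an illustration only, the two definitions can be sketched in Python: a
coded character set as a finite mapping from integers to characters, and a
transformation format as a function from octet sequences to sequences of
non-negative integers. The names ascii_ccs, two_octet_big_endian, and
decode are made up for this sketch and come from no standard.

```python
from typing import Dict, List

# A coded character set: a function from a subset of the integers to
# characters. Modeled here as a finite dict (illustrative name).
ascii_ccs: Dict[int, str] = {n: chr(n) for n in range(128)}


def two_octet_big_endian(octets: bytes) -> List[int]:
    """A hypothetical transformation format: each pair of octets,
    most significant first, yields one non-negative integer."""
    if len(octets) % 2 != 0:
        raise ValueError("expected an even number of octets")
    return [(octets[i] << 8) | octets[i + 1] for i in range(0, len(octets), 2)]


def decode(octets: bytes, ccs: Dict[int, str]) -> str:
    """Compose the two functions: octets -> integers -> characters."""
    return "".join(ccs[n] for n in two_octet_big_endian(octets))
```

For example, decode(b"\x00A\x00B", ascii_ccs) applies the transformation
format first and the coded character set second, which keeps the two
functions separate.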

> Dan also defines:
>
> text entity
> a sequence of characters
>
> and says:
>
> An SGML document is a set of entities, one of which is a text
> entity called the document entity.
>
> It is certainly true that an SGML text entity represents a sequence of
> characters, but I think there's something fundamental missing here:
> the entity represents each character by a single non-negative integer,
> which is mapped onto the character by the coded character set
> described in the document character set section of the SGML
> declaration.

I believe this distinction is artificial and unnecessary. The SGML
standard specifies how to parse characters, not numbers or bit
combinations. All the stuff about bit combinations can be removed from
the SGML standard without changing its meaning.

After all, SGML doesn't specify the representation of entities. SGMLS
shows this: it is a conforming SGML system, yet it munges the
"bit combinations" stored in Unix text files in a way that is
mentioned nowhere in the SGML standard.

It's only a small stretch to imagine an entity manager that could
represent the characters as Pantone colors or sound waves. The
representation of a character is completely out of scope for SGML.
The stuff about bit combinations and graphic code sets is just noise.

This is not to be confused with the correspondence between characters
and numbers via the document character set: even if my document entity
were represented as a sequence of Pantone colors, there would still
be a color that represents each of '&', '#', '6', '5', and ';', and
the sequence of those colors would be markup that is equivalent
to the color corresponding to 'A' (assuming the default SGML declaration).
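
A minimal sketch of that correspondence, assuming a document character set
that covers only the ASCII range. DOC_CHARSET and resolve_numeric_refs are
hypothetical names, and the regular expression handles only decimal
references of the form &#N;.

```python
import re
from typing import Dict

# Assumed document character set: integer -> character, ASCII range only.
DOC_CHARSET: Dict[int, str] = {n: chr(n) for n in range(128)}

_NUMERIC_REF = re.compile(r"&#([0-9]+);")


def resolve_numeric_refs(text: str, charset: Dict[int, str]) -> str:
    """Replace each &#N; with the character the charset assigns to N.

    The input is a sequence of characters; how those characters were
    stored or transmitted plays no part in the lookup.
    """
    return _NUMERIC_REF.sub(lambda m: charset[int(m.group(1))], text)
```

Here resolve_numeric_refs("&#65;BC", DOC_CHARSET) works entirely on
characters: the only numbers involved are the ones spelled out in the
markup itself.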

> So I would prefer to say something like:
>
> text entity
> a sequence of non-negative integers, each of which is
> mapped onto a character by the coded character set
> described in the document character set section of the SGML
> declaration.

I don't understand this definition: do you mean that a text entity is
a sequence of integers, or a sequence of characters, or a document
character set, or some combination of them (such as a tuple)? "the SGML
declaration" of what? Using definite description phrases with no
explicit scope is something I'd rather avoid.

My original definition can be written more formally as:

    X is a text entity iff X is a sequence of characters

and expanded as:

    X is a text entity iff there exist a non-negative integer N and
    a set R such that:
        X : {0,1,2,...,N} -> R

(well... it should say "finite sequence of characters", and the axiom
of set comprehension usually requires that the R set be a subset of
some previously established set, to avoid Russell's paradox. So I need to
postulate a great big set called "the set of characters.")
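
Folding both caveats into the definition gives something like the
following, where C stands for the postulated set of all characters (the
symbol is chosen here only for illustration):

```latex
X \text{ is a text entity} \iff
  \exists N \in \mathbb{N} \;\text{such that}\;
  X : \{0, 1, 2, \ldots, N\} \to C
```

Finiteness follows from N being a single natural number. As written, the
index set always contains 0, so an empty text entity would need separate
treatment.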

In any case, I'd argue that the text entity is the characters, not
their representation. The representation of characters doesn't
necessarily have to have anything to do with the document character
set. I can have a document, stored on disk using the ASCII character
encoding scheme, whose SGML declaration says that the document
character set is EBCDIC. Any numeric character references would be
resolved via EBCDIC.
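
A rough sketch of that situation, assuming Python's "cp037" codec as a
stand-in for EBCDIC (it is one EBCDIC variant among several) and a
hypothetical parse_entity helper. The storage encoding and the document
character set are separate parameters and never interact.

```python
import re


def parse_entity(stored: bytes, storage_encoding: str, doc_charset_codec: str) -> str:
    """Hypothetical helper, not part of any SGML system.

    Step 1: the entity manager turns stored octets into characters.
    Step 2: numeric character references are resolved through the
    document character set, independently of step 1.
    """
    chars = stored.decode(storage_encoding)

    def resolve(match: "re.Match[str]") -> str:
        # Single-octet character sets only: N must be in 0..255.
        return bytes([int(match.group(1))]).decode(doc_charset_codec)

    return re.sub(r"&#([0-9]+);", resolve, chars)
```

With parse_entity(b"&#193;BC", "ascii", "cp037"), the file is read as
ASCII, but the reference &#193; is looked up in the EBCDIC variant, where
193 is the letter A.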

The representation might be colors, or sounds, or printed glyphs,
or... But the SGML standard is best specified and understood in terms
of the characters themselves.

> In general, I don't think it's
> reasonable to restrict element and entity names to US-ASCII.

In general (and hence for MIME/SGML), I agree. But for HTML, this is
evidently a reasonable restriction (see recent postings in support of
deployment of Unicode).

Dan