Comments on: "Character Set" Considered Harmful

James Clark (jjc@jclark.com)
Wed, 12 Apr 95 09:38:25 EDT

I read Dan's paper with interest.

Dan has the following definitions:

coded character set
A function whose domain is a subset of the integers, and whose range
is a set of characters.

[Perhaps "non-negative integer" instead of "integer"?]

character encoding
a function whose domain is the set of sequences of octets, and whose
range is the set of sequences of characters over some character
repertoire.

He also says:

the charset parameter on text/* Internet Media Types refers to a
character encoding, whereas an SGML document character set names a
coded character set.

So far I agree.

I would also propose:

transformation format
a function whose domain is the set of sequences of octets, and whose
range is the set of sequences of non-negative integers

With this definition, the composition of a transformation format and a
coded character set is a character encoding. For example, combining a
transformation format of UTF-7 with a coded character set of ISO 10646
yields the MIME charset UNICODE-1-1-UTF-7. (Although UTF-7 is
typically used with Unicode, it could perfectly well be used with any
16-bit coded character set.)
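
As a rough illustration of this composition, here is a minimal Python
sketch. All the names in it (two_octet_big_endian, compose, TOY_CCS) are
hypothetical and chosen only for this example. The transformation format
shown is a toy two-octet scheme rather than UTF-7, and the coded
character set is a three-entry fragment, not ISO 10646.

```python
from typing import Callable, Dict, List

# Types following the definitions above (illustrative names only).
CodedCharacterSet = Dict[int, str]                   # integer -> character
TransformationFormat = Callable[[bytes], List[int]]  # octets -> integers
CharacterEncoding = Callable[[bytes], str]           # octets -> characters


def two_octet_big_endian(octets: bytes) -> List[int]:
    """Toy transformation format: each pair of octets is one integer."""
    if len(octets) % 2 != 0:
        raise ValueError("expected an even number of octets")
    return [(octets[i] << 8) | octets[i + 1]
            for i in range(0, len(octets), 2)]


def compose(tf: TransformationFormat,
            ccs: CodedCharacterSet) -> CharacterEncoding:
    """Compose a transformation format with a coded character set."""
    def encoding(octets: bytes) -> str:
        # Octets -> integers (transformation format),
        # then integers -> characters (coded character set).
        return "".join(ccs[n] for n in tf(octets))
    return encoding


# A tiny, made-up fragment of a coded character set.
TOY_CCS: CodedCharacterSet = {0x48: "H", 0x69: "i", 0xE9: "\u00e9"}

decode = compose(two_octet_big_endian, TOY_CCS)
print(decode(b"\x00\x48\x00\x69"))
```

The point of the sketch is only that the result of compose has the
type of a character encoding: it takes octets and yields characters.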

I think transformation format is an important concept when dealing
with SGML, because it's the thing that the SGML standard is silent on:
whereas the coded character set is specified in the SGML declaration,
the handling of the transformation format is left entirely to the
entity manager.

It also gives us an alternative way to look at Gavin Nicol's
proposal. Given a coded character set, a character encoding
determines a transformation format, provided that for each character
in the character repertoire of the character encoding there is a
unique non-negative integer that is mapped onto that character by the
coded character set (*): the character encoding maps sequences of
octets to sequences of characters and the coded character set can then
be used to map each character back to an integer. Gavin's proposal
fixes the document character set for HTML as ISO 10646, and then uses
the MIME character encoding to tell the entity manager what
transformation format to use. Fortunately, for most (all?) MIME
character encodings, the requirement (*) is met when ISO 10646 is the
coded character set.
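
The derivation described above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the function
name derive_transformation_format is hypothetical, Python's built-in
"latin-1" codec stands in for a MIME character encoding, and the first
256 code positions of ISO 10646 (which coincide with the values Python's
chr() uses) stand in for the full coded character set.

```python
from typing import Callable, Dict, List


def derive_transformation_format(
        encoding: Callable[[bytes], str],
        ccs: Dict[int, str]) -> Callable[[bytes], List[int]]:
    """Derive octets -> integers from a character encoding and a
    coded character set, checking requirement (*) per character."""
    # Collect every integer that the coded character set maps
    # onto each character.
    codes: Dict[str, List[int]] = {}
    for number, char in ccs.items():
        codes.setdefault(char, []).append(number)

    def transformation_format(octets: bytes) -> List[int]:
        result = []
        for char in encoding(octets):
            candidates = codes.get(char, [])
            if len(candidates) != 1:
                # Requirement (*) fails: no integer, or more than one.
                raise ValueError(
                    f"no unique integer for character {char!r}")
            result.append(candidates[0])
        return result

    return transformation_format


# Assumption: a 256-entry fragment standing in for ISO 10646.
ccs_fragment = {n: chr(n) for n in range(0x100)}


def latin1(octets: bytes) -> str:
    """Stand-in character encoding using Python's latin-1 codec."""
    return octets.decode("latin-1")


tf = derive_transformation_format(latin1, ccs_fragment)
print(tf(b"caf\xe9"))
```

If the coded character set mapped two different integers onto the same
character, the lookup would raise an error for that character, which is
the case that requirement (*) rules out.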

Dan also defines:

text entity
a sequence of characters

and says:

An SGML document is a set of entities, one of which is a text
entity called the document entity.

It is certainly true that an SGML text entity represents a sequence of
characters, but I think there's something fundamental missing here:
the entity represents each character by a single non-negative integer,
which is mapped onto the character by the coded character set
described in the document character set section of the SGML
declaration. So I would prefer to say something like:

text entity
a sequence of non-negative integers, each of which is
mapped onto a character by the coded character set
described in the document character set section of the SGML
declaration.

Dan's paper also says:

But the US-ASCII character encoding is actually sufficient to
represent all SGML documents (except those rare documents that use
characters outside the repertoire of ISO-646-IRV for markup).

They may be rare now, but I would suggest that this is at least partly
due to the limitations of SGML tools. In general, I don't think it's
reasonable to restrict element and entity names to US-ASCII. This
scheme would also have problems when non-ASCII characters occur:

- within comments,

- in the content of elements with a declared content of CDATA, or

- in CDATA marked sections

Numeric character references are not recognized in any of these
contexts.
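
To make the limitation concrete, here is a small Python sketch of the
kind of rewriting a US-ASCII-only scheme would presumably rely on. The
function name to_ascii_with_ncrs is hypothetical, and the sketch assumes
that the document character set is ISO 10646, so that the number in each
reference is the value Python's ord() returns.

```python
def to_ascii_with_ncrs(text: str) -> str:
    """Replace each non-ASCII character with a numeric character
    reference of the form &#N; (assumes an ISO 10646 document
    character set, so N is the character's code position)."""
    return "".join(
        c if ord(c) < 128 else f"&#{ord(c)};"
        for c in text
    )


print(to_ascii_with_ncrs("caf\u00e9"))
```

This works where references are recognized. Applied to the text of a
comment, to CDATA element content, or to a CDATA marked section, the
output would be read as the literal characters "&", "#", and so on,
not as the intended character.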

James