Re: Comments on: "Character Set" Considered Harmful

James Clark (jjc@jclark.com)
Tue, 18 Apr 95 14:22:25 EDT

[This discussion is fundamental to MIME/SGML, so I've Cc'ed the
MIME/SGML list as well.]

> Date: Sun, 16 Apr 95 23:44:01 EDT
> From: connolly@w3.org (Dan Connolly)
>
> > Dan also defines:
> >
> > text entity
> > a sequence of characters
> >
> > and says:
> >
> > An SGML document is a set of entities, one of which is a text
> > entity called the document entity.
> >
> > It is certainly true that an SGML text entity represents a sequence of
> > characters, but I think there's something fundamental missing here:
> > the entity represents each character by a single non-negative integer,
> > which is mapped onto the character by the coded character set
> > described in the document character set section of the SGML
> > declaration.
>
> I believe this distinction is artificial and unnecessary. The SGML
> standard specifies how to parse characters, not numbers or bit
> combinations. All the stuff about bit combinations can be removed from
> the SGML standard without changing its meaning.
>
> After all, SGML doesn't specify the representation of entities. The
> fact that SGMLS is a conforming SGML system, and yet it munges the
> "bit combinations" stored in unix text files in a way that is
> mentioned nowhere in the SGML standard shows this.

When I say that a character is "represented" by a bit combination, I'm
not talking about how the character is stored on disk; I'm talking
about the interface between the SGML parser and the entity manager.
The information that the entity manager passes to the parser for each
character is the bit combination of that character.

SGMLS is both an entity manager and an SGML parser. The SGML parser
does not interact directly with Unix text files. SGML allows an
entity manager to map the octets in files onto bit combinations in any
way it finds convenient. (SGML doesn't even require that the system
identifier be treated as a filename at all.)

Why should anybody care about the information passed between the
entity manager and the parser? The reason is interchange. The SGML
standard doesn't specify the operation of the entity manager, but it
does specify the operation of the SGML parser on the information that
the parser receives from the entity manager. So the only reliable way
in general to exchange an entity between diverse SGML systems is to
interchange the information passed between the entity manager and
parser. If I want to send an entity to you, then

- I use my entity manager to map the octets in the file onto the
sequence of bit combinations that my parser would get from my entity
manager;

- I then send that sequence of bit combinations to you;

- you then store the sequence of bit combinations as octets in a file
by doing the reverse of the mapping that your entity manager normally
does to octets in a file in order to get bit combinations to pass to
the parser.

By using this process, I can send you an entity and be guaranteed that
you'll get the same result from your parser as I got from mine without
knowing anything about your entity manager, provided that you have a
conforming SGML system.
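
As a rough sketch of the three steps above (not a description of any
real SGML system), suppose there are two hypothetical entity managers
that store bit combinations differently: OctetManager uses one octet
per bit combination, TwoOctetManager uses two. The class names, the
read/write methods, and the interchange function are all invented for
illustration.

```python
from typing import List


class OctetManager:
    """Hypothetical entity manager: one octet per bit combination."""

    def read(self, octets: bytes) -> List[int]:
        return list(octets)

    def write(self, combos: List[int]) -> bytes:
        return bytes(combos)


class TwoOctetManager:
    """Hypothetical entity manager: two big-endian octets per bit combination."""

    def read(self, octets: bytes) -> List[int]:
        return [int.from_bytes(octets[i:i + 2], "big")
                for i in range(0, len(octets), 2)]

    def write(self, combos: List[int]) -> bytes:
        return b"".join(c.to_bytes(2, "big") for c in combos)


def interchange(sender, receiver, sender_file: bytes) -> bytes:
    # Step 1: the sender maps its stored octets onto bit combinations.
    combos = sender.read(sender_file)
    # Step 2: the bit combinations (not the octets) are what gets sent.
    # Step 3: the receiver applies the reverse of its own read mapping.
    return receiver.write(combos)


mine, yours = OctetManager(), TwoOctetManager()
my_file = b"<p>Hi</p>"
your_file = interchange(mine, yours, my_file)

# The files differ on disk, but both parsers receive the same input.
assert my_file != your_file
assert mine.read(my_file) == yours.read(your_file)
```

The stored octets differ between the two systems, yet each parser is
handed the same sequence of bit combinations, which is the only thing
the standard constrains.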

> It's only a small stretch to imagine an entity manager that could
> represent the characters as pantone colors or sound waves. The
> representation of a character is completely out of scope for SGML.
> The stuff about bit combinations and graphic code sets is just noise.
>
> This is not to be confused with the correspondence between characters
> and numbers via the document character set: even if my document entity
> were represented as a sequence of pantone colors, there would still
> be a color that represents each of '&', '#', '6', '5', and ';', and
> the sequence of those colors would be markup that is equivalent
> to the color corresponding to 'A' (assuming the default SGML declaration).

This seems to be the root of our disagreement. You appear to regard
the mapping from numbers to characters that the document character set
section of the SGML declaration describes as affecting only the
interpretation of numeric character references. I claim that this
mapping is also used to map the bit combinations of the entity onto
characters.
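
To make the claim concrete, here is a minimal sketch in which one
table plays both roles. DOC_CHARSET is an assumed stand-in for the
document character set (numbers 0 to 127 mapped as in ASCII), and the
two function names are hypothetical.

```python
import re
from typing import List

# Assumption: the document character set maps 0..127 as ASCII does.
DOC_CHARSET = {n: chr(n) for n in range(128)}


def to_characters(bit_combinations: List[int]) -> str:
    """Map the entity's bit combinations onto characters."""
    return "".join(DOC_CHARSET[n] for n in bit_combinations)


def resolve_numeric_refs(text: str) -> str:
    """Replace each numeric character reference using the same table."""
    return re.sub(r"&#(\d+);", lambda m: DOC_CHARSET[int(m.group(1))], text)


# Bit combinations the parser receives for the five characters & # 6 5 ;
entity = [38, 35, 54, 53, 59]
markup = to_characters(entity)          # first use of the mapping
assert markup == "&#65;"
assert resolve_numeric_refs(markup) == "A"   # second use of the mapping
```

The mapping is consulted once to recognise the markup at all, and a
second time to interpret the number inside it.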

The document character set is defined in the SGML standard to be:

The character set used for all markup in an SGML document, and
initially (at least) for data.

I do not understand how, with your interpretation, the document
character set can be said to be "used for all markup".

> > So I would prefer to say something like:
> >
> > text entity
> > a sequence of non-negative integers, each of which is
> > mapped onto a character by the coded character set
> > described in the document character set section of the SGML
> > declaration.
>
> I don't understand this definition: do you mean that a text entity is
> a sequence of integers, or a sequence of characters, or a document
> character set, or some combination of them (such as a tuple)? "the SGML
> declaration" of what? Using definite description phrases with no
> explicit scope is something I'd rather avoid.

The term "entity" tends to get used (or at least I find I tend to use
it) to describe two related but slightly different things:

1. The information that the entity manager gives the parser in return
for a system identifier.

2. The thing declared by an entity declaration in an SGML document.

I'll use entity(1) and entity(2) to refer to these.

I would define an entity(1) as just a sequence of bit combinations.
An entity(1) can be used to represent both text and non-SGML data.

An entity(2) is an entity(1) with some additional properties (like a
name).

An SGML text entity is an entity(2) which has (inter alia) the
property that the bit combinations of its entity(1) are mapped onto
characters using the document character set described in the SGML
declaration of the document in which the entity(2) was declared. Thus
an SGML text entity can be considered as a sequence of characters.
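
One possible way to picture these definitions as data structures
follows. The names Entity1, Entity2, is_sgml_text, and text_of are
hypothetical, and the real set of entity(2) properties is larger than
the two shown.

```python
from dataclasses import dataclass
from typing import Dict, List

# entity(1): just a sequence of bit combinations.
Entity1 = List[int]


@dataclass
class Entity2:
    """entity(2): an entity(1) plus additional properties such as a name."""
    name: str
    content: Entity1
    is_sgml_text: bool = True  # False would mean non-SGML data


def text_of(entity: Entity2, doc_charset: Dict[int, str]) -> str:
    """View an SGML text entity as characters via the document character set."""
    if not entity.is_sgml_text:
        raise ValueError("non-SGML data is not mapped onto characters")
    return "".join(doc_charset[n] for n in entity.content)


# Assumed document character set covering only the two numbers used here.
charset = {72: "H", 105: "i"}
greeting = Entity2(name="greeting", content=[72, 105])
assert text_of(greeting, charset) == "Hi"
```

The character sequence is derived from the entity(1) and the
declaration together; neither alone determines it.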

> In any case, I'd argue that the text entity is the characters, not
> their representation. The representation of characters doesn't
> necessarily have to have anything to do with the document character
> set. I can have a document, stored on disk using the ASCII character
> encoding scheme, whose SGML declaration says that the document
> character set is EBCDIC.

You can, but only if the entity manager performs an ASCII->EBCDIC
translation in the course of mapping the octets of the storage object
onto the bit combinations of the entity(1). In other words, the
transformation format includes an ASCII->EBCDIC code translation.
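
A small sketch of such an entity manager, assuming Python's "cp037"
codec as a stand-in for "EBCDIC" (there are many EBCDIC variants, and
nothing here says which one the declaration would name). The function
names are hypothetical.

```python
from typing import List


def entity_manager_read(storage_octets: bytes) -> List[int]:
    """Map ASCII octets of the storage object onto EBCDIC bit combinations."""
    text = storage_octets.decode("ascii")
    # Assumption: cp037 is the EBCDIC variant named as the document character set.
    return list(text.encode("cp037"))


def parser_characters(bit_combinations: List[int]) -> str:
    """Map bit combinations onto characters using the (EBCDIC) document character set."""
    return bytes(bit_combinations).decode("cp037")


stored = b"<p>A</p>"                  # on disk: ASCII
combos = entity_manager_read(stored)  # what the parser receives: EBCDIC numbers

assert combos != list(stored)         # the translation really happened
assert parser_characters(combos) == "<p>A</p>"
```

Without the translation step, the parser would apply the EBCDIC
mapping to ASCII numbers and see different characters.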

James