Formal model of SGML/MIME charsets

Daniel W. Connolly (connolly@hal.com)
Wed, 8 Feb 95 14:13:21 EST

Someone [Murray?] requested that I repost this message to html-wg.

It seems relevant to the discussion of SGML character sets and
encodings.

------- Forwarded Message

Message-Id: <9501190024.AA17188@ulua.hal.com>
To: Erik Naggum <erik@naggum.no>
Cc: James Clark <jjc@jclark.com>, sgml-internet@ebt.com
Subject: Re: Content-types for sgml documents
In-Reply-To: Your message of "18 Jan 1995 17:59:35 GMT."
<19950118.10445@naggum.no>
Date: Wed, 18 Jan 1995 18:24:32 -0600
From: "Daniel W. Connolly" <connolly@hal.com>

In message <19950118.10445@naggum.no>, Erik Naggum writes:
> the character set stuff is
>perhaps the most complicated issues with interchanging SGML documents

I heartily concur!

I find discussions of character sets, especially relating to SGML,
confusing because they often use the term 'character set' to describe
something that I would more likely call a function or mapping.

Let me carefully explain the situation as I see it, in mathematical
terminology that I am familiar with.

Let C be the set of all characters, where a character is defined
(informally) as an atom of human communication.

A character repertoire is a subset of C.

Let N be the set of non-negative integers.

Let O be the set of octets: { 0, 1, 2, ... 255 }

Denote a sequence over some set s as SEQ{s}.

A character set is a mapping from some subset of N onto a character
repertoire; i.e. given a character set, each element of its repertoire
has a representation as a number.

A character encoding is a (partial) mapping from SEQ{O} onto SEQ{r},
for some repertoire r. That is, given a character repertoire and an
encoding for that repertoire, every sequence of characters has a
representation as a sequence of octets.

For example, the character set "ISO 646:1983//CHARSET International
Reference Version (IRV)//ESC 2/5 4/0" maps 65 to 'A', 13 to
carriage-return, and so on. The character encoding known in MIME as
US-ASCII coincides with ISO646 so that

US-ASCII(first . rest) = ISO-646(first) . US-ASCII(rest)
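
A minimal sketch of these definitions in Python, assuming only a small
fragment of ISO 646 IRV is listed (the names ISO_646_FRAGMENT and
us_ascii are hypothetical, chosen for illustration). A character set
is modelled as a dictionary from numbers to characters, and the
encoding as a recursive function over octet sequences:

```python
from typing import Dict, List

# A character set: a mapping from a subset of N onto a repertoire.
# Only a few entries of ISO 646 IRV are listed; this is an illustrative fragment.
ISO_646_FRAGMENT: Dict[int, str] = {13: "\r", 65: "A", 66: "B", 67: "C"}

# The repertoire is the range of the mapping.
REPERTOIRE = set(ISO_646_FRAGMENT.values())


def us_ascii(octets: bytes) -> List[str]:
    """A character encoding: a partial mapping from SEQ{O} to SEQ{r}.

    Follows US-ASCII(first . rest) = ISO-646(first) . US-ASCII(rest).
    Raises KeyError for octets outside the fragment, since the mapping is partial.
    """
    if not octets:
        return []
    first, rest = octets[0], octets[1:]
    return [ISO_646_FRAGMENT[first]] + us_ascii(rest)
```

For instance, us_ascii(bytes([65, 66, 67])) yields the character
sequence A, B, C.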

An SGML document is a set of entities, including a text entity known
as the document entity. A text entity is a sequence of characters.

The document entity contains an SGML declaration (we omit the case of
an implicit SGML declaration for clarity). That SGML declaration
declares the document character set; that is, a character repertoire,
and a corresponding number for each element of that repertoire.

[The SGML standard says that a character is 'encoded' in a bit
sequence of fixed length. I take that to mean that you could write the
character numbers of each of the characters in a text entity in
binary, using the same number of binary digits for each number. Given
that this is independent of the actual machine representation of an
entity, I don't see what it has to do with the price of tea in China.]

As a somewhat contrived, but hopefully illustrative example, consider
an SGML document entity whose document character set is EBCDIC,
encoded as US-ASCII for MIME transport.

(Assume, for the sake of argument, that the character repertoire
defined by EBCDIC (that is, the range of the mapping) is a subset
of the ASCII character repertoire. I'm not sure if this is true
or not. Assume also that 'A' is character number 100 in EBCDIC.)

It looks something like:

Content-Type: application/sgml; charset=US-ASCII

<!SGML "ISO 8879:1986"

CHARSET "... formal name for EBCDIC ..."
...
>
<!DOCTYPE book>
<book>
<x>A B C</x>
<y>&#100; &#101; &#102;</y>

Then, for example, the markup "<!SGML" would be _encoded_ for
transport as octets 0x3c, 0x21, 0x53, 0x47, 0x4d, 0x4c, as per
US-ASCII. But the elements x and y above have _exactly_ the same
content; that is, to understand the markup &#100; we consult the
_document character set_, EBCDIC, not the MIME charset= parameter.

So the above SGML declaration in the above MIME body can be translated
to a sequence of characters and interpreted to discover the document
character set; then, the sequence of characters can be rendered in the
system's representation of a text entity in the EBCDIC character set.
The resulting representation can be consumed by the system's entity
manager, and the document can be parsed.
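
The two steps can be sketched in Python as follows. This is a toy
illustration, not an SGML parser: DOC_CHARSET is a hypothetical table
that follows the assumption above that 'A' is character number 100
(it is not the real EBCDIC table), and resolve_char_refs and
element_content are made-up helper names.

```python
import re

# Hypothetical document character set, following the assumption in the
# message that 'A' is character number 100. This is NOT the real EBCDIC table.
DOC_CHARSET = {100: "A", 101: "B", 102: "C"}


def resolve_char_refs(text, doc_charset):
    """Replace each numeric character reference &#N; with the character
    that the document character set assigns to number N."""
    return re.sub(r"&#(\d+);", lambda m: doc_charset[int(m.group(1))], text)


def element_content(text, name):
    """Return the content of the first <name>...</name> element.

    A toy regular-expression matcher, not an SGML parser.
    """
    pattern = r"<%s>(.*?)</%s>" % (name, name)
    return re.search(pattern, text, re.S).group(1)


# Step 1: the MIME charset= parameter turns transport octets into characters.
body = b"<x>A B C</x>\n<y>&#100; &#101; &#102;</y>\n"
text = body.decode("ascii")

# Step 2: the document character set, not the MIME charset, resolves &#N;.
x = element_content(text, "x")
y = resolve_char_refs(element_content(text, "y"), DOC_CHARSET)
```

Under these assumptions x and y end up holding the same string, which
is the point of the example: the numeric references are looked up in
the document character set, never in the transport encoding.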

In fact, as James Clark explained:

> no character is allowed
>in the SGML declaration that does not have a representation in ASCII.
>This follows from the requirement in clause 13 (at 451:6-8) that only
>markup characters (in the reference concrete syntax) and minimum data
>characters are allowed in the SGML declaration.

Hence every SGML declaration can be represented as a MIME body with
content type "text/plain; charset=US-ASCII". Not every _document_ can
be represented as such; some documents will have to have their
declaration in a separate body part from the prologue and instance.

Now... the proposal:

In practice, I think it will be too costly to require MIME/SGML user
agents to interpret SGML declarations. I suggest we take advantage of
the fact that a MIME charset= designation determines a character
repertoire, and adopt a convention that maps each character repertoire
to a document character set.

For example, if the charset= parameter of an sgml body is US-ASCII,
then we use the ISO character set with the matching character
repertoire: ISO646-IRV. Any sgml body tagged charset=US-ASCII must
have ISO646-IRV as its document character set.

Unicode-1-1-UCS-2, Unicode-1-1-UTF-8, and Unicode-1-1-UTF-7 (did I get
those right?) all indicate the same character repertoire; any sgml
body so tagged must have a document character set of ISO-10646 (did I
get that right?).
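
As a sketch, the convention amounts to a lookup table keyed on the
charset= parameter. The labels below are copied from this message as
written, so they carry the same uncertainty; the names
DOC_CHARSET_FOR_MIME and document_charset are hypothetical.

```python
from typing import Optional

# Proposed convention: MIME charset= label -> required document character set.
# Labels are taken from the message as written and may not be the exact
# registered names.
DOC_CHARSET_FOR_MIME = {
    "US-ASCII": "ISO646-IRV",
    "UNICODE-1-1-UCS-2": "ISO-10646",
    "UNICODE-1-1-UTF-8": "ISO-10646",
    "UNICODE-1-1-UTF-7": "ISO-10646",
}


def document_charset(mime_charset: str) -> Optional[str]:
    """Return the document character set implied by a MIME charset label,
    or None if the convention does not cover that label.

    MIME charset names are case-insensitive, so the label is upper-cased
    before the lookup.
    """
    return DOC_CHARSET_FOR_MIME.get(mime_charset.upper())
```

A user agent following this convention would call
document_charset("us-ascii") and get "ISO646-IRV" without ever reading
the SGML declaration.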

Does this make sense? I hope so. If it doesn't, I'm going to have to
start all over learning about character sets in SGML and MIME.

Dan

------- End of Forwarded Message