Re: HTML/SGML/charsets

Dan Connolly (connolly@w3.org)
Mon, 3 Apr 95 08:49:53 EDT

Joe English writes:
>
> If MIME agents are allowed to translate message bodies from
> one character set to another (are they? I don't know),
> then this may cause a problem, since all numeric character references
> would have to be translated as well, and MIME does not know about
> numeric character references.

You have missed a subtlety: In the language of the 03 draft I just
sent out, MIME knows about character encodings, not character sets.

Consider a document entity, i.e. a sequence of characters
entity = decode(US-ASCII, octets)

A MIME user agent might translate a document from US-ASCII to EBCDIC
character encoding, so that we have:

decode(US-ASCII, octets) = entity = decode(EBCDIC, octets')

The characters of the document remain the same; hence the repertoire
would not change, and hence the document character set need not change,
and numeric character references remain consistent. Hence the markup
&#65; represents an 'A' even though the &, #, 6, 5, and ; characters are
encoded using EBCDIC.
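
A minimal sketch of that re-encoding in Python, for illustration only.
The choice of cp037 as "the" EBCDIC encoding is my assumption (it is one
of several EBCDIC code pages), and decode() here is just a thin wrapper
named to match the notation above:

```python
def decode(encoding, octets):
    """Map a sequence of octets to a sequence of characters."""
    return octets.decode(encoding)

# The document entity: five characters spelling a numeric character reference.
entity = "&#65;"

# The same characters under two character encodings.
octets = entity.encode("ascii")        # US-ASCII octets
octets_prime = entity.encode("cp037")  # EBCDIC octets (cp037 is an assumed code page)

# The octets differ, but both decode to the same characters, so the
# reference still names character number 65 in the document character set.
assert octets != octets_prime
assert decode("ascii", octets) == entity == decode("cp037", octets_prime)
```

Nothing inside the entity had to be touched; only the octets changed.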

To translate from one character set to another at the SGML level,
we'd have:
entity = lookup(ISO Latin 1, numbers)
(where lookup maps a sequence of numbers to a sequence of characters by
looking each number up in the given character set).
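
As a rough sketch, lookup could be written like this in Python. Modeling
a character set as a dict from numbers to characters, and building the
ISO Latin 1 table from Python's "latin-1" codec, are my assumptions for
the sake of the example:

```python
def lookup(charset, numbers):
    """Map a sequence of character numbers to a sequence of characters."""
    return "".join(charset[n] for n in numbers)

# ISO Latin 1 as a number -> character table (built from the codec for brevity).
ISO_LATIN_1 = {n: bytes([n]).decode("latin-1") for n in range(256)}

entity = lookup(ISO_LATIN_1, [72, 105])
assert entity == "Hi"

# Character number 200 in ISO Latin 1 is capital E with grave accent.
assert lookup(ISO_LATIN_1, [200]) == "\u00c8"
```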

An SGML system might translate as follows:

lookup(ISO Latin 1, numbers) = entity = lookup(ISO10646, numbers')

but then numeric character references in the entity might change
meaning; e.g. &#200; originally referred to character 200 in the ISO
Latin 1 character set, which might not be the same as character 200 in
the ISO10646 character set. So, assuming the character repertoire of
the target character set is a superset of that of the source character
set, we can change the numeric character references in the entity to
refer to the right character, using the number given in the
destination character set:

entity' = rewrite(entity, ISO Latin 1, ISO10646)
where:
alpha &#N'; beta = rewrite(alpha &#N; beta, CS1, CS2)
iff CS1(N) = CS2(N')

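A hedged sketch of rewrite in Python. Here a character set is modeled
by two hypothetical helpers: a function from number to character for
the source, and a function from character to number for the target.
ISO 10646 is approximated by Python's chr/ord, and KOI8-R is brought in
only as an assumed example of a source set whose numbers disagree with
ISO 10646:

```python
import re

def rewrite(entity, char_of_cs1, number_in_cs2):
    """Renumber every &#N; in entity from character set CS1 to CS2.

    Assumes the repertoire of CS2 is a superset of that of CS1.
    """
    def renumber(match):
        n = int(match.group(1))
        n_prime = number_in_cs2(char_of_cs1(n))  # CS1(N) = CS2(N')
        return "&#%d;" % n_prime
    return re.sub(r"&#([0-9]+);", renumber, entity)

def latin1_char(n):
    return bytes([n]).decode("latin-1")

def koi8r_char(n):
    return bytes([n]).decode("koi8_r")

iso10646_number = ord  # character -> ISO 10646 number (approximation)

# ISO Latin 1 numbers coincide with ISO 10646, so nothing changes here.
assert rewrite("caf&#233;", latin1_char, iso10646_number) == "caf&#233;"

# KOI8-R number 193 names a Cyrillic letter, so the reference is renumbered.
assert rewrite("x &#193; y", koi8r_char, iso10646_number) != "x &#193; y"
```

Text outside the references (alpha and beta above) passes through
unchanged; only the numbers are replaced.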
Since we're about to deploy a convention that maps MIME charset
parameters to document character sets, and this convention will
probably assign a different character set to ISO-2022-JIS and
Unicode-UTF-1-1, a complete conversion between those two encodings
would look like:

decode(Unicode-UTF-1-1, octets')
= rewrite(decode(ISO-2022, octets), JIS, Unicode)

This is clearly beyond the scope of MIME. But the change of document
character sets is only motivated by our mapping convention. It is
otherwise unnecessary -- the document character set could remain JIS
even though the encoding were Unicode-UTF-1-1.

For reference:

character
An atom of information, for example a letter or a number.
Graphic characters have associated glyphs, whereas control
characters have associated processing semantics.

character encoding
A mapping from sequences of octets to sequences of characters
from a character repertoire; that is, a sequence of octets and a
character encoding determine a sequence of characters.

character number
A number that determines a character, as per some character set.

character repertoire
A finite set of characters. The range of the mapping defined
by a character set.

character set
A mapping of a subset of the integers onto a character
repertoire. That is, for some set of integers (usually of
the form {0, 1, 2, ..., N} ), a character set and an integer
in that set determine a character. Conversely, a character
and a character set determine the character's number (or,
in rare cases, a few character numbers).

Dan