Charset parameter [Was: Tentative Agenda for IETF meeting ]

Daniel W. Connolly (connolly@hal.com)
Fri, 2 Dec 94 14:34:34 EST

In message <94Dec2.094133pst.2757@golden.parc.xerox.com>, Larry Masinter writes:
>> The group has agreed that the current usage of HTML 2.0 addresses
>> the issue of international character sets rather poorly. However,
>> in order to give proper attention to this important issue, we have
>> decided to address it in future versions of the spec.
>
>You know, I've come to the conclusion that 'international character
>sets' are relatively easy to handle by requiring an additional
>"charset" parameter to the "text/html" MIME type. E.g.,
>
>"text/html; charset=unicode-1-1-utf-7" would be a way of saying 'a
>HTML document using unicode' (as per RFC 1642), while "text/html;
>charset=iso-2022-kr" would identify a HTML document that uses the
>Hangul encoding scheme for Korean as per RFC 1557. The default charset
>depends on the transport mechanism; for HTTP, the default might well
>be "text/html; charset=iso-8859-1".
>
>I'm considering proposing in the HTTP working group adding a
>"Accept-charset: " header for clients to send to servers which
>charsets (other than US-ASCII and ISO-8859-1) that they are willing to
>accept; of course, it is mandatory that servers identify the charset
>of any text/* document which isn't the default; however, this is no
>longer a HTML issue.

I agree that this is the right way to handle the issue of character
_encodings_ in HTML documents. But there is more to it than just that.

Let me try to elaborate the situation as I see it.

Discussing ISO character sets can be tedious and confusing, since the
term "character set" generally refers to a function, not a set, in
conventional math terminology. Let me lay out some terms:

Character: informally: an atom of information; a symbol used in human
communication

Byte: formally: any element of the set {0, 1, 2, ..., 255 }

Character repertoire: formally: a finite set (of characters)

Character set: formally: an invertible function f: I -> R where R is
a character repertoire, and I is a subset of the non-negative
integers.

Character encoding: formally: an invertible function f: BS -> CS where
BS is a set of byte sequences, and CS is a set of character sequences.

SGML document entity: formally: a sequence of characters, consisting
of an SGML declaration, a prologue, and an instance.

MIME body: formally: a sequence of bytes

MIME body part: formally: a sequence of bytes, consisting of headers
and a MIME body, whose content-type and content-transfer-encoding are
given by the headers.
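
To make the difference between a character set and a character
encoding concrete, here is a minimal Python sketch. The three-entry
table and the helper names are illustrative assumptions, and Python's
built-in codecs stand in for the encodings; none of this comes from
the MIME or SGML specs.

```python
# A character set in the sense above: an invertible map from integers
# to characters. Only a toy three-entry subset of ISO-8859-1 is shown.
CHARSET = {60: "<", 65: "A", 246: "\u00f6"}
INVERSE = {ch: n for n, ch in CHARSET.items()}

# A character encoding: an invertible map between byte sequences and
# character sequences. Python's codecs play that role here.
def decode(data, encoding):
    return data.decode(encoding)

def encode(text, encoding):
    return text.encode(encoding)

text = "A\u00f6"
one_byte = encode(text, "iso-8859-1")  # one byte per character
two_byte = encode(text, "utf-16-be")   # two bytes per character

# Different byte sequences, same character sequence after decoding.
assert one_byte != two_byte
assert decode(one_byte, "iso-8859-1") == decode(two_byte, "utf-16-be") == text
```

The character set never mentions bytes, and the encoding never
mentions character numbers; that separation is what the rest of this
message relies on.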

The MIME RFC says that the charset parameter is actually a character
encoding:

    This RFC specifies the definition of the charset parameter for the
    purposes of MIME to be a unique mapping of a byte stream to glyphs, a
    mapping which does not require external profiling information.

An SGML document begins with an SGML declaration. The SGML declaration
is written using the ISO-646-IRV character _repertoire_. So it has to
start with the characters '<!SGML', but nobody said that those
characters have to be encoded into bytes using the ISO-646-IRV
character encoding. They might be encoded with a double-byte character
encoding, or EBCDIC, or any other scheme.

It is odd, but true, that the SGML standard doesn't prescribe how you
take a sequence of bytes off the disk or network and translate it into
a sequence of characters for parsing.

As a consequence, there is no mechanical way to take a sequence of
bytes which represent an SGML document and determine the encoding you
should use to turn it into a sequence of characters.

Happily, we see that the MIME spec and the SGML spec are orthogonal:
MIME tells you how to take the bytes off the wire and turn them into
characters. SGML tells you how to parse the resulting sequence of
characters.
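
The two-step split can be sketched in Python roughly as follows. The
CODECS table mapping MIME charset labels to Python codec names is a
hypothetical stand-in, as are the function names; the ISO-8859-1
default follows the suggestion quoted above for HTTP.

```python
from email.message import Message

# Hypothetical table from MIME charset labels to Python codec names.
CODECS = {"unicode-1-1-utf-7": "utf-7", "iso-8859-1": "iso-8859-1"}

def charset_label(content_type, default="iso-8859-1"):
    """Read the charset parameter out of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return str(msg.get_param("charset", default)).lower()

def mime_decode(body, content_type):
    """Step 1 (MIME): turn the bytes off the wire into characters."""
    return body.decode(CODECS[charset_label(content_type)])

text = "<title>\u00f6</title>"
wire = text.encode("utf-7")
chars = mime_decode(wire, "text/html; charset=unicode-1-1-utf-7")

# Step 2 (SGML) would parse `chars`; the parser never sees the bytes.
```

Nothing in step 2 needs to know which encoding was used in step 1.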

If the SGML document begins with an explicit SGML declaration, that
declaration will include a specification of the document character
set, which governs the numeric character reference markup, e.g. &#60;.

So it's possible to take an SGML document whose document character set
is ISO-8859-1, and represent it on the disk or network using a
16-bit-per-character unicode encoding. The characters '&#246;' would
occupy twelve bytes, but they would still indicate an o-umlaut as per
ISO-8859-1, and not character 246 from unicode. Conversely, even
though the encoding allows cyrillic and kanji characters to be
represented, there would be no way to refer to them using numeric
character references.
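
As a rough illustration, the Python sketch below encodes the same
markup two ways and shows that the numeric character reference comes
out the same after decoding. The numeric_refs helper is hypothetical,
and the "utf-16-be" codec is only a stand-in for a 16-bit-per-character
encoding.

```python
import re

def numeric_refs(chars):
    """Return the numbers of all numeric character references in chars."""
    return [int(n) for n in re.findall(r"&#([0-9]+);", chars)]

markup = "&#246;"

# Two different byte representations of the same characters.
as_latin1 = markup.encode("iso-8859-1")
as_utf16 = markup.encode("utf-16-be")

# Once decoded, the parser sees identical characters either way.
refs_a = numeric_refs(as_latin1.decode("iso-8859-1"))
refs_b = numeric_refs(as_utf16.decode("utf-16-be"))

# Assuming the document character set is ISO-8859-1, number 246 is
# looked up there, whatever encoding carried the markup.
resolved = bytes([refs_a[0]]).decode("iso-8859-1")
```

The lookup in the last line depends only on the document character
set, not on how the bytes were laid out on the disk or network.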

If you're following me so far, you might realize that the MIME body part:

Content-Type: text/html; charset=unicode-1-1-utf-7

<!doctype html "-//IETF//DTD HTML//EN">
<html>
<title>What character is this?: &#2000;</title>
</html>

doesn't specify a document character set. If we infer the same SGML
declaration for this document as we have for HTML documents in the
past, the document character set would be ISO-8859-1, and the markup
'&#2000;' would be nonsense.

Hence we should interpret charset=unicode-1-1-utf-7 to mean a change
in the document character set in the SGML declaration as well as the
character encoding.
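
A small Python sketch of why the document character set matters for
'&#2000;'. The resolve function and the two charset labels are
hypothetical, and Python's chr() is used as a stand-in for the unicode
character set (it reflects current Unicode, not Unicode 1.1).

```python
def resolve(number, document_charset):
    """Map a numeric character reference to a character, or None if
    the document character set has no such number."""
    if document_charset == "iso-8859-1":
        # ISO-8859-1 has no character numbers above 255.
        if 0 <= number <= 255:
            return bytes([number]).decode("iso-8859-1")
        return None
    if document_charset == "unicode":
        # Assumption: chr() stands in for the unicode character set.
        return chr(number) if 0 <= number <= 0x10FFFF else None
    raise ValueError("unknown document character set: " + document_charset)

# Under the inferred ISO-8859-1 declaration, &#2000; is nonsense.
print(resolve(2000, "iso-8859-1") is None)

# If the charset parameter also changes the document character set,
# the same reference can be resolved.
print(resolve(2000, "unicode") is not None)
```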

I hope that wasn't all terribly redundant. It seems like it needed
saying.

Having thought through it carefully, I agree that the business of
supporting documents which use a character set and encoding other than
ISO-8859-1 throughout is a tractable problem. So we could support
documents using the western-european writing system, documents using
the hebrew writing system, documents using Kanji, etc. in relatively
short order using the Accept-charset: mechanism you describe above.

In fact, it's backwards compatible. Existing browsers already
implement the default case. We could squeeze it into 2.0 just like we
did the ICADD stuff, if someone is willing to commit to the editorial
work.
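
For what it's worth, the server side of such a negotiation might look
something like the Python sketch below. The header is only a proposal
at this point, so the comma-separated syntax, the function names, and
the selection rule are all assumptions; the two defaults are the ones
named in the proposal quoted above.

```python
# Charsets every client is assumed to accept without listing them.
DEFAULTS = ("us-ascii", "iso-8859-1")

def acceptable_charsets(accept_charset=None):
    """Charsets the client accepts: anything listed, then the defaults."""
    listed = []
    if accept_charset:
        listed = [c.strip().lower() for c in accept_charset.split(",") if c.strip()]
    return listed + [d for d in DEFAULTS if d not in listed]

def choose_variant(available, accept_charset=None):
    """Pick the first acceptable charset the server can supply, or None."""
    have = {c.lower() for c in available}
    for charset in acceptable_charsets(accept_charset):
        if charset in have:
            return charset
    return None

# An existing browser sends no header and still gets the default.
print(choose_variant(["iso-8859-1", "iso-2022-kr"]))
```

A client that sends no header falls through to the defaults, which is
the backwards-compatible case.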

But some folks are not happy with just supporting other character
sets. They're pushing for multi-lingual documents. Someone posted an
'acid test' to some forum that I can't recall just now. But the acid
test was: Can I cite the bible, the koran, and (some asian work that I
can't recall) all in the same paragraph?

Certainly that's the sort of thing we should push off to HTML 3.0 (or
out of HTML altogether, and into compound documents...).

Dan