Re: partial draft: "Character Set" Considered Harmful

Dan Connolly (connolly@w3.org)
Wed, 12 Apr 95 08:30:48 EDT

Gavin Nicol writes:

> So far the 2 things you have against my proposal are:
>
> >* your proposal requires all HTML documents to have a document
> >character set of ISO 10646. I believe this is gross overspecification.
> >
> >* you use the same lack of precision in your discussion of characters
> >and their encodings as all the other ISO documents that got us
> >in the current mess.
>
> Opinion, and something I would also agree with. For the latter, I
> would just like to say that I did note that it was a very early draft,
> and stated that the wording should not be considered, only the
> concepts.

I believe we are reaching agreement. As to the first item, it looks like
no one has a requirement to exchange HTML documents
using anything but ISO10646 or its subsets as a document character
set.

*** Does everyone understand that they're giving up a
*** certain amount of flexibility here?

Hence your proposal, to universally adopt Unicode, is sufficient,
and simpler than my proposal to let the document character set
vary as a function of the encoding. In the interest of "everything
should be as simple as it can be, and no simpler," I support it.

> Please explain to us all what will be necessary to build a system that
> supports more than 10 coded character sets and encodings using a
> document character set implied from the charset= parameter.

Actually, I'd prefer to defer most of those issues to your paper :-)

The "Character Set" Considered Harmful document does not set
out to solve all the problems of deploying multiple character sets
on the web. It merely proposes a terminology for talking about
such specifications and proposals, and attempts to motivate,
justify, and explain the terminology. It supports the conjecture
that this terminology is internally consistent and sufficiently expressive
for the problems we want to talk about.

But until folks can read and understand the draft, it needs to be
refined. As to my second objection:

> For the latter, I
> would just like to say that I did note that it was a very early draft,
> and stated that the wording should not be considered, only the
> concepts.

It is precisely because of the wording that I am unable to evaluate
the concepts.

In the interest of improving both documents, I'll rephrase and
answer your questions in the terminology that I have proposed:

> >> Your proposal will make it impossible to use numeric character
> >> references without facing the risk of them having different meanings
> >> in different browsers,
> >
> >Only as a result of defects in browsers. Browsers whose implementations
> >are consistent with the model will interpret text representations
> >consistently. Arguments (other than argument by assertion, as above)
> >to the contrary are welcome.
>
> Say I have a document in ISO-8859-1, and another in UNICODE-1-1-UTF-8,

Are you referring to the coded character set ISO-8859-1, or the character
encoding scheme ISO-8859-1? Your question is not clear. I'll infer that
we have a document D1, whose document character set is the coded
character set ISO-8859-1, and whose document entity DE1 is represented
as a sequence of octets OS1 using the ISO-8859-1 character encoding.

DE1 = ISO-8859-1(OS1)

Unicode-1-1-UTF-8 is a character encoding. So I infer that we have a
second document D2, whose document character set is the coded
character set ISO 10646, and whose document entity DE2 is represented
as a sequence of octets OS2 using the Unicode-1-1-UTF-8 character
encoding.

DE2 = Unicode-1-1-UTF-8(OS2)

> and they both contain &#1213; how should this be interpreted?

The sequence of characters &#1213; in DE1 is an error: 1213 is not
in the domain of the coded character set ISO-8859-1.

The sequence of characters &#1213; in DE2 refers to ISO-10646(1213),
that is, the character at code position 1213 in the coded character
set ISO10646.
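
The two answers above can be sketched in Python. This is only an
illustration under stated assumptions: the names resolve_ncr,
ISO_8859_1_DOMAIN, and ISO_10646_DOMAIN are hypothetical, a document
character set is modeled by nothing more than its set of code positions,
and Python's own Unicode numbering stands in for ISO 10646.

```python
# Hypothetical model: a document character set is described only by the
# set of code positions it defines (its domain).
ISO_8859_1_DOMAIN = range(0, 256)        # assumption: positions 0..255
ISO_10646_DOMAIN = range(0, 0x110000)    # assumption: Python's Unicode range


def resolve_ncr(number, domain):
    """Return the character that &#number; denotes, or raise ValueError.

    chr() uses Python's Unicode numbering; that is adequate here only
    because both modeled sets agree with it on the positions they define.
    """
    if number not in domain:
        raise ValueError(
            f"&#{number}; is outside the document character set's domain"
        )
    return chr(number)


# DE2: document character set ISO 10646, so &#1213; is a valid reference.
char_in_de2 = resolve_ncr(1213, ISO_10646_DOMAIN)

# DE1: document character set ISO-8859-1, so &#1213; is an error.
try:
    resolve_ncr(1213, ISO_8859_1_DOMAIN)
    de1_is_error = False
except ValueError:
    de1_is_error = True
```

The point of the sketch is that the result depends on the document
character set alone, never on the octets used to transmit the document.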

> How about if I have a document in ISO-8859-1, another in
> UNICODE-1-1-UTF-8, and another in X-JIS0201, and they all contain
> &#170;?

By "I have a document in X," I take it you mean "I have a document whose
document entity is represented as a sequence of octets using the character
encoding X." So:

DE3 = ISO-8859-1(OS3)
DE4 = UNICODE-1-1-UTF-8(OS4)
DE5 = X-JIS0201(OS5)

To interpret the markup &#170; we need to know the document character
set of the respective document. In general, the sender will have to
attach, include, or reference an SGML declaration that will specify the document
character set. In the specific case of HTML, I have suggested that we
adopt a convention that the charset=X parameter, i.e. the character
encoding, determines the SGML declaration (or at least the document
character set part of it.)

So the document character set of DE3 is the coded character set ISO-8859-1.
&#170; refers to ISO-8859-1(170). The document character set
of DE4 is ISO10646, so &#170; is ISO10646(170). As we learned in recent
messages on this list, the two character sets agree on 1-255, i.e.:

For all 0 < x < 256, ISO-8859-1(x) = ISO10646(x)
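
That identity can be spot-checked in Python. This is a sketch, not a
proof: it assumes Python's built-in "latin-1" codec is a faithful stand-in
for the coded character set ISO-8859-1, and that Python's chr() numbering
matches ISO 10646 on these positions. The name agrees_with_iso10646 is
hypothetical.

```python
def agrees_with_iso10646(x):
    """True if position x names the same character in both sets."""
    # bytes([x]).decode("latin-1") is ISO-8859-1(x) under the assumption
    # above; chr(x) is the character at position x in Python's Unicode.
    return bytes([x]).decode("latin-1") == chr(x)


# Check every position in the range stated above: 0 < x < 256.
all_agree = all(agrees_with_iso10646(x) for x in range(1, 256))

# The specific position from the question: &#170;
char_170 = bytes([170]).decode("latin-1")
```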

I don't yet know what document character set X-JIS0201 would indicate.
But let's call it CS5, and let's assume its domain includes 170. Then
in DE5, &#170; refers to CS5(170).

> If I want to change the encoding what needs to happen?

Theorem: If
R1 subset R2
CE1 : SEQ(octet) -> SEQ(R1) is onto
CE2 : SEQ(octet) -> SEQ(R2) is onto
DE = CE1(OS1)
Then there exists OS2 such that
DE = CE2(OS2)
Proof:
DE is in SEQ(R1), and R1 subset R2, so DE is in SEQ(R2).
CE2 is onto, so for any member of SEQ(R2), in particular DE,
there exists a sequence of octets OS such that CE2(OS) = DE.
Let OS2 = OS. QED.

If we let CE2 = Unicode-1-1-UTF-8, and CE1 = X-JIS0201,
R2 = the range (repertoire) of ISO10646, and R1 = the range (repertoire)
of CS5, then provided R1 subset R2 (which I believe it is, since ISO10646
includes "everything") the above theorem shows that there
is a sequence of octets OS5' such that DE5 = Unicode-1-1-UTF-8(OS5').

This is an existence proof. The constructive proof, i.e. source code
to do the translation, is left as an exercise to the reader (hint: check
the GNU recode utility, or James Clark's SP).
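
For what it's worth, here is one minimal sketch of that exercise in
Python. It assumes Python's built-in codecs as stand-ins for the
character encodings: "latin-1" plays the role of CE1 and "utf-8" the role
of CE2, since I don't know of a codec that corresponds exactly to
X-JIS0201. The function name transcode is hypothetical.

```python
def transcode(octets, source_encoding, target_encoding):
    """Re-encode a document entity without changing its characters."""
    # Decoding recovers the document entity DE = CE1(OS1),
    # a sequence of characters.
    document_entity = octets.decode(source_encoding)
    # Encoding produces OS2 with CE2(OS2) = DE. It raises
    # UnicodeEncodeError when the target repertoire lacks a character,
    # i.e. when R1 is not a subset of R2.
    return document_entity.encode(target_encoding)


os1 = b"caf\xe9"                          # four octets in ISO-8859-1
os2 = transcode(os1, "latin-1", "utf-8")  # same characters, new octets

# The converse direction can fail: position 1213 is not in ISO-8859-1.
try:
    transcode(chr(1213).encode("utf-8"), "utf-8", "latin-1")
    converse_failed = False
except UnicodeEncodeError:
    converse_failed = True
```

The octets change but the document entity does not, so numeric character
references inside it keep their meaning as long as the document character
set is left alone.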

> What
> needs to happen if I change the coded character set, as well as the
> encoding?
> How should we deal with documents using ISO-2022-JP?
>

I could go on in similar detail to formalize this question and give
an answer, but I'm getting tired of this. If anybody _really_ wants
me to, I'll do it. But I think I've proved my point.

The "Character Set" Considered Harmful document needs to be expanded (and
in fact I plan to use Gavin's questions and my answers as examples
in the draft) so that folks can understand the model.

But I maintain: the model is internally consistent, and sufficiently
expressive.

It is also consistent with the proposal that everybody use Unicode.
Using Unicode is a sufficient, but not necessary mechanism. In
the MIME-SGML world, I don't believe "Everyone must use Unicode"
is an acceptable solution. For HTML, it appears to be.

You have a nice day too!

Dan

p.s. I'm in Germany, and I'm somewhat bandwidth-challenged. So it
may be a week or two before I can revise the "Character Set" Considered
Harmful document.