Re: ISO/IEC 10646 as Document Character Set

Martin J Duerst (
Thu, 4 May 95 05:43:40 EDT

Glen Adams wrote:

> From: Larry Masinter <>
> Date: Wed, 3 May 1995 19:43:30 PDT
> > 1. We need language in the RFC that specifies what to do in the default
> > case that the CHARSET parameter is not present in the Content-Type
> > response.

I don't see such big problems for interoperability and upgrading with respect
to Japan. It would only mean that the present solutions in Japan do not
conform to the standard. But for a while, the Japanese versions of
browsers would keep a switch to change from ISO 8859-1 to "detect
any of the three Japanese encodings", and everything would probably
work all too well (i.e. too well to create enough pressure to change).

> Actually, I disagree, if by "the RFC" you mean the HTML RFC. I think
> this is transport dependent. If you get HTML by SMTP mail, the default
> should be what the mail transport says it is (US-ASCII) while if you
> get something by HTTP the default can be something else.
> How about if we don't try to solve this problem in the HTML working
> group.
> You are partly correct in throwing this out of the HTML arena. However,
> the HTML spec *should* say something about any assumptions (or lack of
> assumptions) about the character encoding used in the representation
> of the document entity or the collection of entities which make up an
> HTML document.

I see one SERIOUS problem: How does the server tell what encoding a document
is in? Automatic detection is difficult for a wide range of codings. Limiting
solutions to one encoding per server may be too restrictive for the time
being, and would hamper future developments, e.g. migration of sites from
mainly JIS/SJIS/EUC to ISO 10646. Using filename extensions would cause big
problems on systems such as MS-DOS, and we would have to devise something
like a standard for this. Putting the info somewhere else, e.g. in separate
files, would create a big burden for the writers of HTML documents.

Note that this is a general problem. Basically, every kind of system that
treats several encodings has to deal with it, but there are only very few
attempts at solutions so far.

The most practicable approach I have seen so far is the one that was
presented and discussed at the recent Omega workshop in Geneva (Omega is
the Unicode extension/redesign of TeX). It works as follows, in very
non-standard language:

a) You know what character a document starts with. Usually, you
choose the comment-starting character for this, and require
that the file start with a comment. In the case of HTML, that would
most probably be "<". In that way, you can distinguish between
ASCII-based (starts with '3C'), EBCDIC-based ('4C'), and Unicode-based
("00 3C") documents (and maybe other major coding systems).
b) You look out for a charset identification. The letters of such an
identification are restricted by design to a very small set of
characters so that you can identify them all once you know which
of the variants identified in a) you are dealing with.
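Step a) above can be sketched in a few lines. This is my own illustration,
not code from the workshop: it classifies a document's major coding family
from its first bytes, assuming the document really does begin with "<"
(3C in ASCII, 4C in EBCDIC, 00 3C in big-endian Unicode).

```python
def guess_family(data: bytes) -> str:
    """Guess the major coding family of a document that starts with '<'."""
    if data[:2] == b"\x00\x3c":
        return "Unicode-based"   # "<" encoded as 00 3C (UCS-2/UTF-16, big-endian)
    if data[:1] == b"\x3c":
        return "ASCII-based"     # "<" is 3C in ASCII and its supersets
    if data[:1] == b"\x4c":
        return "EBCDIC-based"    # "<" is 4C in EBCDIC
    return "unknown"             # none of the recognized start bytes
```

A server could run this on the first bytes of a file before deciding how to
decode the rest and look for the charset identification of step b).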

Declaring the encoding inside the document itself would also be very
reasonable from the point of view of the user/writer, as long as (s)he
doesn't use a sophisticated writing tool that takes care of this
automatically. It is clearly more reasonable than any other solution I can
imagine at the moment that allows the server to identify the document charset.

The situation is somewhat special for mail: there, the MIME header
entries are not part of the mail content, but they are part of the mail
document as a whole, so the document still speaks for itself.
In HTTP/HTML, we have separated the MIME headers from the documents,
but we still need a way for the document to identify its own coding.

Regards, Martin.

---- Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: