Re: Revised language on: ISO/IEC 10646 as Document Character Set

Erik van der Poel (erik@netscape.com)
Wed, 10 May 95 00:34:07 EDT

>Discussion of Unicode has
>been happening in fits and starts for months, but nobody
>has come close to making a proposal that is _more_ comprehensive.
>
>(I'd also suggest that the technical issues differ from MIME,
>because we need a scheme that will work well with SGML.)

Hmmm... I'm not really sure what you mean by "work well with SGML",
but my guess is that you're referring to the following.

SGML parsers parse documents in the "document character set", and
some external process must first convert the document from whatever
character encoding it uses to the document character set. Each character
in the document character set must have a number and must have an entry
in the XXX table (whatever you guys call it), even if only to say "UNUSED".
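
The two steps above (convert to the document character set, then
parse) can be sketched roughly as follows. This is only an
illustration, assuming the document character set is Unicode/10646
and letting Python's built-in codecs stand in for the "external
process". The names to_document_chars and resolve_numeric_refs are
made up for this example and do not come from any SGML tool.

```python
import re

def to_document_chars(raw: bytes, charset: str) -> str:
    """Convert bytes in the transport charset into document-character-set
    text (here: a Python str of Unicode code points)."""
    return raw.decode(charset)

# Decimal numeric character references such as &#233;
_NUMERIC_REF = re.compile(r"&#([0-9]+);")

def resolve_numeric_refs(text: str) -> str:
    """Replace each numeric reference with the character that has that
    number in the document character set, whatever the charset was."""
    return _NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)

# The same character arrives once as an ISO-8859-1 byte and once as a
# numeric reference; after both steps they are indistinguishable.
raw = b"caf\xe9 &#233;"
doc = to_document_chars(raw, "iso-8859-1")
print(resolve_numeric_refs(doc))
```

The point of the sketch is that the number in a numeric character
reference is interpreted in the document character set, not in the
"charset" the bytes happened to travel in.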

Now, the problem is that there may be MIME/HTTP "charsets" that cannot
be mapped to 10646 (the proposed document character set). Earlier, I
gave an example: iso-2022-jp. But Glenn pointed out that this example
is incorrect. Having re-checked the JIS X 0201 and 0208 vs Unicode
tables today, I now find that he appears to be correct. My only excuse
is that I haven't looked at those tables for a while, and in the past
they *did* appear to have the problem I mentioned. Sorry.

But let me try to give a different example then. Unicode/10646 was
based on a number of international, national and vendor standards.
However, after they "froze" their repertoire, I believe Taiwan came
out with one or more extended character sets. (Correct me, Glenn,
if I'm wrong.) As far as I can recall, someone even asked about
these new Taiwanese character sets on the net, and Glenn (or some other
Unicoder) answered that Unicode/10646's repertoire was frozen before
those Taiwanese character sets hit the streets.

So, how about defining the "document character set" to be the union
of the "charset" and 10646? Numeric character references could use
10646 codepoints for 10646 characters, and the characters not in 10646
could have other numbers (the actual numbers would be outside the
scope of the HTML spec). Use of non-10646 numeric char refs could be
discouraged or even prohibited, if the WG feels this is necessary.
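
A rough sketch of how such a "union" lookup might behave, under
stated assumptions: numbers below UCS_LIMIT are read as 10646
codepoints, and anything else is looked up in a table supplied along
with the "charset". UCS_LIMIT, resolve_ref, the extras table and the
allow_extras switch are all hypothetical; nothing here is taken from
the draft.

```python
from typing import Mapping, Tuple

# Assumption: this limit is simply what Python's chr() accepts. It is
# not a boundary defined by 10646 or by the HTML draft.
UCS_LIMIT = 0x110000

def resolve_ref(number: int,
                extras: Mapping[int, str],
                allow_extras: bool = True) -> Tuple[str, str]:
    """Resolve a numeric character reference in the union of 10646 and
    the charset's extra characters.

    Returns ("10646", char) or ("charset", name-of-extra-character).
    """
    if 0 <= number < UCS_LIMIT:
        return ("10646", chr(number))
    if not allow_extras:
        # The "prohibited" policy: non-10646 references are rejected.
        raise ValueError("non-10646 numeric character reference")
    if number in extras:
        return ("charset", extras[number])
    # Corresponds to an "UNUSED" entry in the character set description.
    raise ValueError("unused character number")

# Hypothetical extra character that is not (yet) in 10646.
extras = {UCS_LIMIT + 1: "hypothetical-extended-ideograph-1"}
print(resolve_ref(0x4E00, extras))
print(resolve_ref(UCS_LIMIT + 1, extras))
```

Setting allow_extras to False models the case where the WG decides
to prohibit non-10646 numeric character references outright.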

Or, if I might stand on the Unicode "side of the fence" for the moment,
how about sticking to 10646 for the document character set, and
telling people with new characters to first get them into 10646
(through normal ISO procedures) before using them in "text/html"?
We could say: "if you really want to use that character in HTML,
you'll have to use image/* or get it into 10646 first".

But if people already have fonts for these new characters and simply
want to use them as text, but can't get them into 10646 easily or
quickly (for whatever reason), they'd be disappointed.

>Our problem is not how to encompass the largest possible writing
>system imaginable, our problem is how to write a standard that
>goes beyond ISO Latin-1 and ISO-8859-X. I think the proposed
>direction of using Unicode as the document character set
>combined with a wide choice of MIME encodings does this
>well, and increases the scope of possible characters from
>255 to over 30 thousand. If this turns out not to be sufficient,
>I'm sure we can do an extension mechanism or format negotiation
>to allow for use of a different document character set.

Hmmm... perhaps that extension mechanism would be called "charset"? :-)

>I also wonder a little that more hasn't been said sooner: we
>got to the present proposal out of repeated discussions over
>several months.

Well, I can't speak for others, but I myself only switched to a job
involving HTML at the end of March.

Also, the word "10646" was only added to the draft recently,
and the author specifically asked for comments on that wording. So
I submitted comments.

Erik