Re: SGML/MIME charsets, ERCS/Unicode [was: New DTD (final version?) ]

Daniel W. Connolly
Thu, 9 Feb 95 11:53:39 EST

In message <>, Gavin Nicol writes:
>I find it quite incredible that some people obviously have time to dig
>into obscure places within the WWW, but do not read the mail coming
>through this list...

Guilty. :-{

But thanks to Mr. Lunde's pointers, I went back and read your paper
and the http stuff last night. That's what prompted me to try to
crystallize the problem we're trying to solve. I didn't see a clear
problem statement, and the paper had a "solve all the world's problems
in one fell swoop" slant.

> and then try to offer comment on issues that have
>been discussed before.

Say... you asked for commentary, no?

>>I didn't arbitrarily say 13,10. It comes from the MIME spec.
>>Once you use the MIME text/* rules to determine the lines, you then
>It was long ago decided in the HTTP working group that HTTP does not
>require strict conformance to MIME in this area. It was also noted
>that this is not in the MIME *standard*, but rather in a draft being

So we can agree that we haven't seen the last word on this. I think
that it's silly for HTTP to fail to interoperate cleanly with MIME.
I think there will be some give and take on both sides... we'll see.

I'd say if you want to use UCS-2, call it application/html.
It's a low cost option that keeps everybody happy, no?
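To be concrete, that option is nothing more than a different label on
the entity in the HTTP response -- something like the following (the
exact charset token is illustrative; the registered names for UCS-2
were still being sorted out):

```
HTTP/1.0 200 OK
Content-Type: application/html; charset=UCS-2
```

The point is that application/* subtypes aren't subject to the
text/* line-break canonicalization rules, so the MIME question goes
away entirely.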

> my paper (posted to this group but obviously
>not read).

Until now. Sorry, but you snuck it in right around the Christmas
holiday -- I was out of town, and I'm just now catching up.

>To a certain degree, yes. If one only changed the BASESET, you'd
>effectively be limiting yourself to character sets in which ASCII is a
>subset (admittedly a large group). However, we also open ourselves to
>problems: for example, wide spaces in SEPCHAR areas, and SHUNCHARS.

I'm willing to stipulate to all of the above for HTML (but to be
clear: we're talking about character sets whose repertoire contains
the repertoire of the ASCII character set: all the characters have to
be there, but they don't have to have the same numbers.)
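For readers who haven't stared at an SGML declaration: this is where
BASESET lives. The HTML 2.0 declaration describes its document
character set roughly like this (a sketch from memory, not a proposed
text); the change under discussion means substituting a different
BASESET/DESCSET pair while the ASCII-compatible clause stays put:

```
CHARSET
  BASESET  "ISO 646:1983//CHARSET
            International Reference Version (IRV)//ESC 2/5 4/0"
  DESCSET  0   9  UNUSED
           9   2  9
           11  2  UNUSED
           13  1  13
           14 18  UNUSED
           32 95  32
          127  1  UNUSED
  BASESET  "ISO Registration Number 100//CHARSET
            ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
  DESCSET 128 32  UNUSED
          160 96  32
```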

>One way or another, we are going to be relying on the browser writers
>to "get it right", and they probably won't.

I maintain that if we specify "it" sufficiently clearly, they will.
Only time will tell.

> Even if they do, it'll
>mean that (potentially) every character set supported will require
>changes to the parser, and that is probably the one place you don't
>want to change too much.

I think that there will be two prevalent parser implementation
strategies: one 8-bit implementation, and one 16-bit
implementation. They should even coexist in the same browser.

The 8-bit implementation would assume that chars <128 agree with
ASCII, and wouldn't care about chars >128 or UTF-8 or ISO-2022 style
multibyte chars -- it would just blindly pass them to the formatter.
Yes: every character _encoding_ would require changes to the lexer
in this case.
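A sketch of that 8-bit strategy (the event interface here is
hypothetical, not any browser's actual code): only byte values below
128 are examined for markup delimiters; everything else flows through
to the formatter untouched, whatever encoding it happens to be in.

```python
def lex_8bit(data: bytes):
    """8-bit lexer sketch: interpret only bytes < 128 (ASCII markup);
    pass bytes >= 128 through as opaque character data."""
    events = []                     # (kind, payload) for the formatter
    i = 0
    while i < len(data):
        if data[i] == ord('<'):     # ASCII markup delimiter
            j = data.index(b'>', i + 1)
            events.append(('tag', data[i + 1:j].decode('ascii')))
            i = j + 1
        else:
            j = i
            while j < len(data) and data[j] != ord('<'):
                j += 1
            # Latin-1 bytes, EUC bytes, UTF-8 octets... all land here
            # unexamined -- the lexer never looks at them.
            events.append(('data', data[i:j]))
            i = j
    return events
```

Note the cost: the lexer is cheap, but anything encoding-specific
(like recognizing multibyte sequences) has to happen downstream.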

The 16-bit version would probably be Unicode/ERCS-based, translating
other encodings to Unicode before parsing.

I think that the 8-bit implementation strategy is cost effective
for a lot of communities. I don't believe we can require everybody
to do the 16-bit implementation.
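And a sketch of the 16-bit strategy under the same hypothetical event
interface: all per-encoding knowledge is confined to one up-front
decode step, after which a single Unicode-based lexer serves every
charset (the charset names follow Python's codec registry, purely for
illustration):

```python
def parse_16bit(raw: bytes, charset: str):
    """16-bit strategy sketch: translate the wire encoding to Unicode
    first, then lex code points. Supporting a new charset touches only
    the decode step, never the lexer."""
    text = raw.decode(charset)      # e.g. 'iso-8859-1', 'euc-jp', 'utf-8'
    events = []
    i = 0
    while i < len(text):
        if text[i] == '<':
            j = text.index('>', i + 1)
            events.append(('tag', text[i + 1:j]))
            i = j + 1
        else:
            j = text.find('<', i)
            j = len(text) if j < 0 else j
            events.append(('data', text[i:j]))
            i = j
    return events
```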

>>I think going into excruciating amounts of detail is very much
>If I go into much more detail, I'd be supplying you with code...

I wouldn't mind. It's happened before.

>>Where do the fonts come from?
>I should note that I often hear "there's no fonts" as a reason for not
>using Unicode. This is simply not true. It is quite easy (at least on
>X11) to build up a set of fonts by "borrowing" from existing fonts,
>and rearranging the glyphs.

Sure, you can build the fonts. But how do you deploy them? The
Netscape folks just said they'll be relying on the system to
supply fonts. With that constraint, there's no amount of engineering
that Netscape can do to make Unicode work "out of the box" -- there
will have to be some "technical note" or some such about how you
install Unicode fonts on your X server.

This is why I wanted to focus on a problem statement: the way
I see it, the problem is that there are communities where the
sender's system is capable of doing Russian, and the receiver's
system is capable of doing Russian, but there's no way to get
Russian chars through the pipe. A clear specification of how
charset= interacts with the SGML declaration (and hence numeric
character references) solves that.
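To make that concrete, here is a toy resolver for numeric character
references, assuming the document character set has been pinned to
ISO 10646/Unicode code positions (as the ERCS-style proposals would
have it). The point is that &#1088; names CYRILLIC SMALL LETTER ER no
matter what transport charset carried the bytes -- even pure ASCII:

```python
import re

def resolve_ncrs(text: str) -> str:
    # Each &#NNN; denotes a code position in the *document* character
    # set (assumed here to be ISO 10646), independent of the charset=
    # parameter used on the wire.
    return re.sub(r'&#([0-9]+);', lambda m: chr(int(m.group(1))), text)
```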

You've cited cases where the client supports some set of charsets,
and the server supports another, disjoint set, and so they cannot
communicate at all. I'd say: if that's the case, what are the
odds that the _author_ and the _reader_ can communicate at all?
Just because my browser supports Unicode on the wire doesn't
mean that I can read Korean.

>>The need I see is for folks to be able to use the encodings that
>>they're used to using. I agree that Unicode will ultimately be
>>cost effective, but I wonder about how to get the ball rolling.
>In the figures I pointed out an extraordinarily easy way of enabling
>browsers to be "multi-charset and encoding" aware. I thought they were
>quite understandable, especially to people with SGML knowledge.

"extrordinarily easy..." compared to what? It's a nice clean model,
granted. But it looks like a pretty painful deployment process to me.