Re: SGML/MIME charsets, ERCS/Unicode [was: New DTD (final version?) ]

Gavin Nicol (gtn@ebt.com)
Thu, 9 Feb 95 09:24:19 EST

I find it quite incredible that some people obviously have time to dig
into obscure places within the WWW, but do not read the mail coming
through this list... and then try to offer comment on issues that have
been discussed before.

>I didn't arbitrarily say 13,10. It comes from the MIME spec.
..
>Once you use the MIME text/* rules to determine the lines, you then

It was long ago decided in the HTTP working group that HTTP does not
require strict conformance to MIME in this area. It was also noted
that this is not in the MIME *standard*, but rather in a draft being
circulated. I believe Larry corrected me on this issue, and I have a
note to that effect in my paper (posted to this group but obviously
not read).

>Ok.. I'm starting to understand what ERCS is. And for general
>MIME/SGML stuff, it looks like a good idea.

I'm ever so glad you think so.

>But I don't think it's necessary for HTML: nobody is allowed to
>change the syntax character set in an HTML document (or LCNMCHAR,
..

Yes.

>As to character classification: all the parser needs to know is: could
>this character possibly be markup? For any character outside the
>ISO-646 repertoire, the answer is no (for HTML, anyway).

Correct, and in Goldfarb, page 200, 7.3.2 he states that anything not
classified otherwise is automatically a dedicated data character.
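
A minimal sketch of that classification rule, in Python. The function
name and the delimiter, name and separator sets are illustrative
assumptions, not the classes from the actual HTML SGML declaration.
The only point is that anything outside the 7-bit ISO 646 range falls
straight through to "data" without a table lookup.

```python
# Illustrative character classes; NOT the real HTML SGML declaration.
DELIMITERS = set("<>&/=\"'!;#")
NAME_EXTRA = {".", "-"}
SEPARATORS = {" ", "\t", "\r", "\n"}


def classify(ch: str) -> str:
    """Return a rough class for a single character."""
    cp = ord(ch)
    # Anything outside the 7-bit ISO 646 range can never be markup,
    # so it is a dedicated data character by default.
    if cp >= 0x80:
        return "data"
    if ch in DELIMITERS:
        return "delimiter"
    if ch in SEPARATORS:
        return "separator"
    # Only ASCII reaches this point, so isalnum() means A-Z, a-z, 0-9.
    if ch.isalnum() or ch in NAME_EXTRA:
        return "name"
    return "data"
```

Note that a wide (ideographic) space such as U+3000 comes out as
"data" here, not "separator".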

>Perhaps ERCS is the most cost effective solution, but just for my own
>edification: is my conjecture that it is not necessary for HTML
>sound?

To a certain degree, yes. If one only changed the BASESET, you'd
effectively be limiting yourself to character sets in which ASCII is a
subset (admittedly a large group). However, we also open ourselves to
problems: for example, wide spaces in SEPCHAR areas, and SHUNCHARS.

One way or another, we are going to be relying on the browser writers
to "get it right", and they probably won't. Even if they do, it'll
mean that (potentially) every character set supported will require
changes to the parser, and that is probably the one place you don't
want to change too much.

>>I won't go into excruciating amounts of detail, but rather, let the
>>figures speak for themselves, and let inventive minds think of the
>>optimisations one could perform on the above.
>
>I think going into excruciating amounts of detail is very much
>necessary.

If I go into much more detail, I'd be supplying you with code...

>I think there is some implicit data in the above figures. I don't see
>how the MIME charset parameter fits in, for example. And it's not
>precisely clear what the "bandwidth" of all the arrows is.

The MIME charset decides which decoder and which normaliser (mapping
table) are used. Everything up to the normaliser would be of whatever
"width" the encoding required, and everything after that would
(ideally) be 16 bits wide (i.e. UCS-2).
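
A rough sketch of that pipeline in Python, under some loud
assumptions: the function name is hypothetical, and Python's codec
registry stands in for both the decoder and the normaliser, since its
codecs map directly to Unicode code points. It is not the code from
the paper.

```python
import codecs


def decode_to_ucs2(body: bytes, mime_charset: str) -> list[int]:
    """Decode body per the MIME charset, then emit 16-bit code units."""
    # The charset label picks the decoder (LookupError if unknown).
    codec = codecs.lookup(mime_charset)
    # Before this point the data is as "wide" as the encoding needs.
    text, _consumed = codec.decode(body)
    units = []
    for ch in text:
        cp = ord(ch)
        # UCS-2 is a fixed 16-bit form; reject anything that won't fit.
        if cp > 0xFFFF:
            raise ValueError("U+%X does not fit in UCS-2" % cp)
        units.append(cp)
    return units
```

For example, decode_to_ucs2(b"\xe9", "iso-8859-1") gives [0xE9], and
the same call with an EUC-JP body and "euc-jp" gives one 16-bit unit
per character. Everything downstream sees the same width either way.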

>For example: does every browser have to support the whole unicode
>character set?

Not necessarily, though that would be ideal. You can state preferences
via the Accept-Language: and Accept-Charset: headers, and limit the
output side to whatever you want (limiting the parser makes no sense
since supporting the whole thing is trivial).
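
As an illustration of limiting the output side, here is a hedged
Python sketch that picks an output encoding from an Accept-Charset:
value. The function names are hypothetical, and the ";q=" weighting
syntax is an assumption taken from the HTTP specifications rather
than anything stated in this thread.

```python
from typing import Optional


def parse_accept_charset(header: str) -> list[tuple[str, float]]:
    """Parse 'a;q=0.5, b' into (name, q) pairs, highest q first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # default weight when no q parameter is given
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((name.strip().lower(), q))
    # sort() is stable, so equal weights keep their header order.
    prefs.sort(key=lambda p: p[1], reverse=True)
    return prefs


def pick_output_charset(header: str, text: str) -> Optional[str]:
    """Return the first acceptable charset that can encode text."""
    for name, q in parse_accept_charset(header):
        if q <= 0:
            continue
        try:
            text.encode(name)
        except (LookupError, UnicodeEncodeError):
            continue  # unknown charset, or text not representable
        return name
    return None
```

The parser side is untouched: only the final re-encoding step looks
at the preference list.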

>Where do the fonts come from?

Have a look at the SAM ftp site <URL:ftp://ftp.cc.monash.edu/matty/>,
and on <URL:http://www.ntt.jp/> for some fonts. The former also has
9term (a Unicode terminal program for X) in another directory (I
forget which). These will give you a start toward building a full set
of Unicode fonts.

I should note that I often hear "there's no fonts" as a reason for not
using Unicode. This is simply not true. It is quite easy (at least on
X11) to build up a set of fonts by "borrowing" from existing fonts,
and rearranging the glyphs.

>The need I see is for folks to be able to use the encodings that
>they're used to using. I agree that Unicode will ultimately be
>cost effective, but I wonder about how to get the ball rolling.

In the figures I pointed out an extraordinarily easy way of enabling
browsers to be "multi-charset and encoding" aware. I thought they were
quite understandable, especially to people with SGML knowledge.

>Hmmm... I have a press release about Spyglass and NEC doing a

Spyglass, maybe. NEC. I used to work there...

>UTF-8 I'll buy. UCS-2 has some serious compatibility issues. You can't
>do UCS-2 and call it text/html, for example, because of the MIME CRLF
>rules. You'd have to call it application/html.

This myth has already been dispelled (at least as far as HTTP is
concerned).

Please read my paper. If you like, I'll send you a copy nicely marked
up in (unvalidated) HTML. I'll also send you ERCS if you want it.

Hey. It passed the Mosaic+Netscape test... :-)

---
Gavin "I must be a gadfly by now, surely." Nicol
NOT speaking for EBT!