Re: SGML/MIME charsets, ERCS/Unicode [was: New DTD (final version?) ]

Gavin Nicol (gtn@ebt.com)
Fri, 10 Feb 95 10:21:36 EST

>I didn't see a clear problem statement, and the paper had a "solve
>all the world's problems in one fell swoop" slant.

Guilty :-{ I will rectify that. Perhaps I should be in marketing...
(then you wouldn't have to shoot me as an engineer or a technical
writer. We'd have managers and marketing people left... sounds like a
very large software company I know of... :-))

>So we can agree that we haven't seen the last word on this. I think
>that it's silly for HTTP to fail to interoperate cleanly with MIME.
>I think there will be some give and take on both sides... we'll see.

If the MIME committee specifies this, the people who pass it should be
sentenced to life on Stewart Island, to get a real feel for
isolationism.

>I'd say if you want to use UCS-2, call it application/html.
>It's a low cost option that keeps everybody happy, no?

Why the special case based on character set and encoding only?

>I'm willing to stipulate to all of the above for HTML (but to be
>clear: we're talking about character sets whose repertoire contains
>the repertoire of the ASCII character set: all the characters have to
>be there, but they don't have to have the same numbers.)

How about character sets which might have multiple representations of
ASCII?

Also, this (potentially) might involve a mapping table. Why not map to
Unicode and be done with it?
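
To make the mapping point concrete, here is a minimal sketch in Python. The
helper name to_unicode and the choice of cp037 (one EBCDIC code page) and
Shift_JIS are my own assumptions for illustration, and NFKC folding is only
one possible way to treat a second representation of an ASCII letter:

```python
import unicodedata

def to_unicode(raw: bytes, charset: str) -> str:
    """Decode wire bytes into Unicode using the declared charset (hypothetical helper)."""
    return raw.decode(charset)

# The same markup in two charsets that number the ASCII repertoire differently.
ascii_bytes = "<P>".encode("ascii")
ebcdic_bytes = "<P>".encode("cp037")  # assumption: cp037 stands in for "different numbers"

# Different bytes on the wire, identical text once mapped to Unicode.
same_bytes = ascii_bytes == ebcdic_bytes
same_text = to_unicode(ascii_bytes, "ascii") == to_unicode(ebcdic_bytes, "cp037")

# A charset with a second representation of an ASCII letter:
# Shift_JIS can carry both "A" and the fullwidth form U+FF21.
wide = to_unicode("\uff21".encode("shift_jis"), "shift_jis")
is_plain_a = wide == "A"                                  # a distinct code point
folds_to_a = unicodedata.normalize("NFKC", wide) == "A"   # compatibility folding

print(same_bytes, same_text, is_plain_a, folds_to_a)
```

After the decode step the parser only ever sees one numbering, so the
per-charset knowledge lives in the decoder rather than in the lexer.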

>I maintain that if we specify "it" sufficiently clearly, they will.
>Only time will tell.

Considering the importance you place on formalism, I'm surprised at
your optimism. The Mosaic Communications people have already displayed
a remarkable ignorance of SGML... (and I apologise to the individuals,
but I feel this statement to be broadly true). I also doubt that you
can state "it" sufficiently clearly in the face of all possible
encodings and character sets...

>I think that there will be two prevalent parser implementation
>strategies: one 8-bit implementation, and one 16-bit
>implementation. They should even coexist in the same browser.

Isn't one the superset of the other...

>The 8-bit implementation would assume that chars <128 agree with
>ASCII, and wouldn't care about chars >128 or UTF-8 or ISO-2022 style
>multibyte chars -- it would just blindly pass them to the formatter.
>Yes: every character _encoding_ would require changes to the lexer
>in this case.

The lexer also has to deal with things like wide spaces and
whatnot. Entity references provide another source of
difficulty... though they are largely an unexplored part of the
inherent HTML model...
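
A rough sketch of the wide-space problem, in Python. The function
split_words_8bit is hypothetical, and UTF-8 is only an assumed wire encoding;
it is meant to mimic an 8-bit lexer that trusts bytes below 128 and passes
everything else through untouched:

```python
ASCII_SPACE = b" \t\r\n"

def split_words_8bit(data: bytes):
    """Split on ASCII whitespace only; bytes >= 128 are passed through blindly."""
    words, current = [], bytearray()
    for b in data:
        if b < 128 and b in ASCII_SPACE:
            # A separator the 8-bit lexer knows about: close the current word.
            if current:
                words.append(bytes(current))
                current = bytearray()
        else:
            current.append(b)
    if current:
        words.append(bytes(current))
    return words

# U+3000 IDEOGRAPHIC SPACE sits between "a" and "b"; an ASCII space before "c".
text = "a\u3000b c"
byte_words = split_words_8bit(text.encode("utf-8"))  # wide space is not seen
char_words = text.split()                            # character-level split sees it
print(len(byte_words), len(char_words))
```

The byte-level lexer never notices the wide space, because it only shows up
as a run of high bytes; a lexer working on characters does.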

>I think that the 8-bit implementation strategy is cost effective
>for a lot of communities. I don't believe we can require everybody
>to do the 16-bit implementation.

If you can do an 8 bit parser, 16 bits is a trivial expansion. The
only real problem is the display subsystem... and in most cases, that
is not overly difficult either (the only difference between 16 bit and 8 bit
is the size of the character classification tables). One could limit
the input to, and the output from, a 16 bit parser to what can be
handled by the 8 bit display subsystem...
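
As an illustration only (not EBT code), here is a Python sketch of a lexer
driven by a classification table. The class names, build_table and tokenize
are all made up for this example, and the classification rules are
deliberately crude; the point is that the same tokenizer runs unchanged
whether the table has 256 or 65536 entries:

```python
MARKUP, NAME, SPACE, DATA = range(4)

def build_table(size):
    """Classify every code point below `size` (256 for 8 bit, 65536 for 16 bit)."""
    table = []
    for cp in range(size):
        ch = chr(cp)
        if ch in "<>&;/":
            table.append(MARKUP)
        elif ch.isspace():
            table.append(SPACE)
        elif ch.isalnum():
            table.append(NAME)
        else:
            table.append(DATA)
    return table

def tokenize(code_points, table):
    """Group runs of same-class code points; each markup delimiter stands alone."""
    tokens = []
    for cp in code_points:
        cls = table[cp]
        if tokens and tokens[-1][0] == cls and cls != MARKUP:
            tokens[-1][1].append(cp)
        else:
            tokens.append((cls, [cp]))
    return [(cls, "".join(map(chr, cps))) for cls, cps in tokens]

table8 = build_table(256)
table16 = build_table(65536)

sample = [ord(c) for c in "<P>hi"]
print(tokenize(sample, table8) == tokenize(sample, table16))
```

Restricting a 16 bit parser to what an 8 bit display can handle then amounts
to rejecting or replacing code points of 256 and above on input and output.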

>>If I go into much more detail, I'd be supplying you with code...
>
>I wouldn't mind. It's happened before.

What is the benefit to me? To EBT?

That said, I *may* supply you with the detail you need, though as a
software engineer, I find it rather surprising to need to do so...

>Sure, you can build the fonts. But how do you deploy them? The

C'mon. Many, many applications install their own fonts at install
time. This is pure smoke. No mirrors.

>This is why I wanted to focus on a problem statement: the way
>I see it, the problem is that there are communities where the
>sender's system is capable of doing Russian, and the receiver's
>system is capable of doing Russian, but there's no way to get
>Russian chars through the pipe. A clear specification of how
>charset= interacts with the sgml declaration (and hence numeric
>character references) solves that.

Excuse me for not understanding the above...

If you have ERCS as a base, the only problem you face (once you've
done a very little homework) is fonts. One thing I have recently
proposed is an SGML compatible encoding of Unicode (that can be
encoded by UTF-8), that will allow you to specify retrieval of fonts
(if so desired), and language tags. Sadly, this is a real religious
war zone...

>You've cited cases where the client supports some set of charsets,
>and the server supports another, disjoint set, and so they cannot
>communicate at all. I'd say: if that's the case, what are the
>odds that the _author_ and the _reader_ can communicate at all?
>Just because my browser supports Unicode on the wire doesn't
>mean that I can read Korean.

Perhaps so, but there might be interesting links with English titles
that you might like to follow. Also, some people have a remarkable
talent for "intuition".

I would rather see a page of Korean than a Latin 1 message saying
"Sorry, don't even bother trying to read this.", which could very well
become the default case in a system without Unicode as a lingua
franca.

>>In the figures I pointed out an extraordinarily easy way of enabling
>>browsers to be "multi-charset and encoding" aware. I thought they were
>>quite understandable, especially to people with SGML knowledge.
>
>"extraordinarily easy..." compared to what? It's a nice clean model,
>granted. But it looks like a pretty painful deployment process to me.

Can you fall off a log ;-)

Actually, deployment would be so easy that it might very well weaken
the argument for the server having to supply Unicode! This is one
extra reason why I must carefully evaluate the potentials before I "go
into excruciating detail"....