Re: A truly multilingual WWW

David Goldsmith (david_goldsmith@taligent.com)
Mon, 16 Jan 95 19:46:06 EST

Sorry to take so long to reply to Gavin Nicol's proposal; first there were
the holidays, then things got busy at work.

In general, I think the document does a good job of discussing the issues,
and I agree (not surprisingly) with the idea of using Unicode for this
purpose. I do have some specific comments on aspects of the document.

Section 3.2.1

There is no character set "UNICODE-1-1-UCS-2"; the UCS-2 form is identified
by "UNICODE-1-1" (cf. RFC 1641). There is also no "UNICODE-1-1-UTF-8", but
I am in the process of getting it registered right now. I don't
particularly consider UTF-7 a good candidate for WWW work, since HTTP is an
8-bit-clean protocol, but if someone had another context in mind where a
7-bit safe form was needed, I'd like to know about it.

Similarly, later in the section it says that "[UTF-7] is perfect for WWW
and MIME uses". The same comment applies.
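
To make the comparison concrete, here is a minimal Python sketch that
encodes the same short string in both transformation formats. It relies on
Python's built-in "utf-7" and "utf-8" codecs; the helper names and the
sample text are my own illustrative assumptions, not anything taken from
the proposal.

```python
def encoded_forms(text):
    """Encode text in both transformation formats for comparison."""
    return {
        "utf-7": text.encode("utf-7"),
        "utf-8": text.encode("utf-8"),
    }


def is_seven_bit(data):
    """True if every byte fits in 7 bits (safe for a 7-bit transport)."""
    return all(b < 0x80 for b in data)


sample = "日本語"  # hypothetical sample text
forms = encoded_forms(sample)
for name, data in forms.items():
    label = "7-bit safe" if is_seven_bit(data) else "needs 8-bit transport"
    print(name, len(data), "bytes,", label)
```

On an 8-bit-clean channel such as HTTP, the UTF-8 form can be sent as is;
the UTF-7 form only pays off where a transport might strip or mangle the
high bit.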

It says at the end of 3.2.1 that the costs of translating documents to
Unicode might appear prohibitive, but that caches would solve the problem.
I guess I don't understand why transcoding a typical HTML document using a
table would be expensive. I would expect it to be much cheaper than the
time needed to push the document out over the wire. Why would this be
prohibitively expensive?
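
As a rough sketch of what I mean by table-driven transcoding, assuming a
single-byte source character set (KOI8-R is used here purely as an
example) and Python's standard codecs to populate the table: the function
names and sample document are hypothetical.

```python
def build_table(codec):
    """Build a 256-entry byte -> Unicode character lookup table.

    Bytes the codec leaves undefined map to U+FFFD (replacement character).
    """
    return [bytes([b]).decode(codec, errors="replace") for b in range(256)]


def transcode(data, table):
    """Map each input byte to its Unicode character with one table lookup."""
    return "".join(table[b] for b in data)


# Built once, then reused for every document served in this character set.
KOI8_R_TABLE = build_table("koi8_r")

legacy_bytes = "Привет".encode("koi8_r")  # stand-in for a stored document
text = transcode(legacy_bytes, KOI8_R_TABLE)
wire_form = text.encode("utf-8")  # or "utf-16-be" for a 16-bit form
```

The work is one table lookup per input byte, so the cost grows linearly
with document size. Multi-byte character sets need a somewhat more
elaborate decoder, but the principle is the same.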

Section 3.2.2

I'm of two minds about supporting other non-ASCII character sets. On the
one hand, lots of people are doing this already, and you can't very well
expect them to stop. On the other hand, formalizing this will impede the
goal of a fully internationalized WWW, where any client can talk to any
server. You would get into the situation where clients and
servers spoke different character sets, just like now (although you'd know
it up front, rather than finding out when garbage showed up on your
screen).

I suppose that if the number of supported character sets were limited
(unlikely), clients could support them all, translating through Unicode.
This doesn't seem like a probable outcome.
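
For illustration, translating through Unicode could look like the
following minimal Python sketch. The convert function is a hypothetical
helper, the two Cyrillic character sets are arbitrary examples, and
Python's built-in codecs stand in for whatever tables a client would
actually carry.

```python
def convert(data, source, target):
    """Convert between two legacy character sets by pivoting through Unicode."""
    text = data.decode(source)  # source character set -> Unicode
    # Unicode -> target character set; characters the target lacks become "?"
    return text.encode(target, errors="replace")


server_bytes = "Привет".encode("koi8_r")  # what the server sends
client_bytes = convert(server_bytes, "koi8_r", "iso8859_5")
```

With N character sets, a client needs only N tables to and from Unicode
rather than a converter for each of the N*(N-1) ordered pairs. Characters
missing from the target set are still lost, which is the limitation of any
non-Unicode target.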

This feature is probably necessary to support existing practice, but I
worry it will lead to "Balkanization" of the WWW, namely clients and
servers that don't speak to each other. Of course, even with Unicode, if
your server is serving up Thai, and I don't have any Thai fonts, I'm out of
luck. It would be nice, though, if that didn't happen just because you and
I use different character sets.

Section 3.2.4

This is probably the section I have the most problem with. Unicode
specifically was designed with the idea that attributes such as language,
fonts, etc. would be encoded out-of-band, via high level tags or even out
of the character stream entirely (see section 2 of the Unicode Standard,
volume 1). In particular, the private use area is specifically reserved for
use by corporations and end users, by private agreement, and trying to
assign a code in the private use area for general public use is contrary to
the spirit and probably the letter of the standard. I know that there has
been considerable resistance to other attempts to include formatting codes
in the standard, so I don't think this is the right approach.

On the other hand, using embedded escape sequences (like HTML) is perfectly
all right, and in fact on page 15 of volume 1 there is an example of this
kind. The only difference in this example is that a character code is not
assigned as an escape sequence.

Although the proposal could be amended to be compatible by using existing
character codes, I think this would duplicate work being done for HTML, so
I think this problem should be dealt with at a higher level. Furthermore, I
don't think it needs to be dealt with before Unicode can start being used
for WWW purposes. The only time such a tag will be used is (for example) if
a client could display both Japanese and Chinese; the tag would then help
specify which one to choose. For the vast majority of monolingual clients,
you'll use whatever font is available, so the tag would be ignored. Plain
Unicode is more than capable of basic legibility when displaying text,
without additional language or font tags (again, see the design principles
in section 2 of the standard). Such display may not be typographically
ideal, but then HTML is nowhere near rich enough for advanced typography
(right now, anyway).

For now, then, I propose that this issue be deferred. High-level tags are
the right way to deal with it because, like font, size, style, and similar
information, it's not critical to the basic legibility of the information.
If HTML 3.0 has a language tag, all the more reason not to invent another
mechanism. It certainly doesn't belong at the level of character codes.

Section 5.1

I'm pleased to hear that the issue of not requiring strict MIME line-break
rules has been dealt with. It would be nice if there were a reference of
some sort here so that people could see the decision in writing (assuming
it's written down somewhere).

Section 6

It's worth adding the URL for the Unicode WWW page, http://unicode.org/, in
addition to the ISBNs for the two books.

Again, thanks to Gavin for putting the work into writing up this proposal.
Does anyone else have comments?

----------------------------
David Goldsmith
david_goldsmith@taligent.com
Senior Scientist
Taligent, Inc.
10201 N. DeAnza Blvd.
Cupertino, CA 95014-2233