Re: format nego in HTML/10646?

Martin J Duerst (mduerst@ifi.unizh.ch)
Mon, 8 May 95 09:32:06 EDT

Let me first say that I agree with Dan's wording
in his recent mails regarding ISO 10646.

Now for the question of "how do we know if we can display
all the characters in a document?".
I think this is a difficult question, but I would like first to
state two very simple, but important things:

a) Specifying ISO 10646 as a document character set
(now or in the (near!) future) gives us a very straightforward
way to know that we can render all characters: implement
full ISO 10646! Now while this is not necessarily an easy
task, depending on what I can build on, it is at least clearly
defined and finite. Also, to provide all the characters in
a single font is within the range of what can be distributed
with an application (I guess it is between 1M and 2M, depending
on font size).
With any other alternative I know of, making HTML truly
multilingual would mean an implementation effort without
any clear end.
b) Not knowing whether I will be able to fully render the document
is not worse than what we have at present, e.g. with L10N.

This may be somewhat beyond the current discussion points, but
I would for the moment propose that there is no need to think
about a system that, in a more or less complex way, identifies
code ranges, and so on.
My main argument is that there is little difference whether I
get a document in e.g. Bengali, and can't display it, or whether
I can display that document, but can't read it. And I can assume
that a user that wants to work with Bengali on his/her computer
will install the necessary fonts and whatever else is necessary,
at least after (s)he gets a message that some character couldn't
be displayed properly. And I guess we will get language
information anyway in some internationalized HTML. If a
user then says that (s)he can read documents in Bengali,
we can as well presume that these documents can be
displayed.
The only case where this argument falls short is the case of
information about the Bengali script, such as a page entitled
"The beauty of the Bengali script", but then in such cases,
the use of inline images might be a better idea.

As error behaviour of future browsers, I could imagine:
a) "Not possible to display some characters. Replaced by '?'."
b) "Not possible to display some characters
in range xxxx-xxxx (Bengali). Replaced by '?'."
c) (e.g. on a Mac) "Not possible to display some characters
in range xxxx-xxxx (Bengali). Replaced by '?'. Installing
appropriate Worldscript extension might help (see...)."
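
The messages (a) and (b) above could be produced roughly as follows.
This is only a minimal sketch: the can_display callback is a
hypothetical stand-in for a real font lookup, and the range table
lists just three ISO 10646 ranges for illustration.

```python
# Illustrative subset of ISO 10646 script ranges (not a complete table).
SCRIPT_RANGES = [
    (0x0370, 0x03FF, "Greek"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0980, 0x09FF, "Bengali"),
]


def script_range(ch):
    """Return (first, last, name) for the range holding ch, or None."""
    cp = ord(ch)
    for first, last, name in SCRIPT_RANGES:
        if first <= cp <= last:
            return (first, last, name)
    return None


def render_with_fallback(text, can_display):
    """Replace undisplayable characters by '?' and build the messages.

    can_display is a hypothetical callback standing in for a font lookup.
    Returns (rendered_text, list_of_messages).
    """
    rendered = []
    missing = set()  # ranges (or None if unknown) of replaced characters
    for ch in text:
        if can_display(ch):
            rendered.append(ch)
        else:
            rendered.append("?")
            missing.add(script_range(ch))

    messages = []
    # Known ranges first (sorted by code point), unknown characters last.
    for entry in sorted(missing, key=lambda e: (e is None, e or ())):
        if entry is None:
            # Behaviour (a): no range information available.
            messages.append(
                "Not possible to display some characters. Replaced by '?'.")
        else:
            # Behaviour (b): name the range the characters belong to.
            first, last, name = entry
            messages.append(
                "Not possible to display some characters in range "
                f"{first:04X}-{last:04X} ({name}). Replaced by '?'.")
    return "".join(rendered), messages
```

For example, render_with_fallback("abc \u0995\u0996", lambda ch: ord(ch) < 0x100)
replaces the two Bengali letters by '?' and returns one message of
type (b) naming the range 0980-09FF.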

To put the question in another way: What would we do if we
detect (e.g. at the server) that the browser will not be able
to display all characters?
- Send nothing at all
- Send the stuff anyway
- Try to send another file with the same content (in this area,
we already have "language" and "encoding" to deal with,
and adding another multiplicative factor is not really a
good solution).
- Convert unknown characters to inline images (very nice, apart
from the boxes between characters and the high data volume
generated).

>>In other words, how is any information communicated that allows format
>>negotiation over the question "does the client have fonts to render this
>>document"? or if not "fonts", then "reasonable means"?
>
>If a client requests iso-2022, then the data provider should take that
>as a strong indication that the system can do Japanese, but little
>else.
What if a client requests several encodings? On a per-document
basis, should they be considered exclusive or cumulative?
(In the case of L10N, this is exclusive (well, you can switch fonts
to view one and the other part of the document successively),
while on other systems, it may be cumulative.)
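
To make the two readings concrete, here is a small sketch. It assumes
that "can display" can be approximated by "can be encoded in the
requested charset"; the function names are made up for illustration
and the charset names are those of Python's codecs, not anything
defined by HTTP or HTML.

```python
def encodable(text, charset):
    """True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def displayable_exclusive(text, charsets):
    """Exclusive reading: one single charset must cover the whole document."""
    return any(encodable(text, cs) for cs in charsets)


def displayable_cumulative(text, charsets):
    """Cumulative reading: each character must be covered by some charset."""
    return all(any(encodable(ch, cs) for cs in charsets) for ch in text)


# A document mixing a Latin-1 letter (u-umlaut) with two kanji.
document = "Z\u00fcrich \u6771\u4eac"
requested = ["iso-8859-1", "iso2022_jp"]

exclusive_ok = displayable_exclusive(document, requested)
cumulative_ok = displayable_cumulative(document, requested)
```

With these two charsets the mixed document fails the exclusive test
(neither charset alone covers it) but passes the cumulative one.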

>>or do we have to accept that the world has 65,500 characters that we
>>may be called upon to render at a moment's notice?

At present, it is somewhat below 40'000. And just having one
font in one size is not that memory-demanding.
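
As a rough back-of-the-envelope check, assuming an uncompressed
bitmap font with fixed-size 16x16 glyphs (my assumption for this
sketch; other font formats and sizes will give different numbers):

```python
def bitmap_font_bytes(glyphs, width, height):
    """Size in bytes of an uncompressed 1-bit-per-pixel bitmap font."""
    row_bytes = (width + 7) // 8  # each glyph row is padded to whole bytes
    return glyphs * row_bytes * height


# About 40000 characters at 16x16 pixels: 32 bytes per glyph.
current = bitmap_font_bytes(40000, 16, 16)   # 1280000 bytes, about 1.2 MB

# Upper bound if all 65536 16-bit code positions were filled.
maximum = bitmap_font_bytes(65536, 16, 16)   # 2097152 bytes, exactly 2 MB
```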

>I can understand the desire for this, but as I noted, even today,
>there are failure cases with just US-ASCII and ISO-8859-1.
>
>It might be very desirable to include ranges in the charset=xxxx
>parameter, and in Accept-Charset field. One advantage of using ISO
>10646 is that at least we have one single character set that we can
>use ranges from as indicators.

Ranges could be handy in some cases, e.g. to exclude right-to-left
scripts if the system can't handle them. But they are not
enough in the case of CJK. A Japanese system may have only
the JIS X 0208 kanji available, which are a subset of about 6000
out of 21000 ideographs, sprinkled over the whole range.
[Glenn: can you give some key words regarding "limited subsets"
and "selected subsets" in ISO 10646, just to give us an idea
whether they might be useful?]
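
The "sprinkled" point can be checked with a short sketch. It assumes
that Python's iso2022_jp codec is an acceptable stand-in for "the
JIS X 0208 kanji", and that the ideographs occupy 4E00-9FA5; it
collects the contiguous runs of covered code positions, i.e. the
ranges one would have to list.

```python
FIRST, LAST = 0x4E00, 0x9FA5  # unified ideograph range assumed here


def in_jis_x_0208(cp):
    """True if the code position can be encoded in ISO-2022-JP."""
    try:
        chr(cp).encode("iso2022_jp")
        return True
    except UnicodeEncodeError:
        return False


def covered_runs(first=FIRST, last=LAST):
    """Return the contiguous (start, end) runs of covered code positions."""
    runs = []
    start = None
    # Going one step past 'last' closes a run that reaches the end.
    for cp in range(first, last + 2):
        ok = cp <= last and in_jis_x_0208(cp)
        if ok and start is None:
            start = cp
        elif not ok and start is not None:
            runs.append((start, cp - 1))
            start = None
    return runs


runs = covered_runs()
covered = sum(end - start + 1 for start, end in runs)
print(f"{covered} of {LAST - FIRST + 1} ideographs, in {len(runs)} runs")
```

The number of runs gives an idea of how many separate ranges a
Japanese client would have to announce to describe just its kanji.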

>In fact, I think that if (as appears very likely) we *do* move to ISO
>10646 at some point, Terry's idea of specifying ranges should be given
>some consideration. However, I do not think that *not* having it
>causes enough problems to halt the adoption of ISO 10646.

I agree.

Regards, Martin.

----
Dr.sc. Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: mduerst@ifi.unizh.ch
----