(Fwd) Re: ISO/IEC 10646 as Document Character Set

Anoosh Hosseini (anoosh@gorgan.mti.sgi.com)
Thu, 4 May 95 15:50:51 EDT

--- Forwarded mail from gtn@ebt.com

Date: Thu, 4 May 95 04:21:42 EDT
Reply-To: gtn@ebt.com
From: Gavin Nicol <gtn@ebt.com>
To: Multiple recipients of list <html-wg@oclc.org>
Subject: Re: ISO/IEC 10646 as Document Character Set

>> This is a most inventive answer to the problem, but Roy has stated,
>> quite strongly, that only ISO-8859-1 will be standard.
>
>Well, perhaps my command of English fails, but I don't believe rough
>consensus means everyone agrees. I don't have any direct opinion, but
>my indirect opinion is that the many folks who seem to understand
>the issues still don't agree... that wouldn't meet my test of
>rough consensus.

1) The default will be ISO 8859-1
- This is a firm decision.
2) Clients should support the charset parameter
- This is a firm decision. Timing is the only issue.
3) The document character set of HTML should be ISO 10646
- Rough consensus on this.

--------------------
Agree with Point 1.

Point 2: Yes, clients need to start saying who they are and what
they support; then servers can detect them and send the proper
document/encoding. I have used the Accept-Charset stuff in my work
and have my server detect it. However, server code needs to be
improved to handle the Accept-Charset and Accept-Language headers
and deal with them appropriately. For three months I have been
sending charset=X-ISIRI4432 from a server, but many clients barf
at this: since they cannot resolve it via .mailcap/.mime.types,
they just save it to a file. Not very nice from a user's point of
view. The most recent Netscape just displays it as ISO-8859-1
gibberish, which is fine.

As for support, the clients should first start parsing the charset
parameter rather than treating "text/html; charset=xxxxx" as a single
string, which is what NCSA Mosaic currently does. The next problem is
getting that information from the MIME header parsing routines to the
HTML document instances (internal data structures). In this regard I
believe the L10N Mosaic has all the hooks there (they keep character
set/font info per document), so fixing their code should be minor
work.
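
To make this concrete, here is a minimal sketch of pulling the
charset parameter out of a Content-Type value in C. This is not
NCSA's actual code; the function name and the ISO-8859-1 fallback
are my own.

#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <ctype.h>

/* Extract the charset parameter from a Content-Type value,
 * e.g. "text/html; charset=ISO-8859-7" -> "ISO-8859-7".
 * Falls back to ISO-8859-1 when no parameter is present. */
static void get_charset(const char *ctype, char *out, size_t outlen)
{
    const char *p = ctype;

    strncpy(out, "ISO-8859-1", outlen - 1);   /* the default */
    out[outlen - 1] = '\0';

    while ((p = strchr(p, ';')) != NULL) {
        p++;
        while (*p && isspace((unsigned char)*p))
            p++;
        if (strncasecmp(p, "charset=", 8) == 0) {
            size_t i = 0;
            p += 8;
            while (*p && *p != ';' && !isspace((unsigned char)*p)
                   && i < outlen - 1)
                out[i++] = *p++;
            out[i] = '\0';
            return;
        }
    }
}

int main(void)
{
    char cs[64];
    get_charset("text/html; charset=X-ISIRI4432", cs, sizeof cs);
    printf("charset is %s\n", cs);    /* prints X-ISIRI4432 */
    return 0;
}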

As I stressed before, it is the clean negotiation between client and
server that is important, since we have all kinds of browsers on all
types of machines with various levels of non-Latin-1 support
(single-locale, multi-locale, multilingual, etc.).

Here is an example: we have support for an ISO-8859-X language using
a localized browser, but this browser only runs on platforms A and B.
We have external viewers for the language that can be invoked by a
correct setup in the .mailcap/.mime.types files on platforms C and D.
Now, how is a server to distinguish between these cases? The localized
browser sends a unique browser ID plus Accept-Charset and
Accept-Language info. Current servers do not handle this in an elegant
way, so one has to deal with it via CGI scripts reading environment
variables, as in the sketch below. Now to the external viewer: the
viewer can be used with any number of browsers available on a
platform, so on the server side we have no way of detecting its
presence other than manual user notification (pressing a button).
Browsers today do not parse the .mailcap files and send them to the
server (as far as I know).
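
A minimal CGI sketch of that environment-variable approach in C
follows. Servers expose request headers to CGI as HTTP_* environment
variables; the "LocalizedBrowser" user-agent string and the reuse of
X-ISIRI4432 here are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pick a charset based on what the request headers claim the
 * client can handle, then emit the CGI response header. */
int main(void)
{
    const char *agent   = getenv("HTTP_USER_AGENT");
    const char *charset = getenv("HTTP_ACCEPT_CHARSET");
    const char *chosen  = "ISO-8859-1";          /* safe default */

    if (charset && strstr(charset, "X-ISIRI4432"))
        chosen = "X-ISIRI4432";
    else if (agent && strstr(agent, "LocalizedBrowser"))
        chosen = "X-ISIRI4432";   /* known-good localized client */

    printf("Content-Type: text/html; charset=%s\r\n\r\n", chosen);
    /* ... emit the document in the chosen encoding here ... */
    return 0;
}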

Finally, on Point 3: I think 10646 is conceptually nice. However,
given that we accept different "representations" such as 8859-X
actually being sent to the browser via HTTP, that representation
becomes the "HTML document" the client sees. Servers may pre-map the
10646 to 8859-X if only one non-Latin-1 language is used, so the
10646 is really not "visible" to the outside world. Thus we are back
to HTML markup in US-ASCII and everything else as data.
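
As a minimal sketch of such a pre-mapping pass (my own illustration,
showing only the trivial Latin-1 case; a real 8859-X mapping needs a
per-charset translation table):

#include <stdio.h>

/* Fold a UCS-2 (ISO 10646 BMP) buffer down to ISO 8859-1 before
 * it goes on the wire; characters with no Latin-1 equivalent are
 * replaced with '?'. */
static void ucs2_to_latin1(const unsigned short *in, size_t n, FILE *out)
{
    size_t i;
    for (i = 0; i < n; i++)
        putc(in[i] < 0x100 ? (int)in[i] : '?', out);
}

int main(void)
{
    /* "Abc", then U+00E9 (e-acute, kept) and U+0627 (Arabic alef,
     * which has no Latin-1 equivalent and is replaced). */
    unsigned short doc[] = { 'A', 'b', 'c', 0x00E9, 0x0627 };
    ucs2_to_latin1(doc, sizeof doc / sizeof doc[0], stdout);
    putc('\n', stdout);
    return 0;
}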

-anoosh

(speaking for myself)