Re: Comments on: "Character Set" Considered Harmful

Gary Adams - Sun Microsystems Labs BOS (Gary.Adams@East.Sun.COM)
Thu, 27 Apr 95 10:22:26 EDT

There are still a few places in the current architecture of the
Web where internationalization issues may be falling through the
cracks. It's great to see competing ideas fleshed out in various
implementations and to see the well-known issues nailed down in the
specs that enable maximum interoperability over time.

I'd like to hear from more server implementors about how they
are planning to address some of the platform-specific issues with
serving multilingual documents.

> From: bobj@netscape.com (Bob Jung)
> Subject: Re: Comments on: "Character Set" Considered Harmful
>
> >For example, you could have your server use a file naming convention:
> >
> >file.html.utf8 => file.html in unicode-1-1-utf8
> >file.html.iso-2022-jp => file.html in iso-2022-jp
> >
> >This is simple enough for content developers to deal with, and doesn't
> >muck up either the standard or the protocol.
> >
> >Of course, there are lots of other ways to associate similar
> >meta-information with documents (.meta files, etc.), but since you're
> >already using 'file name' to indicate content-type, loading more of
> >the content-type information into the file name seems reasonable.
>
> I'm not convinced that this is reasonable. It seems that we don't want to start
> loading up file extensions with a lot more info. Is "charset" an exception?
> What do others think?
>
> (Another minor point, is that Windows filesystems have the 8byte filename
> + 3byte extension limitation. But will anyone use these for servers anyway?)
>
> If we must support canonical Unicode, we could have a special filename
> extension for it and leave the extensions for other HTML files alone:
>
> file.html.ucs Canonical Unicode
> file.html All other character code set encodings
> (ASCII supersets)
>
> >Another alternative is to allow content developers to store things in
> >some kind of canonical form with the MIME header as well as the
> >content, e.g., look for 'mime-version' in the head of the file you're
> >going to serve, and then use that header to override whatever the
> >server defaults might be.
>
> This "canonical form" would just be a string of bytes that happen to
> be equivalent to the ASCII string "mime-version: 1.0"?
>
> >================================================================
> >mime-version: 1.0
> >content-type: text/html; charset=iso-2022-jp
> >
> ><HTML>
> ></html>
> >================================================================
> >
> >All of these options satisfy your requirement without trying to put
> >the labelling inside the HTML document itself, and have the additional
> >advantage that they work with text/plain as well as text/html.
>
> In this scenario the server would peek into the file, read and remove any
> meta-info
> and then send the rest of the file as HTML, right? The big problem is for local
> files (file://). More and more data will be distributed on CD-ROMs and
> you'd like
> the same files to work via http:// or file://.
>
> (Another minor problem is that you will need an authoring tool (editor), so that
> content developers can prepend this ASCII meta-info to your Unicode HTML file.)
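
For concreteness, here is a minimal sketch of the peek-and-strip
behavior described above, written in Python purely for illustration;
the helper name and the header handling are assumptions, not part of
any existing httpd:

    # Sketch: look for an ASCII pre-document MIME header, strip it,
    # and report the charset it declares. Illustrative only; the
    # function name and header format are assumptions.
    def read_with_preheader(path, default_charset="us-ascii"):
        with open(path, "rb") as f:
            data = f.read()
        head, sep, body = data.partition(b"\n\n")
        if sep and head.lower().startswith(b"mime-version:"):
            for line in head.split(b"\n"):
                line = line.strip().lower()
                if line.startswith(b"content-type:") and b"charset=" in line:
                    charset = line.split(b"charset=", 1)[1].decode("ascii")
                    return charset, body    # serve the body, minus the header
        return default_charset, data        # no pre-header: serve file as-is

A file:// browser would need exactly the same logic, which is the
CD-ROM portability problem raised above.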

The bottom line here has to do with where to record the meta-information
associated with a Web document in a platform-independent
manner. Assuming that the document author used a tool
which knew how to deal with the source document's character set, how
does that information get recorded so that the Web server delivers it
to the Web client in a way that is rendered as the document author
intended?

The approaches suggested so far cover a wide range of platforms:

1. Attributed file systems - the operating system supports
arbitrary properties

2. Extensible filenames - just tack an extension onto the filename
(e.g. foo.ps.ucs.Z or foo_dir.tar.Z.isolat1)

3. Shadow files - when the filesystem has fixed-length names and
doesn't support attributes, shove the meta-information into a
companion file (see the sketch after this list)

4. Pre-document headers - embed the meta-information in a document
header if it's a writable resource (as sketched above)

5. Object database - encapsulate all the legacy data and remove
the legacy tools to prevent potential hygiene problems.
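
As a rough illustration of approaches 2 and 3, here is a sketch in
the same vein; the extension-to-charset table and the ".meta"
shadow-file convention are invented for the example, not proposed
standards:

    # Sketch of approaches 2 and 3: derive the charset from a trailing
    # filename extension, falling back to a ".meta" shadow file.
    # Both conventions are hypothetical.
    import os

    CHARSET_EXTENSIONS = {
        "utf8":        "unicode-1-1-utf8",
        "ucs":         "unicode-1-1",
        "iso-2022-jp": "iso-2022-jp",
        "isolat1":     "iso-8859-1",
    }

    def charset_for(path, default="us-ascii"):
        base, ext = os.path.splitext(path)
        ext = ext.lstrip(".").lower()
        if ext in CHARSET_EXTENSIONS:        # approach 2: extensible filenames
            return CHARSET_EXTENSIONS[ext]
        meta = path + ".meta"                # approach 3: shadow file
        if os.path.exists(meta):
            with open(meta) as f:
                for line in f:
                    if line.lower().startswith("charset:"):
                        return line.split(":", 1)[1].strip()
        return default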

It would be nice to have platform-specific standards for the various
approaches, even if we cannot achieve consensus in a platform-independent
way. From a least-common-denominator standpoint, the use of
shadow files or pre-headers is the most viable implementation.

Whatever approach is adopted, it would require modifications to the
Web server at the point of contact with the local "file system", i.e.
the network servers (httpd, gopherd, ftpd) and the local Web browser
client for file:// URLs.

> From: Albert-Lunde@nwu.edu (Albert Lunde)
> Subject: Re: Comments on: "Character Set" Considered Harmful
..
>
> Off-hand, this looks like a can-of-worms because of the mix
> of HTML,HTTP,MIME and SGML issues lurking around the edges.

This is the most salient summary of the problem I've heard in a
long time. I think the heart of the problem, and the eventual solution,
lies in the boundary conditions between MIME and SGML in the
current architecture of the Web.

If we were just looking at the HTML/HTTP issues, I think many would
claim that the problem is already solved by the protocol on the wire
and the ability to select particular language mappings within
documents. The complete solution needs to cover other document
types (text, PostScript, etc.), other delivery vehicles (ftp, gopher,
CD-ROM), and other interactions (text entry, file upload, locale-specific
renderings).

>
> If we can't "solve" this promptly I'd be tempted to say for 2.x (small x)
> that representing meta-information for documents on disk
> is an implementation issue, and look for nicer fixes
> later.

The last significant deployment of internationalization technology
on the Web was probably the L10N Mosaic browser. It was an excellent
proof of concept that demonstrated what could be done within the HTTP 1.0
and HTML 2.0 framework. Part of the refinement process in this arena
will require some form of competing implementations if we expect to see
any progress towards universal standards.

In the HTML 3.0 timeframe it would be good to see some standardization
on the part of the commercial browser developers in the handling of
character sets and in the rendering of World Wide Web documents.

$.02