Re: Charset labelling (Was: Comments on: "Character Set" Considered Harmful)

Gary Adams - Sun Microsystems Labs BOS (Gary.Adams@East.Sun.COM)
Fri, 28 Apr 95 08:28:33 EDT

> From bobj@netscape.com Thu Apr 27 20:47:18 1995
> Date: Thu, 27 Apr 1995 17:42:01 -0700
> X-Sender: bobj@pop.mcom.com
> Mime-Version: 1.0
> To: Gary.Adams@East (Gary Adams - Sun Microsystems Labs BOS), html-wg@oclc.org,
> bobj@netscape.com
> From: bobj@netscape.com (Bob Jung)
> Subject: Charset labelling (Was: Comments on: "Character Set" Considered Harmful)
> X-Lines: 156
>
> At 9:09 AM 4/27/95, Gary.Adams@East.Sun.COM (Gary Adams - Sun Microsystems
> Labs BOS wrote:
> >...
> >The approaches suggested so far cover a wide range of platforms :
> >...
> > 2. Extensible filnames - just tack on an extension to the filename
> > (e.g. foo.ps.ucs.Z or foo_dir.tar.Z.isolat1 )
>
> Windows filesystem has the 8byte filename + 3 byte extension restriction.
> Unfortunately, this "8.3" limitation also applies to CD-ROM filesystems. The
> labeling solution should continue to allow us to use the same files via http://
> and file://. Further file:// should work for CD-ROMS.

Even the 8.3 restriction on each component of a pathname need not be a
show stopper if the complete pathname is considered part of the "name".
For example:

    ucs\foo.ps
    Z\dir\isolat1\foo.tar
    ja\iso10646\index.html
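
To make the pathname idea concrete, here is a minimal sketch in Python of how
a client or server might recover labels from the directory components. The
label vocabularies and the function name are purely illustrative assumptions
on my part; no registry of such directory names exists.

```python
from pathlib import PureWindowsPath

# Hypothetical label vocabularies (assumptions, not a standard registry).
CHARSET_LABELS = {"ucs", "isolat1", "iso10646"}
ENCODING_LABELS = {"z"}      # e.g. a directory of compressed files
LANGUAGE_LABELS = {"ja"}


def labels_from_path(path):
    """Collect labels from the directory components of a DOS-style path."""
    # Drop the final component: only the directories carry labels here.
    directories = PureWindowsPath(path).parts[:-1]
    found = {"charset": None, "encoding": None, "language": None}
    for component in directories:
        key = component.lower()
        if key in CHARSET_LABELS:
            found["charset"] = key
        elif key in ENCODING_LABELS:
            found["encoding"] = key
        elif key in LANGUAGE_LABELS:
            found["language"] = key
        # Unrecognized components (e.g. "dir") are ordinary directories.
    return found
```

With these assumed vocabularies, labels_from_path(r"ja\iso10646\index.html")
would report language "ja" and charset "iso10646", and the file name itself
is free to stay within whatever limit the file system imposes.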

>
> Even if I don't use Windows as an HTTP server, I do save links using my
> Windows client. When I save as source, Windows truncates the filename
> extension to 3 bytes (e.g., "filename.html" becomes "filename.htm"). In the
> filename extension scenario, I will have lost my charset info.

The same use of directories can convey the same meaning on the client
side. To some extent, what happens on the client side will always be platform
dependent. My own client is not subject to the 8.3 limit that yours is.
Only a least-common-denominator approach would impose an 8.3 limit on
filenames for the whole web community.

>
> Using filename extensions is not a good solution to the labeling problem.
>

For most serious server platforms the "extensible file name" approach is
quite viable. I believe falling back to an "extensible pathname" approach
for more restricted file systems would be a reasonable compromise.

> > 3. Shadow files - when the filesystem is fixed length names and
> > doesn't support attributes, shove them in a companion file
>
> Disjoint information tends to have many problems because it easily becomes
> TOTALLY disjointed. E.g., I FTP'd some Mac file the other night and ended
> up with 0 length files because my UNIX file servers keeps the data and resource
> forks in separate files. Arghhhh. So as you can tell, I dislike
> "companion files".

The only way "companion files" can ever work is if we throw away all the legacy
tools that operate on a single file and restart the universe with tools that always
look for complete sets of things to operate on at the same time. In the '90s that
clump of things is often called an object, which embodies all of its own state
and behavior. I would not recommend companion files either.

>
> NEW PROPOSAL:
>
> The proposed HTML 3.0 spec includes the <META> head tag which can "embed
> document meta-information not defined by other HTML elements". The spec states:
>
>     In addition, HTTP servers can read the contents of the document head
>     to generate response headers corresponding to any elements defining a
>     value for the attribute HTTP-EQUIV. This provides document authors
>     with a mechanism (not necessarily the preferred one) for identifying
>     information that should be included in the response headers of an
>     HTTP request.
>
> I propose the following usage for charset labeling:
>
> <META HTTP-EQUIV=Content-Type contents="text/html; charset=iso-2022-jp">
>
> In this case, either the server can peek and send the header as specified
> in the <META> tag, or the client can peek (if not already stripped out by
> the server) and override the header from the server.
>
> We still have the chicken-and-egg problem for canonical Unicode as Larry
> Masinter pointed out. A couple possible remedies:
> (1) restricting HTML to UTF8 form of Unicode
> (2) use "filename.html" ("filename.htm" on Windows and CDROMS) for
>     "ASCII-tag-character-superset encodings" and use "filename.uhtml"
>     ("filename.uht" for Windows and CDROMS).
>
> Comments?
>

I definitely like the approach that keeps the labeling connected with the data.
The question then becomes, should the labeling be the responsibility of the SGML
or the MIME portion of the Web architecture.

There was a proposal earlier for :

mime-version: 1.0
content-type: text/html; charset=iso-2022-jp

<HTML>
...
</HTML>

and now your proposal for :

<HTML>
<HEAD>
<META HTTP-EQUIV=Content-Type contents="text/html; charset=iso-2022-jp">
</HEAD>
...
</HTML>

These seem like somewhat similar solutions. In either case the server will open
the file, process the header lines, and output the appropriate information on the
wire. For file:// based URLs (and any other protocol scheme where the server is not
capable of introducing the new directives over the wire) the same processing would
be handled in the client application.
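
As a rough sketch of the "peek" step for the <META> form, here is how either
side might pull the charset out of the document head, using Python's standard
html.parser and email modules. The class and function names are my own, the
1024-byte peek window is an arbitrary assumption, and accepting both "contents"
(as written in the proposal) and "content" as the attribute name is also an
assumption. It only works for encodings in which the tag characters are plain
ASCII bytes, which is exactly the chicken-and-egg limitation noted above.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetPeeker(HTMLParser):
    """Remember the charset of the first <META HTTP-EQUIV=Content-Type>."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "content-type":
            return
        # Assumption: accept either spelling of the value attribute.
        value = attr.get("content") or attr.get("contents")
        if value:
            # Reuse the MIME header parser to split off the charset parameter.
            header = Message()
            header["Content-Type"] = value
            self.charset = header.get_content_charset()


def peek_charset(raw, limit=1024):
    """Return the charset named in the document head, or None."""
    peeker = MetaCharsetPeeker()
    # latin-1 maps every byte to a character, so ASCII tags survive decoding.
    peeker.feed(raw[:limit].decode("latin-1"))
    return peeker.charset
```

A server could call peek_charset() before emitting its response headers; a
client handling a file:// URL could call the same routine on the bytes it
reads from disk.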

The subtle difference lies in where the attributes are applied to the data and what the
underlying containment model is. I tend to favor the MIME approach where I might
have a file doc.mime-headers-included or directory.tar.Z or foo.mime with an enclosed
multi-part-mime document (multilingual, multiformat, whatever).
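
For comparison, a minimal sketch of the MIME-container form, assuming a file
that simply begins with MIME header lines followed by a blank line and the
document body. The function name is hypothetical; the parsing itself uses
Python's standard email package.

```python
import email


def split_labeled_file(raw):
    """Split a file with leading MIME headers into (type, charset, body)."""
    message = email.message_from_bytes(raw)
    content_type = message.get_content_type()   # e.g. "text/html"
    charset = message.get_content_charset()     # None if no charset is given
    body = message.get_payload(decode=True)     # body bytes after the headers
    return content_type, charset, body
```

Here the label lives in the container rather than in the SGML, so the same
routine would apply unchanged to non-HTML bodies.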

This problem is similar to the information I might put in a file about its last
modified date vs the timestamp associated with a file in a directory in the file
system. I believe in giving the "container" as much responsibility over properties
of the data within it as possible. Without getting into the "object" wars,
let's make sure we pick an approach that will be upward compatible with more
advanced frameworks as well as finding an expedient way to encapsulate the legacy
data that people want to be viewing today.