Charset labelling (Was: Comments on: "Character Set" Considered Harmful)

Bob Jung (bobj@netscape.com)
Thu, 27 Apr 95 20:50:09 EDT

At 9:09 AM 4/27/95, Gary.Adams@East.Sun.COM (Gary Adams - Sun Microsystems
Labs BOS) wrote:
>...
>The approaches suggested so far cover a wide range of platforms :
>...
> 2. Extensible filenames - just tack on an extension to the filename
> (e.g. foo.ps.ucs.Z or foo_dir.tar.Z.isolat1 )

The Windows filesystem restricts names to an 8-byte filename plus a 3-byte
extension. Unfortunately, this "8.3" limitation also applies to CD-ROM
filesystems. The labeling solution should continue to let us use the same
files via http:// and file://. Furthermore, file:// should work for CD-ROMs.

Even if I don't use Windows as an HTTP server, I do save links using my
Windows client. When I save as source, Windows truncates the filename
extension to 3 bytes (e.g., "filename.html" becomes "filename.htm"). In the
filename extension scenario, I will have lost my charset info.

Using filename extensions is not a good solution to the labeling problem.
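
To make the loss concrete, here is a minimal Python sketch of a crude 8.3
truncation. The function name truncate_to_8_3 is made up for illustration, and
the rule it applies (drop extra dots, keep 8 characters of the base name and 3
of the last extension) is a simplifying assumption, not the exact behavior of
any particular Windows release.

```python
def truncate_to_8_3(filename: str) -> str:
    """Crude 8.3 model (an assumption, not real Windows behavior)."""
    base, dot, ext = filename.rpartition(".")
    if not dot:
        # No extension at all: only the 8-character base limit applies.
        return filename[:8]
    # An 8.3 name allows a single dot, so fold earlier dots into the base.
    base = base.replace(".", "")
    return f"{base[:8]}.{ext[:3]}"


if __name__ == "__main__":
    for name in ("filename.html", "foo.ps.ucs.Z", "foo_dir.tar.Z.isolat1"):
        print(name, "->", truncate_to_8_3(name))
```

Under this model "filename.html" becomes "filename.htm", and
"foo_dir.tar.Z.isolat1" becomes "foo_dirt.iso", so a charset label carried in
the extension is either cut short or swallowed into the base name.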

> 3. Shadow files - when the filesystem is fixed length names and
> doesn't support attributes, shove them in a companion file

Disjoint information tends to cause many problems because it easily becomes
TOTALLY disjointed. E.g., I FTP'd some Mac files the other night and ended
up with 0-length files because my UNIX file server keeps the data and resource
forks in separate files. Arghhhh. So as you can tell, I dislike
"companion files".

NEW PROPOSAL:

The proposed HTML 3.0 spec includes the <META> head tag which can "embed
document meta-information not defined by other HTML elements". The spec states:

    In addition, HTTP servers can read the contents of the document head to
    generate response headers corresponding to any elements defining a value
    for the attribute HTTP-EQUIV. This provides document authors with a
    mechanism (not necessarily the preferred one) for identifying information
    that should be included in the response headers of an HTTP request.

I propose the following usage for charset labeling:

<META HTTP-EQUIV=Content-Type CONTENT="text/html; charset=iso-2022-jp">

In this case, either the server can peek and send the header as specified in
the <META> tag, or the client can peek (if the tag has not already been
stripped out by the server) and override the header from the server.
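
The client-side peek could look roughly like the Python sketch below. It is an
illustration only: the names MetaCharsetSniffer and effective_charset, the
1024-byte scan window, and the Latin-1 decoding used for the scan are all
assumptions of mine, not part of the HTML 3.0 draft. It also assumes the
document's encoding leaves the ASCII tag characters intact.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetSniffer(HTMLParser):
    """Record the charset of the first <META HTTP-EQUIV=Content-Type ...>."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)  # HTMLParser lowercases attribute names
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        content = attributes.get("content")
        if content:
            # Reuse the stdlib MIME parser to pull out the charset parameter.
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()


def effective_charset(body: bytes, header_charset=None, window=1024):
    """Prefer the charset in the document's META tag over the HTTP header."""
    sniffer = MetaCharsetSniffer()
    # Latin-1 maps every byte to a character, so the scan cannot fail.
    sniffer.feed(body[:window].decode("latin-1"))
    return sniffer.charset or header_charset


if __name__ == "__main__":
    page = (b"<HTML><HEAD><META HTTP-EQUIV=Content-Type "
            b'CONTENT="text/html; charset=iso-2022-jp"></HEAD>')
    print(effective_charset(page, header_charset="iso-8859-1"))
```

If no such tag is found in the scan window, the sketch falls back to whatever
charset the server's header supplied.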

We still have the chicken-and-egg problem for canonical Unicode, as Larry
Masinter pointed out. A couple of possible remedies:
(1) restrict HTML to the UTF-8 form of Unicode
(2) use "filename.html" ("filename.htm" on Windows and CD-ROMs) for
    "ASCII-tag-character-superset encodings" and "filename.uhtml"
    ("filename.uht" on Windows and CD-ROMs) for canonical Unicode.
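
A small Python sketch of the chicken-and-egg problem: a reader that does not
yet know the charset can only scan for the ASCII bytes of the tag. The helper
name meta_visible_as_ascii is hypothetical, and "utf-16-be" stands in here for
canonical 16-bit Unicode (for characters in the Basic Multilingual Plane the
two byte sequences are the same).

```python
def meta_visible_as_ascii(markup: str, codec: str) -> bool:
    """True if the bytes b"<META" survive encoding and can be found by a
    reader that does not yet know the document's charset."""
    return b"<META" in markup.encode(codec)


# A META tag followed by three Japanese characters.
SAMPLE = '<META HTTP-EQUIV=Content-Type CONTENT="text/html">\u65e5\u672c\u8a9e'

if __name__ == "__main__":
    for codec in ("iso2022_jp", "utf-8", "utf-16-be"):
        print(codec, meta_visible_as_ascii(SAMPLE, codec))
```

ISO-2022-JP and UTF-8 keep the tag's ASCII bytes intact, so the tag can be
found before the charset is known. In the 16-bit form every tag character is
padded with a zero byte, so the same scan misses it.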

Comments?

For your convenience, I've attached the section on <META> from the 3.0 proposed
spec. See http://www.hpl.hp.co.uk/people/dsr/html/dochead.html.

-Bob

==================== HTML 3.0 spec <META> excerpt =================

META

The META element is used within the HEAD element to embed document
meta-information not defined by other HTML elements. Such information can be
extracted by servers/clients for use in identifying, indexing and cataloging
specialized document meta-information.

Although it is generally preferable to use named elements that have well
defined semantics for each type of meta-information, such as title, this
element is provided for situations where strict SGML parsing is necessary and
the local DTD is not extensible.

In addition, HTTP servers can read the contents of the document head to
generate response headers corresponding to any elements defining a value for
the attribute HTTP-EQUIV. This provides document authors with a mechanism (not
necessarily the preferred one) for identifying information that should be
included in the response headers of an HTTP request.

The META element has three attributes:

NAME
    Used to name a property such as author, publication date, etc. If absent,
    the name can be assumed to be the same as the value of HTTP-EQUIV.
CONTENT
    Used to supply a value for a named property.
HTTP-EQUIV
    This attribute binds the element to an HTTP response header. If the
    semantics of the HTTP response header named by this attribute is known,
    then the contents can be processed based on a well defined syntactic
    mapping, whether or not the DTD includes anything about it. HTTP header
    names are not case sensitive. If absent, the NAME attribute should be used
    to identify this meta-information and it should not be used within an
    HTTP response header.

Examples:

If the document contains:

<META HTTP-EQUIV=Expires CONTENT="Tue, 04 Dec 1993 21:29:02 GMT">
<META HTTP-EQUIV="Keywords" CONTENT="Nanotechnology, Biochemistry">
<META HTTP-EQUIV="Reply-to" CONTENT="dsr@w3.org (Dave Raggett)">

The server will include the following response headers:

Expires: Tue, 04 Dec 1993 21:29:02 GMT
Keywords: Nanotechnology, Biochemistry
Reply-to: dsr@w3.org (Dave Raggett)
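
[Illustration, not part of the spec excerpt.] The mapping from META elements
to response headers shown above can be sketched in Python roughly as follows.
The names HttpEquivCollector and response_headers_from_head are hypothetical,
and a real server would also need to decide how much of the document to scan.

```python
from html.parser import HTMLParser


class HttpEquivCollector(HTMLParser):
    """Collect (header name, value) pairs from META elements that carry
    an HTTP-EQUIV attribute; META elements with only NAME are skipped."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # HTMLParser lowercases attribute names
        name = attributes.get("http-equiv")
        value = attributes.get("content")
        if name and value is not None:
            self.headers.append((name, value))


def response_headers_from_head(head_html: str):
    """Return header lines a server could add for the given document head."""
    collector = HttpEquivCollector()
    collector.feed(head_html)
    return [f"{name}: {value}" for name, value in collector.headers]


if __name__ == "__main__":
    head = '''
    <META HTTP-EQUIV=Expires CONTENT="Tue, 04 Dec 1993 21:29:02 GMT">
    <META HTTP-EQUIV="Keywords" CONTENT="Nanotechnology, Biochemistry">
    <META NAME="IndexType" CONTENT="Service">
    '''
    for line in response_headers_from_head(head):
        print(line)
```

In this sketch the element that has only a NAME attribute produces no header
line.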

When the HTTP-EQUIV attribute is absent, the server should not generate an HTTP
response header for this meta-information, e.g.

<META NAME="IndexType" CONTENT="Service">

Do not use the META element to define information that should be associated
with an existing HTML element.

Example of an inappropriate use of the META element:

<META NAME="Title" CONTENT="The Etymology of Dunsel">

Do not name an HTTP-EQUIV attribute the same as a response header that should
typically only be generated by the HTTP server. Some inappropriate names are
"Server", "Date", and "Last-Modified". Whether a name is inappropriate depends
on the particular server implementation. It is recommended that servers ignore
any META elements that specify HTTP equivalents (case insensitively) to their
own reserved response headers.
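
[Illustration, not part of the spec excerpt.] The recommended filtering might
look like this in Python. The function name drop_reserved is made up, and the
reserved set simply reuses the three names listed above. An actual server
would supply its own list.

```python
# Header names this hypothetical server generates itself (lowercased).
RESERVED = frozenset({"server", "date", "last-modified"})


def drop_reserved(pairs, reserved=RESERVED):
    """Discard (name, value) pairs whose name matches a reserved header,
    comparing case-insensitively because HTTP header names ignore case."""
    return [(name, value) for name, value in pairs
            if name.lower() not in reserved]


if __name__ == "__main__":
    from_meta = [
        ("Expires", "Tue, 04 Dec 1993 21:29:02 GMT"),
        ("SERVER", "MyOwnServer/1.0"),
        ("last-modified", "Thu, 27 Apr 1995 20:50:09 GMT"),
    ]
    print(drop_reserved(from_meta))
```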

--
Bob Jung        bobj@netscape.com       +1 415 528-2688, fax +1 415 528-4122
Netscape Communications Corp.   501 E. Middlefield      Mtn View, CA   94041