Re: Comments on: "Character Set" Considered Harmful

Larry Masinter (masinter@parc.xerox.com)
Sat, 22 Apr 95 17:08:12 EDT

> While this helps content providers to get their documents rendered correctly,
> we do not see this as a total solution. We need a way to label within HTML,
> so that documents can be self-labeling and easier for content developers to
> add this info.

The general objection to labelling the character set within the
document itself is that it doesn't work for encodings that are not
compatible with US-ASCII, e.g., unicode-1-1-ucs2 (or whatever it will
be called): the reader has to know the encoding already in order to
find the label.

Instead, for example, your server could use a file naming convention:

file.html.utf8 => file.html in unicode-1-1-utf8
file.html.iso-2022-jp => file.html in iso-2022-jp

This is simple enough for content developers to deal with, and doesn't
muck up either the standard or the protocol.
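
As a rough sketch of what such a convention might look like on the
server side (Python here purely for illustration; the table, the
function name `resolve`, and the iso-8859-1 fallback are all
assumptions, not something any particular server does):

```python
import os

# Hypothetical suffix-to-charset table. The two entries mirror the
# examples above; a real server would presumably make this configurable.
CHARSET_SUFFIXES = {
    ".utf8": "unicode-1-1-utf8",
    ".iso-2022-jp": "iso-2022-jp",
}


def resolve(stored_name, default_charset="iso-8859-1"):
    """Map a stored file name to (served name, Content-Type value)."""
    base, suffix = os.path.splitext(stored_name)
    charset = CHARSET_SUFFIXES.get(suffix.lower())
    if charset is None:
        # No charset suffix: serve the name unchanged with the
        # (assumed) server default.
        base, charset = stored_name, default_charset
    # The remaining extension still selects the media type, as before.
    media = "text/html" if base.lower().endswith(".html") else "text/plain"
    return base, f"{media}; charset={charset}"


print(resolve("file.html.iso-2022-jp"))
```

The charset suffix is stripped first, so whatever rule already maps
'.html' to text/html keeps working on the remaining name.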

Of course, there are lots of other ways to associate similar
meta-information with documents (.meta files, etc.), but since you're
already using 'file name' to indicate content-type, loading more of
the content-type information into the file name seems reasonable.

Another alternative is to let content developers store documents in a
canonical form that includes the MIME header along with the content,
e.g., the server looks for 'mime-version' at the head of the file it
is about to serve, and then uses that header to override whatever the
server defaults might be.

================================================================
mime-version: 1.0
content-type: text/html; charset=iso-2022-jp

================================================================
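
A server could handle a file stored like this roughly as follows. This
is only a sketch under assumptions: the function name
`split_stored_headers` and the text/html default are made up for
illustration, and folded (continued) header lines are not handled.

```python
MARKER = b"mime-version:"


def split_stored_headers(raw, default_type="text/html"):
    """Return (Content-Type value, body bytes) for a stored file.

    If the file does not start with a mime-version header, it is
    served as-is with the server default.
    """
    if raw[:len(MARKER)].lower() != MARKER:
        return default_type, raw

    content_type = default_type
    offset = 0
    for line in raw.splitlines(keepends=True):
        offset += len(line)
        if not line.strip():
            # The first blank line ends the header block.
            break
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-type":
            # Header fields are US-ASCII even when the body is not.
            content_type = value.strip().decode("ascii")
    # Everything after the blank line is sent untouched.
    return content_type, raw[offset:]


stored = (b"mime-version: 1.0\n"
          b"content-type: text/html; charset=iso-2022-jp\n"
          b"\n"
          b"<html>...</html>\n")
print(split_stored_headers(stored))
```

Only the header block is ever interpreted; the body is passed through
as bytes, so the server never has to understand the character set it
is announcing.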

All of these options satisfy your requirement without trying to put
the labelling inside the HTML document itself, and have the additional
advantage that they work with text/plain as well as text/html.