Re: Comments on: "Character Set" Considered Harmful

Bob Jung (bobj@netscape.com)
Wed, 26 Apr 95 21:14:51 EDT

At 2:07 PM 4/22/95, Larry Masinter wrote:
>> While this helps content providers to get their documents rendered correctly,
>> we do not see this as a total solution. We need a way to label within HTML,
>> so that documents can be self-labeling and easier for content developers to
>> add this info.
>
>The general resistance to labelling character set within the document
>itself is that it doesn't work for things that are not consistent with
>US-ASCII, e.g., unicode-1-1-ucs2 (or whatever it will be called).

Yes, most commonly used character encodings are ASCII supersets.
Are there any examples of ones that are not?

Since Unicode is not really being used in HTML today, couldn't we stipulate
that Unicode HTML use the UTF-8 encoding? If so, this would not be an issue.
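
To illustrate why (a minimal Python sketch, not part of any proposal here):
UTF-8 passes every ASCII byte through unchanged, so an in-document charset
label stays findable, while a UCS-2-style two-byte encoding hides it:

  # Sketch: an ASCII charset label survives UTF-8 encoding byte-for-byte,
  # but not UCS-2, which is why in-document labeling fails for raw Unicode.
  label = '<META charset="...">'

  utf8 = label.encode('utf-8')
  assert utf8 == label.encode('ascii')   # identical bytes: a parser can find it

  ucs2 = label.encode('utf-16-be')       # UCS-2-style two-byte encoding
  assert b'<META' not in ucs2            # a plain ASCII scan will not find it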

One drawback to UTF-8 is that Asian characters expand in size. But
monolingual documents could still use the national encodings (JIS, GB, KSC,
Big5, etc.), which would also avoid the Unicode unified-CJK rendering
problem. Like it or not, these encodings will have to be supported because of
the amount of existing data.
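
To put a rough number on the expansion (a small Python sketch; EUC-JP stands
in here for the national encodings mentioned above):

  # Sketch: a kanji is 2 bytes in EUC-JP but 3 bytes in UTF-8, so purely
  # Japanese text grows by roughly 50% under UTF-8.
  ch = '\u65e5'                       # the kanji 'day/sun', U+65E5
  print(len(ch.encode('euc-jp')))     # 2
  print(len(ch.encode('utf-8')))      # 3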

>For example, you could have your server use a file naming convention:
>
>file.html.utf8 => file.html in unicode-1-1-utf8
>file.html.iso-2022-jp => file.html in iso-2022-jp
>
>This is simple enough for content developers to deal with, and doesn't
>muck up either the standard or the protocol.
>
>Of course, there are lots of other ways to associate similar
>meta-information with documents (.meta files, etc.), but since you're
>already using 'file name' to indicate content-type, loading more of
>the content-type information into the file name seems reasonable.

I'm not convinced that this is reasonable. It seems that we don't want to start
loading up file extensions with a lot more info. Is "charset" an exception?
What do others think?

(Another minor point is that Windows filesystems have the 8-character
filename + 3-character extension (8.3) limitation. But will anyone use these
for servers anyway?)

If we must support canonical Unicode, we could have a special filename
extension for it and leave the extensions for other HTML files alone:

file.html.ucs     Canonical Unicode
file.html         All other character encodings (ASCII supersets)
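
Either convention is trivial for a server to act on. A minimal sketch in
Python (the table and helper names are mine, purely illustrative, and the
charset names follow the spellings used earlier in this thread):

  # Sketch: map a trailing filename extension to a MIME charset parameter.
  # The table is illustrative, not a proposed registry.
  CHARSET_BY_EXT = {
      'utf8':        'unicode-1-1-utf8',
      'ucs':         'unicode-1-1-ucs2',
      'iso-2022-jp': 'iso-2022-jp',
  }

  def content_type(filename, default_charset='iso-8859-1'):
      ext = filename.rsplit('.', 1)[-1]
      charset = CHARSET_BY_EXT.get(ext, default_charset)
      return 'text/html; charset=%s' % charset

  print(content_type('file.html.ucs'))   # text/html; charset=unicode-1-1-ucs2
  print(content_type('file.html'))       # falls back to the server default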

>Another alternative is to allow content developers to store things in
>some kind of canonical form with the MIME header as well as the
>content, e.g., look for 'mime-version' in the head of the file you're
>going to serve, and then use that header to override whatever the
>server defaults might be.

This "canonical form" would just be a string of bytes that happen to
be equivalent to the ASCII string "mime-version: 1.0"?

>================================================================
>mime-version: 1.0
>content-type: text/html; charset=iso-2022-jp
>
><HTML>
></html>
>================================================================
>
>All of these options satisfy your requirement without trying to put
>the labelling inside the HTML document itself, and have the additional
>advantage that they work with text/plain as well as text/html.

In this scenario the server would peek into the file, read and remove any
meta-info, and then send the rest of the file as HTML, right? The big problem
is local files (file://). More and more data will be distributed on CD-ROM,
and you'd like the same files to work via both http:// and file://.

(Another minor problem is that content developers will need an authoring tool
(editor) that can prepend this ASCII meta-info to a Unicode HTML file.)
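
For the server side, the peek-and-strip step might look something like this
(a sketch only; the header parsing is deliberately minimal):

  # Sketch: peek at a file for a prepended MIME header block; if present,
  # use its Content-Type and serve only the body after the blank line.
  def serve(path, default_type='text/html'):
      with open(path, 'rb') as f:
          data = f.read()
      if data.lower().startswith(b'mime-version:'):
          head, _, body = data.partition(b'\n\n')   # headers end at a blank line
          for line in head.split(b'\n'):
              if line.lower().startswith(b'content-type:'):
                  return line.split(b':', 1)[1].strip().decode('ascii'), body
      return default_type, data                     # no meta-info: serve as-is

A file:// fetch has no server to do this stripping, which is exactly the
problem noted above.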

Do we care about documents in mixed encodings? With many Mac word processors,
I can create mixed-encoding documents. The Mosaic-L10N folks have done a lot
of work with the ISO-2022-xx encodings, and the X Window System has compound
text, which is similar to ISO 2022. How do I put these types of data on the Web?
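
For reference, ISO-2022-JP switches character sets in-band with escape
sequences, which is what makes such documents "mixed" at the byte level
(a small Python illustration):

  # Sketch: ISO-2022-JP embeds escape sequences that switch the active
  # character set mid-stream, so one byte stream mixes ASCII and JIS.
  data = 'abc \u65e5 xyz'.encode('iso-2022-jp')
  print(data)
  # b'abc \x1b$BF|\x1b(B xyz'  -- ESC $ B enters JIS X 0208, ESC ( B returns to ASCII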

One answer is that these documents must be converted to some form of Unicode
(UCS, UTF-8).

Another answer is to support encoding tags. I don't feel strongly either way,
but we should make this decision consciously.

If we do convert mixed-encoding text to Unicode, then we will need the
LANG tag to disambiguate unified CJK characters so that they render in the
"proper" fonts.

Regards,
Bob

--
Bob Jung        bobj@netscape.com       +1 415 528-2688, fax +1 415 528-4122
Netscape Communications Corp.   501 E. Middlefield      Mtn View, CA   94041