Re: Comments on: "Character Set" Considered Harmful

Gavin Nicol (gtn@ebt.com)
Wed, 26 Apr 95 22:26:18 EDT

>>The general resistance to labelling character set within the document
>>itself is that it doesn't work for things that are not consistent with
>>US-ASCII, e.g., unicode-1-1-ucs2 (or whatever it will be called).
>
>Yes, most popularly used character code set encodings are ASCII supersets.
>Are there any examples of ones that are not?

A greater objection is that it complicates the processing model immensely,
and in fact, it violates the separation of the entity manager and
parser found in SGML (i.e., the parser needs to be able to tell the
entity manager to alter its handling of the octet stream).
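
To make the quoted point about US-ASCII concrete, here is a small Python sketch. The label syntax (`charset=` inside a META-like tag) is purely hypothetical and is not taken from any HTML draft; `utf_16_be` stands in for UCS-2, which it matches for the characters used here. A scanner that looks for the label as ASCII octets finds it in an ASCII-superset encoding but not in a two-octet one, so the label cannot be read until the encoding is already known.

```python
# Hypothetical in-document label, searched for as raw ASCII octets.
LABEL = b"charset="

# Illustrative document only; the META syntax is an assumption.
doc = '<META charset="unicode-1-1-ucs2"><P>text</P>'

ascii_like = doc.encode("iso8859_1")  # an ASCII superset: one octet per character
ucs2 = doc.encode("utf_16_be")        # two octets per character, as in UCS-2

# In the ASCII superset the label octets appear contiguously.
print(LABEL in ascii_like)  # True

# In the two-octet encoding every character is preceded by a 0x00 octet,
# so the same ASCII octet sequence never appears.
print(LABEL in ucs2)        # False
```
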

>Since Unicode is not really being used today in HTML, couldn't we
>stipulate that Unicode HTML use UTF8 encoding?

No. I, for one, would like to use UCS-2. Anyway, it would still violate
the processing model for SGML.

>Do we care about docs of mixed encodings? With many Mac word
>processors, I can create documents in mixed-encodings. The
>Mosaic-L10N folks have been doing a lot of work with ISO-2022-xx
>encodings. X-windows has compound-text which is similar to 2022.
>How do I put these types of data on the Web?
>
>One answer is that these docs must be converted to some form of Unicode
>(ucs, utf8).

This is the simplest answer I can think of, hence my earlier
proposal. Having the browser support all these encodings is almost
impossible.

>Another answer is to support encoding tags.

We cannot do this within an HTML document without complicating things
immensely. The document should be converted to a single coded
character set before the parser proper ever sees it.
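
A minimal Python sketch of that separation follows. The function names `entity_manager` and `parser` are illustrative, and it is assumed that the encoding label arrives out of band (for example from the transport) rather than from inside the document. The entity manager turns octets into characters, and the toy parser only ever sees characters.

```python
import re


def entity_manager(octets: bytes, encoding: str) -> str:
    """Convert the raw octet stream into Unicode text.

    The encoding name is assumed to be supplied out of band.
    """
    return octets.decode(encoding)


def parser(text: str) -> list:
    """Toy stand-in for the parser proper: split text into tags and data.

    It works on characters only and knows nothing about octets.
    """
    return re.findall(r"<[^>]+>|[^<]+", text)


doc = "<P>\u65e5\u672c\u8a9e</P>"  # "<P>" + three Japanese characters + "</P>"

# The same document in three different encodings yields the same tokens,
# because the conversion happens before the parser is involved.
for enc in ("iso2022_jp", "euc_jp", "utf_16"):
    tokens = parser(entity_manager(doc.encode(enc), enc))
    print(enc, len(tokens))
```

With this split, supporting another encoding means adding a converter in front of the parser, and the parser itself does not change.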

>If we do convert mixed-encoding text to Unicode, then we will need to
>use the LANG tag to disambiguate unified CJK characters for rendering
>in the "proper" fonts.

Or use an encoding containing "hints", which could possibly include
glyph image specification hints as well...