Re: Comments on: "Character Set" Considered Harmful

Gavin Nicol (gtn@ebt.com)
Wed, 26 Apr 95 11:09:08 EDT

>I second Amanda's appeal to address the pressing pragmatic issues at
>hand, especially the labelling issue.

Adopting ISO 10646 as the document character set solves one set of
problems, but labelling is another kettle of fish altogether...

>While this helps content providers to get their documents rendered correctly,
>we do not see this as a total solution. We need a way to label within HTML,
>so that documents can be self-labeling and easier for content developers to
>add this info.

We cannot do this and remain SGML-conformant. In addition, we must be
able to decide the coded character set and encoding of the document
*before* we begin parsing.

>Almost all HTML I've seen has been in a single encoding.

I would hope so!

>And none of these clever techniques is 100% deterministic...
>And unfortunately, more and more Japanese Web data is in SJIS...

Yes, this is a problem. SJIS is moving into the Unix world now too...
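
As a small illustration of why guessing cannot be fully deterministic
(a sketch only; the byte values are picked for illustration and do not
come from any document in this thread), the same two bytes are valid in
both SJIS and EUC-JP but decode to different text:

```python
# Two bytes that are legal in both Shift_JIS and EUC-JP.
raw = bytes([0xB1, 0xB2])

# In Shift_JIS, each byte in 0xA1-0xDF is one half-width katakana.
as_sjis = raw.decode("shift_jis")

# In EUC-JP, the same pair is a single two-byte JIS X 0208 character.
as_euc = raw.decode("euc_jp")

# Both decodes succeed, so the bytes alone cannot settle the question.
print(len(as_sjis), len(as_euc), as_sjis == as_euc)
```

With input this short, no heuristic can tell the two readings apart
without an external label.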

From Dan and Gary's messages it would appear that the charset
parameter provides sufficient information to determine the mapping from
a sequence of bits to code points in a coded character set that can
also be determined from the charset parameter. As such, this seems the
best place for labelling to occur, though this will require a single
coded character set and encoding be used for the entire document. I
think that is a reasonable tradeoff.
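
A rough sketch of what this could look like on the receiving side,
assuming the label arrives as a Content-Type header value. The function
names and the ISO-8859-1 fallback are my own choices for illustration,
not something agreed in this thread:

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased.

    The default is an assumption made for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


def decode_body(raw, header_value):
    """Map the raw bytes to code points *before* any parsing begins.

    One charset is applied to the entire document. An unrecognised
    charset name raises LookupError, and bytes that are invalid in the
    named encoding raise UnicodeDecodeError.
    """
    return raw.decode(charset_from_content_type(header_value))


# Usage: the body is labelled externally, so nothing has to be guessed.
body = "日本語".encode("shift_jis")
text = decode_body(body, "text/html; charset=Shift_JIS")
print(text)
```

The parser only ever sees the decoded characters, which is why the
label has to be known before parsing starts.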

Can anyone think of cases where the charset parameter will *not*
suffice? I have a nagging feeling, but nothing firm in my mind...