Adopting ISO 10646 as the document character set solves one set of
problems, but labelling is another kettle of fish altogether...
>While this helps content providers to get their documents rendered correctly,
>we do not see this as a total solution. We need a way to label within HTML,
>so that documents can be self-labeling and easier for content developers to
>add this info.
We cannot do this and remain SGML-conformant. In addition, we must be
able to decide the coded character set and encoding of the document
*before* we begin parsing.
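
To make the ordering problem concrete, here is a rough sketch of what a parser faces without an external label. The sample text and the helper name decode_as are purely illustrative, not taken from any spec: the same octets turn into different characters depending on which encoding is assumed up front.

```python
# Illustrative sample text (U+65E5 U+672C U+8A9E); any non-ASCII text would do.
text = "\u65e5\u672c\u8a9e"
raw = text.encode("shift_jis")  # the octets as they would travel on the wire

def decode_as(data: bytes, encoding: str) -> str:
    """Decode octets under an assumed encoding (hypothetical helper).

    errors="replace" turns undecodable octets into U+FFFD instead of
    raising, so a wrong guess is visible as garbage in the result.
    """
    return data.decode(encoding, errors="replace")

right = decode_as(raw, "shift_jis")         # correct assumption
wrong_latin1 = decode_as(raw, "iso-8859-1")  # one character per octet
wrong_eucjp = decode_as(raw, "euc_jp")       # a different Japanese encoding

print(right == text, wrong_latin1 == text, wrong_eucjp == text)
```

Only the first guess recovers the original text, which is why the choice has to be made before the parser sees a single character.
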
>Almost all HTML I've seen has been in a single encoding.
I would hope so!
>And none of these clever techniques is 100% deterministic...
>And unfortunately, more and more Japanese Web data is in SJIS...
Yes, this is a problem. SJIS is moving into the Unix world now too...
From Dan and Gary's messages it would appear that the charset
parameter provides sufficient information to determine the mapping from
a sequence of bits to code points in a coded character set, which can
itself be determined from the charset parameter. As such, this seems the
best place for labelling to occur, though it will require that a single
coded character set and encoding be used for the entire document. I
think that is a reasonable tradeoff.
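
As a minimal sketch of that approach, assuming the label arrives as a MIME-style Content-Type value: the function name decode_document and the ISO-8859-1 fallback are my own assumptions for illustration, not something settled in this thread. It uses Python's standard email.message module to pull out the charset parameter.

```python
from email.message import Message

def decode_document(content_type: str, body: bytes,
                    default: str = "iso-8859-1") -> str:
    """Map octets to characters using only the charset parameter.

    `default` is an assumed fallback for unlabelled documents.
    The whole body is decoded with one encoding, which is the
    single-encoding-per-document tradeoff described above.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter lowercased,
    # or the fallback when the header carries no charset.
    charset = msg.get_content_charset(failobj=default)
    return body.decode(charset)

sjis_body = "\u65e5\u672c\u8a9e".encode("shift_jis")
labelled = decode_document("text/html; charset=Shift_JIS", sjis_body)
unlabelled = decode_document("text/html", b"caf\xe9")
```

Nothing inside the document is consulted here; the decoding decision is complete before any markup is parsed.
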
Can anyone think of cases where the charset parameter will *not*
suffice? I have a nagging feeling, but nothing firm in my mind...