Re: Charsets: Problem statement/requirements?

Gavin Nicol (gtn@ebt.com)
Thu, 9 Feb 95 09:32:43 EST

[Sorry, my mailer cut off half the last message. This is a repeat.]
Bob Jung writes:

>I believe these should be treated as a byte value in the
>"charset=something-other-than-latin1" encoding. If the content
>developer wants to specify a multibyte character, use something like:
>
> &#nnn&#nnn

This is quite incorrect. SGML knows nothing at all about bytes.

Joe English asks:

>How should numeric character references (&#nnn;) be interpreted in
>text/html; charset=something-other-than-latin1 ?

Well, if you have a copy of Goldfarb handy, have a look at page
161, section 4.5.2, where it says:

<quote>
If the function is wanted, a "named character reference"
incorporating the function name is used; otherwise a numeric character
reference is used, and the character is treated as data.
</quote>

and at page 357, section 9.5 note #2 which says:

<quote>
When a document is translated to a different document character set,
the character number of each numeric character reference must be
changed to the corresponding character number of the new set.
</quote>

Bottom line: the numeric character references must be mapped onto the
corresponding character in the new document character set <emph>if it
is being translated from one character set to another</emph>.
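
As a rough illustration of that renumbering, here is a minimal Python
sketch. It is not taken from any SGML tool: the function name
translate_ncrs is hypothetical, and Python's codec names are used as
stand-ins for document character sets. It only handles single-byte
sets (character numbers 0-255).

```python
import re

# Matches decimal numeric character references such as "&#233;".
_NCR = re.compile(r"&#([0-9]+);")

def translate_ncrs(text, src_charset, dst_charset):
    """Renumber numeric character references for a new character set.

    Assumption: both character sets are single-byte, and Python codec
    names stand in for SGML document character set declarations.
    """
    def swap(match):
        number = int(match.group(1))
        # Character number -> abstract character, via the old set.
        char = bytes([number]).decode(src_charset)
        # Abstract character -> character number in the new set.
        new_number = char.encode(dst_charset)[0]
        return "&#%d;" % new_number
    return _NCR.sub(swap, text)
```

For example, translate_ncrs("caf&#233;", "latin-1", "cp437") gives
"caf&#130;", since e-acute is number 233 in Latin-1 but 130 in code
page 437. A character missing from the new set raises an error here,
which is exactly the portability problem.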

>What if the MIME charset= parameter specifies a multibyte encoding?

The <emph>encoding</emph> has nothing to do with it because the parser
is not concerned at all with that: it only knows about characters, and
uses the specified character set to map codes to them.
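
To make that concrete, here is a small Python sketch of the lookup the
parser performs. The name resolve_ncr is hypothetical, Python codecs
stand in for document character sets, and only single-byte sets are
covered.

```python
def resolve_ncr(number, document_charset):
    """Return the character a numeric reference denotes.

    Assumption: the document character set is single-byte and is
    approximated here by a Python codec of the same name.
    """
    # The number is a position in the character set, not a byte in
    # some transfer encoding; bytes() is only used to do the lookup.
    return bytes([number]).decode(document_charset)
```

With this, resolve_ncr(233, "latin-1") is U+00E9 (e-acute), while
resolve_ncr(233, "iso8859-5") is U+0449 (Cyrillic shcha): the same
reference, two different characters, purely because the document
character set changed.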

>Will this break the "Added Latin 1 for HTML" entity set, which uses
>numeric character references to define all the entities?

Probably, because the MIME charset=xxxx is specifying the document
character set, and the numeric character references will be resolved
using it (because we are not performing a translation). However, I
would assume that the HTML parser would be smart enough to map them
onto something reasonable (that's the benefit of using named character
references: changes are isolated to one spot, and in something like
HTML, where the parsers are hard coded, one could also hard code the
mappings for all supported character sets).
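
A hard-coded table of that kind might look like the Python sketch
below. The two-entry table, the names ENTITY_TO_CHAR and entity_code,
and the use of Python codecs for the supported character sets are all
assumptions for illustration, not anything an existing HTML parser is
known to do.

```python
# Hypothetical subset of the "Added Latin 1" names, keyed to abstract
# characters rather than to numbers in any one character set.
ENTITY_TO_CHAR = {
    "eacute": "\u00e9",
    "ouml": "\u00f6",
}

def entity_code(name, document_charset):
    """Return the character number for a named entity, or None.

    Assumption: single-byte character sets, approximated by Python
    codecs. None means the set has no such character.
    """
    char = ENTITY_TO_CHAR[name]
    try:
        encoded = char.encode(document_charset)
    except UnicodeEncodeError:
        return None
    return encoded[0] if len(encoded) == 1 else None
```

Here entity_code("eacute", "latin-1") is 233 and
entity_code("eacute", "cp437") is 130, while
entity_code("eacute", "iso8859-5") is None. The document only ever
says &eacute;, and the per-set numbers live in one place.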

Mention SDATA, and most people cringe...

Anyway, it should be made very clear that anyone using numeric
character references is just asking for trouble because they are
non-portable. Sadly, Unicode cannot help here either, because in
that case the numeric character references would first be mapped to
characters in the original document character set, and then mapped
to the corresponding characters in the new character set (Unicode).

Of course, Unicode makes things somewhat easier because at least you
have a fair guarantee that the character is in there. In addition, if
we used Unicode as the BASESET and specified the character references
in terms of Unicode, we'd be a lot safer still.

----
NOT speaking for EBT!