Re: Charsets: Problem statement/requirements?

Gavin Nicol (gtn@ebt.com)
Thu, 9 Feb 95 09:32:30 EST

Bob Jung writes:

>I believe these should be treated as a byte value in the
>"charset=something-other-than-latin1" encoding. If the content
>developer wants to specify a multibyte character, use something like:
>
> &#nnn&#nnn

This is quite incorrect. SGML knows nothing at all about bytes.

Joe English asks:

>How should numeric character references (&#nnn;) be interpreted in
>text/html; charset=something-other-than-latin1 ?

Well, if you have a copy of Goldfarb handy, have a look at page
161, section 4.5.2, where it says:

<quote>
If the function is wanted, a "named character reference"
incorporating the function name is used; otherwise a numeric character
reference is used, and the character is treated as data.
</quote>

and at page 357, section 9.5 note #2 which says:

<quote>
When a document is translated to a different document character set,
the character number of each numeric character reference must be
changed to the corresponding character number of the new set.
</quote>

Bottom line: the numeric character references must be mapped onto the
corresponding characters in the new document character set <emph>if the
document is being translated from one character set to another</emph>.
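
To make the translation rule concrete, here is a minimal sketch of
renumbering references when a document moves between character sets.
It is not taken from the standard: the function name translate_ncrs is
made up for illustration, and it assumes both character sets are
single-byte sets known to Python's codec registry, so that a character
number and its one-byte code coincide.

```python
import re

# Matches decimal numeric character references such as &#233;
NCR = re.compile(r"&#(\d+);")

def translate_ncrs(text, old_charset, new_charset):
    """Renumber every &#nnn; from old_charset to new_charset.

    Assumption: both are single-byte character sets known to Python's
    codec registry, so a character number equals its one-byte code.
    """
    def remap(match):
        number = int(match.group(1))
        char = bytes([number]).decode(old_charset)   # number -> character
        new_number = char.encode(new_charset)[0]     # character -> new number
        return "&#%d;" % new_number
    return NCR.sub(remap, text)

translated = translate_ncrs("caf&#233;", "latin-1", "cp437")
print(translated)
```

Here e-acute is number 233 in Latin-1 and number 130 in code page 437,
so the reference changes from &#233; to &#130; while the character it
denotes stays the same. A character missing from the new set raises an
error in this sketch.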

>What if the MIME charset= parameter specifies a multibyte encoding?

The <emph>encoding</emph> has nothing to do with it: the parser is not
concerned with encodings at all. It only knows about characters, and
uses the specified character set to map codes to them.
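
A small sketch of the distinction, assuming purely for illustration
that the document character set is ISO 10646 (Unicode), which is what
Python's chr() numbers characters by. The helper resolve_ncrs is
hypothetical, not part of any parser.

```python
import re

# Matches decimal numeric character references such as &#233;
NCR = re.compile(r"&#(\d+);")

def resolve_ncrs(text):
    """Replace each &#nnn; with the character numbered nnn.

    Assumption: the document character set is Unicode / ISO 10646.
    """
    return NCR.sub(lambda m: chr(int(m.group(1))), text)

parsed = resolve_ncrs("caf&#233;")          # four characters, ending in e-acute

# The same characters under two different encodings: the byte counts
# differ, but the reference &#233; did not change.
utf8_bytes = parsed.encode("utf-8")
utf16_bytes = parsed.encode("utf-16-be")

# Treating each reference as one byte of an encoding gives the wrong
# answer: &#195;&#169; are the UTF-8 bytes of e-acute, but as character
# references they name two separate characters.
bytewise = resolve_ncrs("&#195;&#169;")
```

The reference names a character number; how many bytes that character
occupies on the wire is a property of the encoding alone.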

>Will this break the "Added Latin 1 for HTML" entity set, which uses
>numeric character references to define all the entities?

Probably, because the MIME charset=xxxx is specifying the document
character set, and the numeric character references will be resolved
using it (because we are not performing a translation). However, I
would assume that the HTML parser would be smart enough to map them
onto something reasonable. That's the benefit of using named character
references: changes are isolated to one spot. In something like HTML,
where the parsers are hard coded, one could also hard code the
mappings for all supported character sets.

This assumes that the characters are available in the character set...
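
As a rough sketch of such a hard-coded mapping: the table ENTITY_CHARS
and the function entity_number below are hypothetical, cover only three
entities, and assume single-byte character sets that Python's codecs
know about. A real parser would carry the full entity set.

```python
# Hypothetical table: entity name -> abstract character (not a real DTD).
ENTITY_CHARS = {"eacute": "\u00e9", "copy": "\u00a9", "nbsp": "\u00a0"}

def entity_number(name, charset):
    """Return the character number of a named entity in charset,
    or None when the character is not available in that set."""
    try:
        encoded = ENTITY_CHARS[name].encode(charset)
    except UnicodeEncodeError:
        return None                      # character missing from this set
    # Only single-byte character sets are handled in this sketch.
    return encoded[0] if len(encoded) == 1 else None

latin1_number = entity_number("eacute", "latin-1")
cp437_number = entity_number("eacute", "cp437")
cyrillic_number = entity_number("eacute", "iso8859_5")
```

The author of the document writes the same entity name everywhere, and
only the table lookup changes with the character set. For ISO 8859-5,
which has no e-acute, the lookup returns None, which is the
availability caveat above.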