Re: Character encoding and entities

Gavin Nicol (gtn@ebt.com)
Tue, 18 Jul 95 18:45:59 EDT

>Japanese text on the World-Wide Web (when served using ISO-2022-JP)
>may contain special characters like <, >, and &. Commonly, it appears
>that people leave these characters in their text, and then others have
>to fix their browsers[1] to interpret markup characters only outside
>of JIS text.
>
>However, I've seen the opposite solution with Chinese text, which may
>also include <, >, or &. For instance, at [2] these three characters,
>when encountered in Hz-encoded text from the GB character set, are
>escaped as the entities &lt;, &gt; and &amp; respectively.
>
>From my experience, the former treatment is more widespread than the
>latter. But the latter ensures that there is no chance of documents
>breaking parsers, while occasionally these problems occur in the
>former case. Does this mean the latter is more correct?

This depends very much on the encoding used. In some encodings, the
byte values of the characters you mention can occur as part of a
multi-byte character, and the parser should never see them as markup
in that case.
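
To make this concrete, here is a minimal sketch in Python (my choice of
language for illustration, using its built-in "iso2022_jp" codec). The
byte pair 0x3C 0x2B is just an assumed example of a two-byte JIS X 0208
character whose first byte has the same value as ASCII "<". A scan of
the raw bytes finds a "<", but the decoded text contains none.

```python
# Sketch: a byte equal to ASCII '<' (0x3C) inside an ISO-2022-JP
# two-byte character. The sample bytes are illustrative only.
ESC_JIS = b"\x1b$B"    # escape sequence: switch to JIS X 0208
ESC_ASCII = b"\x1b(B"  # escape sequence: switch back to ASCII

# "a", one two-byte character (0x3C 0x2B), then "b"
raw = b"a" + ESC_JIS + b"\x3c\x2b" + ESC_ASCII + b"b"

# Decode first, so that markup detection works on characters, not bytes.
text = raw.decode("iso2022_jp")

naive_hit = b"<" in raw     # byte-level scan: finds 0x3C
decoded_hit = "<" in text   # character-level scan: finds nothing

print(naive_hit, decoded_hit, len(text))
```

A parser that looks for markup only after decoding never mistakes the
first byte of that character for the start of a tag.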

If these characters occur *after* the text has been decoded, then
they should probably be replaced by entity references (using entities
is generally safer than using special characters).
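
As a rough sketch of that order of operations (decode, then escape,
then re-encode), again in Python: the helper name escape_decoded is
hypothetical, and html.escape from the standard library stands in for
whatever entity substitution an authoring tool would really use.

```python
import html

def escape_decoded(raw: bytes, encoding: str) -> bytes:
    """Replace literal <, > and & with entity references.

    Hypothetical helper: the substitution runs on decoded characters,
    so bytes inside multi-byte characters are never touched.
    """
    text = raw.decode(encoding)
    # quote=False leaves quotation marks alone; only &, < and > change.
    escaped = html.escape(text, quote=False)
    return escaped.encode(encoding)

# ASCII text with real '<' and '&', followed by one two-byte
# JIS X 0208 character (illustrative bytes 0x3C 0x2B).
sample = b"1 < 2 & " + b"\x1b$B\x3c\x2b\x1b(B"
result = escape_decoded(sample, "iso2022_jp")
print(result)
```

Only the "<" and "&" that are real characters after decoding become
&lt; and &amp;; the two-byte character passes through unchanged.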

As you have seen, the Japanese WWW is non-standard in some ways.