Re: Character encoding and entities

Amanda Walker (amanda@intercon.com)
Tue, 18 Jul 95 18:05:37 EDT

> Japanese text on the World-Wide Web (when served using ISO-2022-JP)
> may contain special characters like <, >, and &. Commonly, it appears
> that people leave these characters in their text, and then others have
> to fix their browsers[1] to interpret markup characters only outside
> of JIS text.

Correct. This shouldn't be a large change to the browser, actually. Handling
non-delimited two-byte encodings (Shift-JIS and EUC) is much more annoying.
With ISO 2022 you just treat anything outside ASCII or JIS Roman as character
data, not markup.
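
That is roughly all it takes. A quick sketch in C (not lifted from any real
browser, error handling omitted): track the ISO 2022 designation state, and
only look for markup while you are in ASCII or JIS Roman.

    #include <stdio.h>

    enum state { SINGLE_BYTE, JIS_X0208 };  /* ASCII/JIS Roman vs. two-byte JIS */

    void scan(FILE *in)
    {
        enum state s = SINGLE_BYTE;
        int c;

        while ((c = getc(in)) != EOF) {
            if (c == 0x1B) {                 /* ESC introduces a designation */
                int c1 = getc(in), c2 = getc(in);
                if (c1 == '$' && (c2 == '@' || c2 == 'B'))
                    s = JIS_X0208;           /* ESC $ @ / ESC $ B: two-byte JIS */
                else if (c1 == '(' && (c2 == 'B' || c2 == 'J'))
                    s = SINGLE_BYTE;         /* ESC ( B / ESC ( J: back to ASCII/Roman */
                continue;
            }
            if (s == JIS_X0208) {
                int c2 = getc(in);           /* second byte of the pair */
                (void)c2;                    /* emit the character; never treat it as markup */
            } else if (c == '<' || c == '>' || c == '&') {
                /* handle markup / entities as usual */
            } else {
                /* ordinary ASCII or JIS Roman text */
            }
        }
    }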

> From my experience, the former treatment is more widespread than the
> latter. But the latter ensures that there is no chance of documents
> breaking parsers, whereas in the former case these problems do
> occasionally occur. Does this mean the latter is more correct?

I think they are both stopgap measures. I am strongly in favor of using
ISO 10646 with language tags, but among the older approaches ISO 2022 is
much better than the non-delimited schemes. As long as you can unambiguously
tell what size each character is, it's not too hard to make your parser
handle wide characters. Having to guess is a pain in the neck.
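
With the non-delimited encodings, the best you can do is classify byte
values per encoding, along these lines (ranges quoted from memory; the
SS2/SS3 cases for EUC half-width kana and JIS X 0212 are omitted):

    int sjis_char_width(unsigned char b)
    {
        /* Shift-JIS lead bytes: 0x81-0x9F and 0xE0-0xEF */
        return ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF)) ? 2 : 1;
    }

    int euc_char_width(unsigned char b)
    {
        /* EUC-JP: 0xA1-0xFE begins a two-byte JIS X 0208 character */
        return (b >= 0xA1 && b <= 0xFE) ? 2 : 1;
    }

And none of that helps until you have guessed which of the two tables
applies to the document in the first place.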

If you are concerned about not "breaking" non-multilingual browsers, please
also allow multilingual ones to do the right thing without having to be
preconfigured for a particular encoding. This is the biggest headache
with trying to support Japanese right now, and I'd hate to see the same
thing happen with Chinese, Korean, or other languages.
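
One way to make that possible is for the server to label each document with
its encoding, for example via the MIME charset parameter, so the browser can
switch on its own instead of being told in advance:

    Content-Type: text/html; charset=iso-2022-jp

Anything that leaves the encoding implicit forces exactly the kind of
guessing I complained about above.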

Amanda Walker
InterCon Systems Corporation