Re: Comments on: "Character Set" Considered Harmful

Amanda Walker (amanda@intercon.com)
Mon, 17 Apr 95 10:32:23 EDT

> Is it not the case that there is no SGML text entity that is not
> a representation of characters? We're not dealing with the Aness of
> A, the Bness of B, but with representations of A and B. SGML is
> about manipulating representations of characters.

[This conversation is getting oddly neo-Platonic for an IETF working group :)]

One of the difficulties in this whole discussion is that there are multiple
levels of abstraction, and different people find different boundaries between
them to be significant. I, for example, find much of Dan's discussion of the
theoretical underpinnings of coded character sets to be precise but largely
irrelevant to the issues at hand. On the other hand, there are no doubt people
who find my focus on multilingual capability and a single universal character
encoding (namely IS 10646) peculiar, since it seems obvious that the actual
encoding(s) used are of no theoretical import--it's a purely pragmatic issue,
and hence not interesting :).

I'll quickly admit that on the particular issue of coded character sets
I am being purely pragmatic. I am not particularly concerned (in this
context) with the essential nature or philosophy of text, or even of
electronic representations of text. I am, rather, concerned with a small set
of pressing pragmatic issues. Principal among them is simply being able to
determine unambiguously what characters are being represented in an HTML
document so that I can display them. This is mostly a labelling issue,
although numeric character references are a problem--one that can be
pragmatically solved by restricting HTML to a single (large) document
character set.
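
To make the numeric-reference point concrete, here is a rough sketch (in
present-day Python, purely illustrative; the function name is mine): with
IS 10646 fixed as the document character set, a reference like &#233; names
exactly one character, no matter what transport encoding the document
arrived in.

    import re

    # Assumes IS 10646 is the single document character set, so the
    # number in &#NNN; is simply an index into it.
    def resolve_numeric_refs(text):
        return re.sub(r'&#([0-9]+);',
                      lambda m: chr(int(m.group(1))),
                      text)

    # resolve_numeric_refs("caf&#233;") -> "caf" + chr(233)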

The status quo in this regard is broken. As anyone who has tried to implement
Japanese support in their browser can confirm, there is a lot of content out
there whose interpretation cannot be determined unambiguously by software.
This is bad.

To give a concrete example, the Macintosh on which I am typing this message
can handle multilingual text just fine. At the moment, it has fonts & input
methods installed for European, Russian, Hebrew, Arabic, and Japanese. There
are HTML documents in existence that contain content in one or more of these.
All I want right now is some way of telling which encoding a given document
uses, so that I can match documents to the right fonts. So far, what we do is
cheat: ISO 2022 is easy to detect automatically even in mislabeled text, and
is reasonably popular, so we've started with Japanese.
There's only so far we can go with clever inferences, though.
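
For what it's worth, the "cheat" is roughly the following kind of heuristic
(a sketch in Python, with illustrative names; the escape sequences are the
standard ISO-2022-JP designators):

    def looks_like_iso_2022_jp(data):
        # ISO-2022-JP stays within 7 bits, and a JIS X 0208 designator
        # escape (ESC $ @ or ESC $ B) is a strong hint that the data is
        # Japanese, whatever the document happens to be labeled as.
        if any(byte >= 0x80 for byte in data):
            return False
        return b'\x1b$@' in data or b'\x1b$B' in data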

I don't mind translating between the transport representation and IS 10646, so
that the SGML layer only sees a sequence of IS 10646 code points. That's
simple. What I do mind is endless discussion about the distinctions between
characters, glyphs, codes, and the essential nature of reality, even though in
other contexts I may care greatly about such issues. They simply do not
address the issue at hand (which Gavin's proposal does, as I see it).
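
To be concrete about what "translating" means, something like this sketch
(assuming Python's standard codecs, with ISO-2022-JP as an example transport
encoding) is all the SGML layer needs in front of it:

    def to_code_points(raw_bytes, transport_encoding='iso-2022-jp'):
        # Decode the transport representation so that everything
        # downstream sees only IS 10646 code points.
        text = raw_bytes.decode(transport_encoding)
        return [ord(ch) for ch in text]

    # to_code_points(b'\x1b$B$3$s$K$A$O\x1b(B')
    #   -> [12371, 12435, 12395, 12385, 12399]  # hiragana "konnichiwa"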

I'm not trying to squelch anyone; I just think we're getting a bit far afield.

Amanda Walker
InterCon Systems Corporation