It's Zen: "Ask thyself, what is a document? An SGML document?"
>I'll quickly admit that on the particular issue of coded characters
>sets that I am being purely pragmatic.
Pragmatism (with regard to multilingual support) is the only approach
that pays off, or so my experience leads me to believe.
>The status quo in this regard is broken. As anyone who has tried to
>implement Japanese support in their browser can confirm, there is a
>lot of content out there whose interpretation cannot be determined
>unambiguously by software. This is bad.
Sad but true. Even sadder, many people here don't realise it's a
problem.
>ISO 2022 is easy to automatically detect even in mislabeled text, and
>is reasonably popular, so we've started with Japanese. There's only
>so far we can go with clever inferences, though.
Yes.
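For the record, the detection really is cheap. A rough sketch, in
Python for brevity, of the kind of heuristic I mean (the function name
and the short-cut of checking only the JIS X 0208 designators are
mine, nothing standardised):

    # Spotting ISO-2022-JP in possibly mislabelled text: the
    # designation escape sequences are distinctive, and the data
    # stays in the 7-bit range.
    ISO2022_JP_ESCAPES = (
        b"\x1b$@",   # designate JIS X 0208-1978
        b"\x1b$B",   # designate JIS X 0208-1983
        b"\x1b(B",   # return to ASCII
        b"\x1b(J",   # return to JIS X 0201 Roman
    )

    def looks_like_iso2022jp(data: bytes) -> bool:
        """True if the byte stream carries a JIS X 0208 designator and
        stays within the 7-bit range, whatever its label says."""
        has_designator = any(esc in data for esc in ISO2022_JP_ESCAPES[:2])
        seven_bit = all(b < 0x80 for b in data)
        return has_designator and seven_bit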
>I don't mind translating between the transport representation and IS
>10646, so that the SGML layer only sees a sequence of IS 10646 code
>points. That's simple. What I do mind is endless discussion about
>the distinctions between characters, glyphs, codes, and the essential
>nature of reality, even though in other contexts I may care greatly
>about such issues. They simply do not address the issue at hand
>(which Gavin's proposal does, as I see it).
My proposal late last year was that all browsers be able to
understand 3 Unicode encodings, and that all servers of multilingual
data be able to convert into one of them. I still think this is the
best approach, because *all* servers and clients would be able to
talk to each other.
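The server-side step is not much more than this (a sketch only; the
encoding names are illustrative, and which Unicode encoding goes on
the wire is exactly what would have to be agreed):

    # What a multilingual server would do: decode the document from
    # whatever coded character set it was authored in, re-encode it
    # in the agreed Unicode encoding for the wire.
    def to_wire_encoding(raw: bytes, source_encoding: str,
                         wire_encoding: str = "utf-8") -> bytes:
        text = raw.decode(source_encoding)   # e.g. "iso-2022-jp", "koi8-r"
        return text.encode(wire_encoding)    # or another Unicode encoding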
This idea was quickly crushed.
My latest proposal is for the document character set to be ISO
10646. This does not *require* conversion to ISO 10646 internally,
though that is obviously the ideal. It allows all current
multilingual data to be legal, with perhaps minor tweaks required in
some browsers. This is pragmatism :-)
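To make the abstraction concrete: with ISO 10646 as the document
character set, the parser works on 10646 code points and numeric
character references are resolved against 10646, no matter what
octets travelled over the wire. A minimal sketch (the function names
are mine, and I lean on library codecs for the transport decoding):

    # The SGML layer only ever sees ISO 10646 code points; the
    # transport encoding is the decoder's problem.
    def code_points(raw: bytes, transport_encoding: str) -> list:
        """Map the octet stream to the code-point sequence the parser sees."""
        return [ord(ch) for ch in raw.decode(transport_encoding)]

    def resolve_numeric_charref(ref: str) -> int:
        """Resolve a decimal numeric character reference, e.g. '&#21069;',
        against the document character set, i.e. ISO 10646."""
        assert ref.startswith("&#") and ref.endswith(";")
        return int(ref[2:-1])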
The theory behind why that works is where the Zen happens, and I
don't think most people are all *that* interested in it. I use the
abstractions primarily to legalise the pragmatism...