Re: Language hints in UNICODE private use area

Gavin Nicol (
Thu, 19 Jan 95 16:36:36 EST

>I tend to wholeheartedly agree with David's thought. I have great
>reservations with the idea of using the UNICODE private use area for
>encoding language hints. I have basically four reasons :

I have not proposed using Private Use codes to encode the hints
themselves. Rather, I have recently proposed two codes that delimit
such hints in an SGML-conforming manner. This allows us to
unambiguously determine that these tags are hints at every level of
the system. If we wanted to use actual codes, we might need about 18
characters, but I find this a rather distasteful solution.
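
As a minimal sketch of the delimiter idea, assuming two arbitrary
Private Use code points (U+E000 and U+E001 are illustrative choices
only, since no specific codes are named here), a receiver could
separate hints from text like this. The function name split_hints is
hypothetical.

```python
# Hypothetical delimiters: U+E000 / U+E001 are arbitrary Private Use
# code points picked for illustration; the proposal names no codes.
HINT_OPEN = "\ue000"
HINT_CLOSE = "\ue001"


def split_hints(text):
    """Return (plain_text, hints).

    hints is a list of (offset_in_plain_text, hint_string) pairs, so a
    display system can tell where each hint starts to apply.
    """
    plain = []
    hints = []
    pos = 0
    plain_len = 0
    while True:
        start = text.find(HINT_OPEN, pos)
        if start == -1:
            plain.append(text[pos:])
            break
        end = text.find(HINT_CLOSE, start + 1)
        if end == -1:
            # Unterminated hint: keep the remainder as ordinary text.
            plain.append(text[pos:])
            break
        chunk = text[pos:start]
        plain.append(chunk)
        plain_len += len(chunk)
        hints.append((plain_len, text[start + 1:end]))
        pos = end + 1
    return "".join(plain), hints
```

A system that does not care about hints can discard the second
return value; one that does can use the offsets to choose glyphs for
the text that follows each hint.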

>1) As David mentioned, Unicode was explicitly designed not to address
>this issue.

And I am not asking Unicode to solve it. If you like, think of my
recent ideas as formats for tags, or perhaps ideas for extending the
UCS-2 *encoding* of Unicode.

>2) How do you generalise this idea with encodings where there are no
>bytes left for language hinting ? I can write French, Dutch, English
>and German, for instance, using ISO-8859-1 : do I have to use Unicode
>even in a purely European setting so that I can tag texts ? What about
>the fact that today the text base available is mainly in ISO-latin-1 ?

I do not understand the above. My proposal is strictly aimed at
enabling display systems to make an intelligent decision about glyph
image selection for CJK unified characters, though there may be other
uses for such tags.

>3) It is easy to add, in upward compatible fashion, a tag called, for
>example, <lang=...>. Browsers that do not understand the tag will
>simply ignore it.

Requiring this will potentially complicate every single DTD in what
is essentially an infinite set. Limiting this discussion to just a
single DTD is pointless. We need a general solution.

>4) I have the impression that this may not be the proper forum
>(html-wg, http-wg) to discuss changes of interpretation of Unicode
>characters or codes. I am not convinced that these changes will easily
>be accepted by the Unicode consortium. It might be much easier to
>create an html tag for this purpose.

This discussion originated from my *real* proposal, which is to have
Unicode be the core character set that every browser should
understand. This other issue is of less import, but without
something, the Japanese will be reluctant to accept Unicode. Having
Unicode be the common character set does *not* mean that iso8859-1
could not be used. The Accept-Charset: parameter (which should
appear in http 1.1) and the charset= parameter on the text/html mime
type will provide ways of allowing character set negotiation.
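
To illustrate the negotiation, here is a rough sketch of how a server
might pick a character set from an Accept-Charset value. It assumes
the comma-separated list with optional q= weights that HTTP/1.1 later
specified; the function names and the charset names in the usage note
are illustrative assumptions, not part of any proposal here.

```python
def parse_accept_charset(header):
    """Parse 'iso-8859-1;q=0.5, iso-10646-ucs-2' into [(charset, q), ...]."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a charset listed without a weight is fully acceptable
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((name.strip().lower(), q))
    return prefs


def choose_charset(header, available):
    """Return the available charset the client weights highest, or None."""
    prefs = dict(parse_accept_charset(header))
    best, best_q = None, 0.0
    for charset in available:
        # Fall back to a "*" wildcard entry, if the client sent one.
        q = prefs.get(charset.lower(), prefs.get("*", 0.0))
        if q > best_q:
            best, best_q = charset, q
    return best
```

For example, choose_charset("iso-10646-ucs-2, iso-8859-1;q=0.5",
["ISO-8859-1", "ISO-10646-UCS-2"]) selects the Unicode encoding, while
a client that lists only iso-8859-1 would still be served Latin-1. The
server would then announce its choice in the charset= parameter of the
text/html content type.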

We are not discussing the interpretation of Unicode characters, but
rather the transfer encoding of text/html and other textual data
sent via http.