Re: Language hints in UNICODE private use area

yergeau@alis.ca
Fri, 20 Jan 95 10:34:28 EST

[Sorry if you see this twice; I seem to be having mail problems]

Gavin Nicol <gtn@ebt.com> writes:
>I have not proposed using Private Use Codes to encode hints, but
>rather I have recently proposed using two codes that can be used to
>delimit such hints in an SGML conforming manner.

It seems to me that the objections raised ("unlawful" use of Private
Area, wrong level to do that) still hold, whether you define delimiters
or the tags themselves.

>>2) How do you generalise this idea with encodings where there are no
>>bytes left for language hinting? I can write French, Dutch, English
>>and German, for instance, using ISO-8859-1: do I have to use Unicode
>>even in a purely European setting so that I can tag texts? What about
>>the fact that today the text base available is mainly in ISO-latin-1?
>
>I do not understand the above. My proposal is strictly aimed at
>enabling display systems to make an intelligent decision as to which
>glyph image to select for CJK unified characters, though there may be
>other uses for such tags.

The problem is that the tags are needed even when not using Unicode.
You need them for glyph image selection in some Eastern European
languages, and to distinguish Urdu from Arabic. You also need them for
other text processing operations (indexing, hyphenation, ...). I don't think
having language tags only when using Unicode is a good thing, nor do I
think having them at two levels (Unicode and HTML) is a good thing.

>This discussion originated from my *real* proposal which is to have
>Unicode be the core character set that every browser should
>understand.

A good idea.

>This other issue is of less import, but without something, the
>Japanese will be reluctant to accept Unicode.

Fine. I, too, want language tags.

>Having Unicode be the common character set does *not* mean that
>iso8859-1 could not be used. The Accept-Charset: parameter (which
>should appear in http 1.1) and the charset= parameter on the text/html
>mime type will provide ways of allowing character set negotiation.

But with your proposal language tags would only be available when using
Unicode, and this is not sufficient. C/J/K disambiguation is not the
whole issue.
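
To make the negotiation concrete, here is a minimal sketch in Python of
how a server might pick a character set from an Accept-Charset: value
and label the reply with charset=. It assumes a header syntax of
comma-separated names with optional ;q= weights; the function names and
the charset names in the example are illustrative only. Note that
nothing in this exchange carries any language information.

```python
def parse_accept_charset(header):
    """Parse e.g. 'iso-8859-1;q=0.5, unicode-1-1' into (name, q) pairs,
    best first. The ';q=' weight syntax is an assumption of this sketch."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # no weight given means full preference
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((name.strip().lower(), q))
    # sorted() is stable, so header order breaks ties between equal weights
    return sorted(prefs, key=lambda pref: -pref[1])


def pick_charset(header, available):
    """Return the best charset the server can supply, or None."""
    by_lower = {name.lower(): name for name in available}
    for name, q in parse_accept_charset(header):
        if q > 0 and name in by_lower:
            return by_lower[name]
    return None


def content_type(charset):
    """Build the Content-Type value for an HTML reply in that charset."""
    return "text/html; charset=" + charset
```

A client sending "iso-8859-1;q=0.5, unicode-1-1" to a server that holds
both would get the Unicode version; a Latin-1-only client would still
be served in ISO-8859-1.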

>We are not discussing the interpretation of Unicode characters, but
>rather the transfer encoding of text/html and other textual data
>sent via http.

Does saying "transfer encoding" mean that the hints should be
stripped right upon reception, before, say, saving to a file for
later perusal? Are the hints still available for a cut and paste
operation?
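
To illustrate what is at stake, here is a small Python sketch. The two
delimiter code points (U+E000 and U+E001) and the function names are
hypothetical, chosen only for illustration; the proposal does not fix
them here. It shows that once the hints are stripped on reception, the
saved or pasted text can no longer be split into language runs.

```python
import re

# Hypothetical delimiters taken from the Private Use Area for this sketch.
HINT_OPEN = "\uE000"
HINT_CLOSE = "\uE001"
_HINT = re.compile(HINT_OPEN + "([^" + HINT_CLOSE + "]*)" + HINT_CLOSE)


def strip_hints(text):
    """Remove every delimited hint, as a receiver might do on reception."""
    return _HINT.sub("", text)


def tagged_runs(text):
    """Split text into (language, run) pairs; language is None when no
    hint precedes the run."""
    runs = []
    lang = None
    pos = 0
    for match in _HINT.finditer(text):
        if match.start() > pos:
            runs.append((lang, text[pos:match.start()]))
        lang = match.group(1) or None
        pos = match.end()
    if pos < len(text):
        runs.append((lang, text[pos:]))
    return runs


received = (HINT_OPEN + "ja" + HINT_CLOSE + "漢字" +
            HINT_OPEN + "zh" + HINT_CLOSE + "汉字")
saved = strip_hints(received)
```

With the hints intact, tagged_runs(received) yields one Japanese run and
one Chinese run; tagged_runs(saved) yields a single run with no language
at all.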

-- 
Francois Yergeau  <yergeau@alis.ca>
Alis Technologies Inc.
+1 514 738-9171