Re: Charsets: Problem statement/requirements?

Gavin Nicol (gtn@ebt.com)
Fri, 10 Feb 95 13:37:09 EST

>I think this is better: <lang enc="iso-8859-1">....</lang> and <lang
>enc="iso-whatever">...</lang> etc.

I notice a suspicious lack of anything other than iso in the above...

>You can even include unicode which is iso-? this way.

Incorrect. You can include an iso-* compatible set of byte values that
can be mapped to Unicode characters.
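
To make the distinction concrete, here is a minimal sketch, assuming the
sample bytes are ISO-8859-1 text (the byte values are made up for
illustration). The document carries byte values; Unicode characters only
appear once a mapping is applied.

```python
# Assumption: the sample bytes are ISO-8859-1 (Latin-1) text.
raw = bytes([0x63, 0x61, 0x66, 0xE9])  # "cafe" with an e-acute, as four bytes

# Decoding applies the mapping from byte values to Unicode characters.
text = raw.decode("iso-8859-1")

# For Latin-1 specifically, each code point equals the byte value.
code_points = [ord(ch) for ch in text]
print([hex(cp) for cp in code_points])
```

Other iso-* sets need a real lookup table, since their upper halves do not
line up with Unicode code points the way Latin-1 does.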

>I think it's not necessary to differentiate languages using the same
>encoding scheme (e.g. French and German). One reason to
>differentiate particular languages is to facilitate automatic
>translation. But I think if a translator can't figure out which
>language by looking at the raw bytes of a known encoding scheme, it's
>pretty much useless.

And how is the translator supposed to do this if neither the encoding
nor the character set provides the information? You seem to have
contradicted yourself in a single paragraph...

>Hate to say me too, but I happen to like this tagged idea.

Tags are fine. They have a place, but not to allow <emph>multiple
character sets</>.

>The more I think about Unicode (since 1991) the more I think it
>sucks. It's not an extensible scheme (fixed 2-byte scheme) and
>requires too many changes to the current software.

Bollocks. The conversion to Unicode is usually less expensive, and
less difficult than supporting a myriad character sets and encodings
<emph>because the framework is the same, except that in Unicode, you
only have a single character set to worry about.</> Unicode has a
finite set of codes, that is true. So does any character set. Unicode
covers a very large portion of the languages currently used in an
admirable manner though.
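
A rough sketch of what "the framework is the same" can look like in
practice, assuming the legacy data arrives labelled with its encoding. The
sample inputs and the helper names to_unicode and count_characters are
hypothetical, chosen only for illustration.

```python
# Hypothetical inputs: data arriving in three different legacy encodings.
legacy_inputs = [
    ("iso-8859-1", b"na\xefve"),
    ("euc_jp", "\u65e5\u672c\u8a9e".encode("euc_jp")),
    ("big5", "\u4e2d\u6587".encode("big5")),
]

def to_unicode(encoding, data):
    """Convert once, at the boundary; everything after this sees only str."""
    return data.decode(encoding)

def count_characters(text):
    """Downstream logic that no longer cares about the source encoding."""
    return len(text)

counts = [count_characters(to_unicode(enc, data)) for enc, data in legacy_inputs]
print(counts)
```

The per-encoding knowledge lives in one conversion step; the rest of the
code handles a single character set.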

>Converting originally 1-byte charset language to 2-byte unicode is a
>waste of resources. I like the tagged idea, i.e. use a sequence of
>escape code to indicate the change of language settings. This
>approach is very extensible, it even works with Klingon, Minbarian,
>etc. languages which might as well use m.n-byte charset. ISO would
>then stand for Interstellar Standards Org. :-)

The extensibility is the very thing that kills it. Read ISO 2022, and
weep.
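
For a small taste of why, here is a sketch using ISO-2022-JP, one
restricted profile of ISO 2022, via Python's iso2022_jp codec (the sample
text is arbitrary). The meaning of a byte depends on which escape sequence
was seen last.

```python
ESC = b"\x1b"
to_kanji = ESC + b"$B"  # escape sequence designating JIS X 0208
to_ascii = ESC + b"(B"  # escape sequence designating ASCII

text = "abc\u65e5\u672c\u8a9eabc"  # ASCII, three kanji, ASCII again
encoded = text.encode("iso2022_jp")
print(encoded)  # 7-bit bytes with escape sequences around the kanji run

# Statefulness: the two bytes that carry the first kanji are also valid ASCII.
kanji_bytes = "\u65e5".encode("iso2022_jp")
payload = kanji_bytes[len(to_kanji):-len(to_ascii)]

# Without the preceding escape, the same two bytes decode as plain ASCII.
as_ascii = payload.decode("iso2022_jp")
```

A decoder has to track this state for every registered set it claims to
support, and full ISO 2022 allows far more sets and designations than this
one profile.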

>Seriously, HZ code, which is the most widely used 7-bit Chinese encoding
>scheme on USENET, uses this idea, i.e. uses ~{ and ~} as escape
>sequences. It mixes very well with the iso-8859-1 character set. This
>can't be said of other Chinese encoding schemes, e.g. big5, that
>don't use this approach.

Sure. So do EUC, ISO-2022-JP, and a myriad other encodings of
various character sets. The "myriad" and "various" in the last
sentence should be taken very seriously...
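
As an illustration of "myriad", a short sketch that encodes the same two
Chinese characters four ways using Python's bundled codecs. The choice of
sample text and of these four codecs is mine; gb2312 here is the 8-bit
EUC-style form.

```python
text = "\u4e2d\u6587"  # two Chinese characters
encodings = ["hz", "gb2312", "big5", "utf-16-be"]

# Same characters, one byte string per encoding.
encoded = {name: text.encode(name) for name in encodings}
for name, data in encoded.items():
    print(f"{name:10} {data!r}")

# HZ brackets its two-byte run with the ~{ and ~} escape sequences.
hz_bytes = encoded["hz"]
```

Every one of these needs its own decoder, and a receiver that guesses the
wrong one gets garbage without any error.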

> - We could even combine the two and allow container attributes to override
> the current default language.
>
> In any case, I think the language attribute should have the same
> allowed values as the language/dialect in the HTTP Accept-Language and
> Content-Language headers.

>I see the problem, i.e. you can't mix languages nicely within a UR*, if
>we use the <lang..> tags. Maybe we should try a different escape sequence
>or use &lt;lang..&gt; as an equivalent?

I see another problem: name a single character set that allows multiple
languages....