Re: Charsets: Problem statement/requirements?

Luke ~{B7?M~} (ylu@ccwf.cc.utexas.edu)
Thu, 9 Feb 95 19:54:56 EST

> - The TEI approach seems to be to add a LANG attribute which can
> be attached to containers i.e. <p lang=EN>. We could do this,
> but I think we'd need to define a new container of arbitrary
> length and unrelated to paragraph structure to markup words
> and other chunks of text.
>
> - We could define a new tag to indicate changes in the current
> language (and/or writing system?) i.e. <lang lang="en">
> (Is this too redundant?) or <ws lang="en">

I think this is better: <lang enc="iso-8859-1">....</lang> and <lang
enc="iso-whatever">...</lang> etc. You can even include Unicode (ISO
10646) this way. One of them can be set as the default in <HEAD>, depending
on which languages the document uses. I don't think it's necessary to
differentiate languages that use the same encoding scheme (e.g. French and
German). One use for differentiating particular languages is to facilitate
automatic translation. But I think if a translator can't figure out which
language it is by looking at the raw bytes of a known encoding scheme, it's
pretty much useless.
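
To make the tagged proposal concrete, here is a minimal sketch in Python. It
assumes the hypothetical <lang enc="..."> container described above (it is
not part of any HTML spec), that the enc value happens to be a name Python's
codec registry understands, and that text outside any tag is in a default
encoding. The function name decode_tagged is made up for illustration.

```python
import re

# Hypothetical markup from this thread: <lang enc="NAME">raw bytes</lang>
LANG_RE = re.compile(rb'<lang enc="([^"]+)">(.*?)</lang>', re.DOTALL)


def decode_tagged(raw: bytes, default: str = "iso-8859-1") -> str:
    """Decode a byte stream whose encoding switches at <lang> tags.

    Bytes outside any <lang> container are decoded with `default`;
    bytes inside are decoded with the encoding named by `enc`.
    """
    out = []
    pos = 0
    for m in LANG_RE.finditer(raw):
        # Text before this tag uses the document default.
        out.append(raw[pos:m.start()].decode(default))
        enc = m.group(1).decode("ascii")
        out.append(m.group(2).decode(enc))
        pos = m.end()
    # Trailing text after the last tag.
    out.append(raw[pos:].decode(default))
    return "".join(out)


if __name__ == "__main__":
    doc = b'Hello <lang enc="iso-8859-1">caf\xe9</lang>!'
    print(decode_tagged(doc))
```

The sketch scans the raw bytes for the closing tag, so it only works when the
tagged bytes cannot themselves look like "</lang>". A 7-bit scheme such as HZ
could in principle produce such bytes; the sketch ignores that case.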

> (I'm leaning toward this idea.)

Hate to say me too, but I happen to like this tagged idea. The more I
think about Unicode (since 1991) the more I think it sucks. It's not an
extensible scheme (fixed 2-byte scheme) and requires too many changes to
current software. Converting a language that originally used a 1-byte
charset to 2-byte Unicode is a waste of resources. I like the tagged idea,
i.e. using a sequence of escape codes to indicate a change of language
settings. This approach is very extensible; it even works with Klingon,
Minbarian, etc., languages which might as well use an m.n-byte charset. ISO
would then stand for Interstellar Standards Org. :-)

Seriously, HZ code, which is the most widely used 7-bit Chinese encoding
scheme on USENET, uses this idea, i.e. it uses ~{ and ~} as escape
sequences. It mixes very well with the iso-8859-1 character set. This can't
be said of other Chinese encoding schemes, e.g. Big5, that don't use this
approach.
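
As an illustration of how the HZ escapes work, here is a rough decoder sketch
in Python. It assumes the usual HZ conventions: ~{ switches to two-byte GB
2312 mode, ~} switches back to ASCII, ~~ is a literal tilde, and ~ followed by
a newline is a line continuation. The name decode_hz is made up for this
example; Python also ships a built-in "hz" codec, which the sketch can be
compared against.

```python
def decode_hz(data: bytes) -> str:
    """Decode HZ-encoded bytes by tracking the ~{ ... ~} escape state."""
    out = []
    i = 0
    gb_mode = False  # False: ASCII mode, True: two-byte GB 2312 mode
    while i < len(data):
        pair = data[i:i + 2]
        if not gb_mode:
            if pair == b"~{":        # escape into GB mode
                gb_mode = True
                i += 2
            elif pair == b"~~":      # escaped literal tilde
                out.append("~")
                i += 2
            elif pair == b"~\n":     # line continuation, emits nothing
                i += 2
            else:                    # ordinary 7-bit ASCII byte
                out.append(data[i:i + 1].decode("ascii"))
                i += 1
        else:
            if pair == b"~}":        # escape back to ASCII mode
                gb_mode = False
                i += 2
            else:
                # Set the high bit on both bytes to get the GB 2312 code.
                out.append(bytes(b | 0x80 for b in pair).decode("gb2312"))
                i += 2
    return "".join(out)


if __name__ == "__main__":
    sample = b"Luke ~{B7?M~} wrote this"
    print(decode_hz(sample))
    # Cross-check against Python's built-in codec.
    print(decode_hz(sample) == sample.decode("hz"))
```

Because everything outside ~{ ... ~} is plain 7-bit ASCII, the surrounding
English text passes through untouched, which is the mixing property described
above.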

> - We could even combine the two and allow container attributes to override
> the current default language.
>
> In any case, I think the language attribute should have the same
> allowed values as the language/dialect in the HTTP Accept-Language and
> Content-Language headers.

yep.

> If we define a new tag we might consider if there are other attributes
> that could be used to further specify the writing system. (I was looking
> at the stuff in the Text Encoding Initiative and thinking it would
> be nice to be able to put in an HREF to one of their writing systems defs
> but it doesn't look like they can be decoded to a usable form by
> a program. So we might swipe some ideas but not their whole scheme.)

I see the problem, i.e. you can't mix languages nicely within a UR* if
we use the <lang..> tags. Maybe we should try a different escape sequence
or use &lt;lang..&gt; as an equivalent?

__Luke

--
Luke Y. Lu  ~{B@TFI=~}        
mailto:ylu@mail.utexas.edu/
http://www.utexas.edu/~lyl/