Re: Language Markup

Terry Allen (terry@ora.com)
Fri, 28 Jul 95 23:20:17 EDT

Thank you, Albert! This is a great help.

| > | 7. Internationalization.
...
| > | Language tags were a different issue. They should not be lumped
| > | together with "charset" or SUP/SUB as one would hold the other back.
| > | Some work on language tags has been done and should be
| > | disinterred.
| > |
| > | There is a problem with Unicode that it uses the same code to
| > | represent different glyphs as a function of the language being Chinese
| > | or Korean. EBT solves this problem by regarding it as a font
| > | choice issue in a style sheet. No one else had a solution.
| >
| > I believe this is where you need to know what language you are in.
| > So language attributes may well be needed.
|
| This has been raised twice in the discussion of Unicode.
|
| It seems that the <lang> tag and attribute proposed in the HTML3
| drafts would be sufficient except that we should be referring
| to the same language identifiers as HTTP.

In draft-ietf-html-specv3-00.txt's DTD there is a LANG element referred
to but not defined (not a formal SGML error, but surely an oversight):

<!ENTITY % misc "Q | LANG | AU | DFN | PERSON | ACRONYM | ABBREV | INS | DEL">

and a common attribute lang:

lang CDATA "en.us" -- ISO language, country code --

LANG as an element is Not a Good Idea, but lang as an attribute is just
what we want. Because of the restrictions on content described below,
the attribute would be better declared as a NAME.

I suggest then that a new common attribute LANG be added to all
elements (is it appropriate for all the FORM elements?) as part of
the move to 10646 as the document charset.

| (There's a new RFC, let me see if I can find the references...
| here they are:)
|
| See the HTML3 draft in these sections:

(I always use the draft as found at ds.internic.net.)

| The Body Element and Related Elements
| http://www.hpl.hp.co.uk/people/dsr/html/docbody.html
|
| Overview of Character-Level Elements:
| http://www.hpl.hp.co.uk/people/dsr/html/text.html
| which says:
|
| "LANG
| This is one of the ISO standard language abbreviations, e.g. "en.uk" for
| the variation of English spoken in the United Kingdom. It can be used by
| parsers to select language specific choices for quotation marks, ligatures
| and hyphenation rules etc. The language attribute is composed from the two
| letter language code from ISO 639, optionally followed by a period and a
| two letter country code from ISO 3166. "

That's fine.
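
To make the quoted form concrete, here is a rough sketch of a check for
it. The function name is made up for illustration, and the check covers
only the shape of the value (two letters, optionally a period and two
more letters), not whether the codes actually appear in ISO 639 or
ISO 3166. Matching letters in either case is my assumption; the quoted
text does not say.

```python
import re

# Shape only: two letters, optionally a period and two more letters.
# Assumption: upper and lower case are both accepted.
_LANG_SHAPE = re.compile(r"[A-Za-z]{2}(\.[A-Za-z]{2})?")


def has_html3_lang_shape(value):
    """Return True if value looks like 'en' or 'en.uk' (hypothetical helper)."""
    return _LANG_SHAPE.fullmatch(value) is not None
```

Values such as "en" and "en.uk" pass; "en-cockney" and "en.us.calif" do
not.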

| Information Type Elements
| http://www.hpl.hp.co.uk/people/dsr/html/logical.html
| which says:
| "LANG
| The <LANG> element is used to alter the language context when it is
| inappropriate to do this with other character-level elements. New in 3.0. "

This proposal should be assimilated to the various proposals tossed around
for a PHRASE, SEM, etc. element. An element intended only to change
the language, when there is a common attribute that can do the job,
is unneeded, whereas the ability to mark up a phrase or word or letter
is surely needed. However, I think no element need be added solely
for the move to 10646; it can follow later on.

| I expect we should adopt these and the corresponding stuff from the DTD,
| except that we want to use the definition of language:
| http://www.w3.org/hypertext/WWW/Protocols/HTTP1.0/HTTP1.0-ID_34.html

I think not for this purpose, as it allows:

>en, en-US, en-cockney, i-cherokee, x-pig-latin

Now, for really good language markup, you need to be able to do this;
the TEI-L folks get even more expansive (along the lines of
en-cockney-1890s). Just the other day I remarked that I might
want to claim I was writing in en.us.calif.

However, here the need is not to distinguish dialects but to distinguish
major language divisions so as to know what font to use for one of the
"unified" characters. So the value could well be restricted to that cited
above, s.v. LANG attribute.
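
A rough sketch of why the major division is enough for this purpose:
only the first part of the value is consulted when picking a font for a
"unified" character, so anything after it is ignored. The font names and
function names here are placeholders I made up, not anything from the
drafts.

```python
import re

# Hypothetical font names, for illustration only.
FONT_FOR_LANGUAGE = {
    "zh": "ExampleChineseFont",
    "ko": "ExampleKoreanFont",
}


def primary_language(lang):
    """Return the part before the first '.' or '-', lowercased."""
    return re.split(r"[.-]", lang, maxsplit=1)[0].lower()


def font_for(lang, default="ExampleDefaultFont"):
    """Pick a font from the major language division alone."""
    return FONT_FOR_LANGUAGE.get(primary_language(lang), default)
```

With this, "zh.tw" and "zh" select the same font, and a dialect suffix
as in "en-cockney" makes no difference to the choice.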

Would someone (Albert?) like to construct a counterargument that
en-cockney and the like are desirable in HTML markup, in the absence
of any standard or common list of values following the hyphen?

Note that a later change from NAME to CDATA would be backward compatible,
whereas the reverse would not.
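
The compatibility point can be sketched as follows. This assumes the
SGML reference concrete syntax, in which a NAME is a letter followed by
letters, digits, periods, or hyphens, with a length limit (NAMELEN) of 8
that a DTD's SGML declaration may raise. The helper name is made up.
Every string that passes is also acceptable as CDATA, but not every
CDATA string passes.

```python
import re

# Reference concrete syntax: a letter, then letters, digits, '.' or '-'.
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_sgml_name(value, namelen=8):
    """Return True if value is a NAME token no longer than namelen.

    Assumption: namelen defaults to the reference quantity set's 8.
    """
    return len(value) <= namelen and _NAME.fullmatch(value) is not None
```

So "en.us" is a NAME, while "en us" or an empty value would be legal
only as CDATA. Note that NAME by itself does not rule out a value like
en-US; the restriction to the ISO form would still have to be stated in
prose.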

[language from ...HTTP1.0-ID_34.html deleted, leaving aside the issue
of what good en-cockney is in *HTTP* ...]

Regards,

-- 
Terry Allen  (terry@ora.com)   O'Reilly & Associates, Inc.
Editor, Digital Media Group    101 Morris St.
			       Sebastopol, Calif., 95472

A Davenport Group sponsor. For information on the Davenport Group see ftp://ftp.ora.com/pub/davenport/README.html or http://www.ora.com/davenport/README.html