This has been raised twice in the discussion of Unicode.
It seems that the <lang> tag and attribute proposed in the HTML3
drafts would be sufficient, except that we should be referring
to the same language identifiers as HTTP.
(There's a new RFC, let me see if I can find the references...
here they are:)
See the HTML3 draft in these sections:
The Body Element and Related Elements
http://www.hpl.hp.co.uk/people/dsr/html/docbody.html
Overview of Character-Level Elements:
http://www.hpl.hp.co.uk/people/dsr/html/text.html
which says:
"LANG
This is one of the ISO standard language abbreviations, e.g. "en.uk" for
the variation of English spoken in the United Kingdom. It can be used by
parsers to select language specific choices for quotation marks, ligatures
and hypenation rules etc. The language attribute is composed from the two
letter language code from ISO 639, optionally followed by a period and a
two letter country code from ISO 3166. "
Information Type Elements
http://www.hpl.hp.co.uk/people/dsr/html/logical.html
which says:
"LANG
The <LANG> element is used to alter the language context when it is
inappropriate to do this with other character-level elements. New in 3.0. "
I expect we should adopt these and the corresponding stuff from the DTD,
except that we want to use the HTTP definition of a language tag:
http://www.w3.org/hypertext/WWW/Protocols/HTTP1.0/HTTP1.0-ID_34.html
= =
8.2 Language Tags
A language tag identifies a natural language spoken, written, or otherwise
conveyed by human beings for communication of information to other human
beings. Computer languages are explicitly excluded. The HTTP/1.0 protocol uses
language tags within the Accept-Language and Content-Language header fields.
The syntax and registry of HTTP language tags is the same as that defined by RFC
1766 [1]. In summary, a language tag is composed of 1 or more parts: A primary
language tag and a (possibly empty) series of subtags:
    language-tag = primary-tag *( "-" subtag )
    primary-tag  = 1*8ALPHA
    subtag       = 1*8ALPHA
Whitespace is not allowed within the tag and all tags are to be treated as case
insensitive. The namespace of language tags is administered by the IANA.
Example tags include:
en, en-US, en-cockney, i-cherokee, x-pig-latin
where any two-letter primary-tag is an ISO 639 language abbreviation and any
two-letter initial subtag is an ISO 3166 country code.
Note
Earlier versions of this document specified an incomplete language tag, where
values were limited to ISO 639 language abbreviations with an optional ISO 3166
country code appended after an underscore ("_") or slash ("/") character. This
format was abandoned in favor of the recently proposed standard for Internet
protocols.
= =
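To make the grammar quoted above concrete, here is a minimal Python sketch of how a parser might check and compare such tags. The function names (parse_language_tag, same_language_tag, from_legacy_form) are my own invention for illustration and come from neither draft. The last helper assumes that the only difference between the older forms ("en.uk" in the HTML3 draft, or the underscore and slash forms mentioned in the Note) and the new one is the separator character.

```python
import re

# Grammar from the HTTP draft quoted above:
#   language-tag = primary-tag *( "-" subtag )
#   primary-tag  = 1*8ALPHA
#   subtag       = 1*8ALPHA
_TAG_RE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def parse_language_tag(tag):
    """Return (primary, subtags), lower-cased, or raise ValueError.

    Whitespace is not allowed inside a tag, so the whole string must
    match the grammar exactly.
    """
    if not _TAG_RE.fullmatch(tag):
        raise ValueError("not a valid language tag: %r" % (tag,))
    # Tags are case insensitive, so normalise to lower case.
    primary, *subtags = tag.lower().split("-")
    return primary, subtags


def same_language_tag(a, b):
    """Compare two tags case-insensitively."""
    return parse_language_tag(a) == parse_language_tag(b)


def from_legacy_form(value):
    """Hypothetical helper: rewrite "en.uk", "en_US" or "en/US" with hyphens.

    Assumption: only the separator differs between the old and new forms.
    """
    return re.sub(r"[._/]", "-", value)
```

With this, parse_language_tag("en-US") gives ("en", ["us"]), while "en_US" is rejected until it has been passed through from_legacy_form.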