Language Markup

Albert Lunde
Fri, 28 Jul 95 21:51:51 EDT

> | 7. Internationalization.
> |
> | Currently Netscape and Spyglass Mosaic will accept different
> | character encodings. They use a user-overridable heuristic to
> | guess the character set, but will work properly if they
> | see a charset= parameter on the MIME content-type.
> | No one in the room was doing anything to further any spec writing here,
> | and previous volunteers had gone quiet. Dan Connolly volunteered
> | to see to the writing of a spec for the charset parameter, by delegation
> | or action as need be, by December.
> |
> | Language tags were a different issue. They should not be lumped
> | together with "charset" or SUP/SUB as one would hold the other back.
> | Some work on language tags has been done and should be
> | disinterred.
> |
> | There is a problem with Unicode in that it uses the same code to
> | represent different glyphs depending on whether the language is Chinese
> | or Korean. EBT solves this problem by regarding it as a font
> | choice issue in a style sheet. No one else had a solution.
> I believe this is where you need to know what language you are in.
> So language attributes may well be needed. Glenn Adams, some
> mail to you bounced recently; are you still around?

This has been raised twice in the discussion of Unicode.

It seems that the <lang> tag and attribute proposed in the HTML3
drafts would be sufficient, except that we should be referring
to the same language identifiers as HTTP.

(There's a new RFC, let me see if I can find the references...
here they are:)

See the HTML3 draft in these sections:

The Body Element and Related Elements

Overview of Character-Level Elements:
which says:

"This is one of the ISO standard language abbreviations, e.g. "" for
the variation of English spoken in the United Kingdom. It can be used by
parsers to select language specific choices for quotation marks, ligatures
and hyphenation rules etc. The language attribute is composed from the two
letter language code from ISO 639, optionally followed by a period and a
two letter country code from ISO 3166."

Information Type Elements
which says:
"The <LANG> element is used to alter the language context when it is
inappropriate to do this with other character-level elements. New in 3.0."

I expect we should adopt these and the corresponding stuff from the DTD,
except that we want to use the definition of language tags from the HTTP spec:

= =
8.2 Language Tags

A language tag identifies a natural language spoken, written, or otherwise
conveyed by human beings for communication of information to other human
beings. Computer languages are explicitly excluded. The HTTP/1.0 protocol uses
language tags within the Accept-Language and Content-Language header fields.

The syntax and registry of HTTP language tags is the same as that defined by RFC
1766 [1]. In summary, a language tag is composed of 1 or more parts: A primary
language tag and a (possibly empty) series of subtags:

language-tag = primary-tag *( "-" subtag )
primary-tag = 1*8ALPHA
subtag = 1*8ALPHA

Whitespace is not allowed within the tag and all tags are to be treated as case
insensitive. The namespace of language tags is administered by the IANA.
Example tags include:

en, en-US, en-cockney, i-cherokee, x-pig-latin

where any two-letter primary-tag is an ISO 639 language abbreviation and any
two-letter initial subtag is an ISO 3166 country code.

Earlier versions of this document specified an incomplete language tag, where
values were limited to ISO 639 language abbreviations with an optional ISO 3166
country code appended after an underscore ("_") or slash ("/") character. This
format was abandoned in favor of the recently proposed standard for Internet
= =
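
To make the difference between the two notations concrete, here is a rough
sketch in Python of the grammar quoted above, plus a helper that rewrites an
HTML3-draft style value (ISO 639 code, optional period, ISO 3166 code) into
the hyphenated form. The function names are my own invention for
illustration and come from neither spec, and the sketch checks syntax only;
it does not consult the IANA registry or the ISO code lists.

```python
import re

# Grammar from the quoted excerpt:
#   language-tag = primary-tag *( "-" subtag )
#   primary-tag  = 1*8ALPHA
#   subtag       = 1*8ALPHA
_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")

# HTML3-draft attribute form: two-letter language code, optionally
# followed by a period and a two-letter country code.
_HTML3_LANG = re.compile(r"([A-Za-z]{2})(?:\.([A-Za-z]{2}))?")


def is_language_tag(tag):
    """Return True if tag is syntactically valid (hypothetical helper).

    fullmatch() rejects embedded whitespace, underscores and slashes,
    because the whole string has to fit the grammar.
    """
    return _LANGUAGE_TAG.fullmatch(tag) is not None


def same_language_tag(a, b):
    """Compare two valid tags case-insensitively (hypothetical helper)."""
    return is_language_tag(a) and is_language_tag(b) and a.lower() == b.lower()


def from_html3_lang(value):
    """Rewrite an HTML3-draft value with a hyphen (hypothetical helper).

    Returns None when the value does not have the HTML3-draft shape.
    """
    match = _HTML3_LANG.fullmatch(value)
    if match is None:
        return None
    language, country = match.groups()
    return language if country is None else language + "-" + country


if __name__ == "__main__":
    for tag in ["en", "en-US", "en-cockney", "i-cherokee", "x-pig-latin"]:
        print(tag, is_language_tag(tag))
    print(from_html3_lang("fr.ca"))
```

All of the example tags from the excerpt pass the syntax check, while a value
written with an underscore or slash does not, which is the practical reason to
settle on one notation for both HTML and HTTP.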