Re: Is LANG appropriate for non-human languages?

Martin J Duerst (
Wed, 2 Aug 95 09:09:25 EDT

>I was thinking about the question of setting phonetic information in
>HTML. Phonetic information is rendered in print using IPA
>(International Phonetic Alphabet), which is a subset of Unicode/10646.
>Since most users are not equipped to handle these characters, an
>informal (but written) standard has emerged on Usenet which
>transcribes the IPA characters into 7-bit ASCII. This scheme is used
>on alt.usage.english and sci.lang (at least).
>My thoughts on rendering phonetics on a web page are:

> - If I have the capability of seeing IPA, I would like to do so
> even if the author, lacking the ability to enter the characters,
> used the ASCII transcription.

What is needed is a form of transcription, and not necessarily
the existing ASCII transcription. There existed numerous ways
to denote more commonly used characters, such as the accented
characters of European languages, yet HTML, in accordance with
SGML, has chosen specific ways to represent them: Numeric character
references (e.g. &#592; for the turned 'a') are available on any
browser that supports Unicode/ISO 10646. Character entities
might be defined in a future version of HTML, as they are currently
defined for Latin-1 characters in HTML 2.0, and for Math characters
in HTML 3.0.
Further support for any special notation is neither necessary nor,
I think, advisable, as it would inundate us with requests
for special notations from all kinds of other areas. Also, such a
notation would undesirably blur the distinction between
character encodings (outside HTML, mainly an HTTP matter) and
HTML itself. It would of course be a good idea to provide converters
from the old ASCII scheme to SGML/HTML notation.
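
As a rough illustration of such a converter, here is a minimal sketch
in Python. The mapping table is an assumption: it is a small,
illustrative subset of single-character correspondences, not the full
Usenet scheme, and the name ascii_ipa_to_ncr is hypothetical. Multi-
character sequences and diacritics are not handled.

```python
# Illustrative subset only (an assumption, not the complete ASCII scheme):
# ASCII stand-in -> ISO 10646 code point of the IPA character.
ASCII_TO_IPA = {
    "@": 0x0259,  # schwa
    "I": 0x026A,  # small capital I
    "S": 0x0283,  # esh
    "Z": 0x0292,  # ezh
    "N": 0x014B,  # eng
}


def ascii_ipa_to_ncr(text):
    """Replace each mapped ASCII stand-in with a decimal numeric
    character reference; leave every other character unchanged."""
    out = []
    for ch in text:
        code_point = ASCII_TO_IPA.get(ch)
        if code_point is None:
            out.append(ch)
        else:
            out.append(f"&#{code_point};")
    return "".join(out)
```

For instance, ascii_ipa_to_ncr("fIS") yields "f&#618;&#643;", which
any browser with ISO 10646 support can then display.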

> - If I do not have the capability of seeing IPA, when presented
> with such, I would prefer to see the transcription rather than a
> mass of "unknown character" glyphs.

This is up to the browser, or even the underlying rendering framework.
It should not have anything to do with HTML (other than we need some
degree of Unicode/ISO 10646 support, which we already have more or
less in HTML 2.0).
For some thoughts on how to integrate transcription into a text
rendering framework, see my upcoming paper at the Unicode
conference in San Jose in September. [An electronic version is
available upon personal request.]
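
To make the idea of a browser-side fallback concrete, here is a
minimal sketch in Python. It is not taken from the paper mentioned
above. The reverse table is an illustrative subset, and both the
function name render_with_fallback and the `displayable` callback
(standing in for whatever font lookup a real renderer does) are
assumptions.

```python
# Illustrative subset only (an assumption): IPA character -> ASCII stand-in.
IPA_TO_ASCII = {
    "\u0259": "@",  # schwa
    "\u026a": "I",  # small capital I
    "\u0283": "S",  # esh
    "\u0292": "Z",  # ezh
    "\u014b": "N",  # eng
}


def render_with_fallback(text, displayable):
    """Keep characters the display can show; transcribe the rest.

    `displayable` is a callable taking one character and returning
    True if a glyph is available. Characters with neither a glyph nor
    a table entry become "?".
    """
    out = []
    for ch in text:
        if displayable(ch):
            out.append(ch)
        else:
            out.append(IPA_TO_ASCII.get(ch, "?"))
    return "".join(out)
```

The document itself always carries the ISO 10646 characters; only the
rendering differs from one display to the next.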

> - If my browser supports it, I would like to be able to select a
> phonetic transcription and hear an automatically generated
> rendering.

Again, this is purely a browser problem. HTML is not affected.

>Clearly these are not things that can be required of a browser, but as
>they are useful, it would be a shame to preclude a browser supporting

(2) and (3) are in no way precluded. For (1), the only thing necessary
is a change from one notation to another, which is required for the
average HTML document anyway.

>It would seem that the cleanest way to insert this ability in HTML is
>to register two new languages "i-phon-ipa" and "i-phon-ascii" and then

There is no need for "i-phon-ipa". Although IPA uses characters from
ASCII and other Latin areas besides the special IPA characters in
ISO 10646, the average word or sentence written in IPA will contain
quite a few characters or modifiers from the IPA range.
For presentation purposes, it may be advisable to set a specific font
on stretches of IPA text, but this should be done, in my opinion,
by subclassing as provided in HTML 3.0, and not by a Lang tag.
It would be along the same lines as using a cursive font with many
ligatures in a "sonnet" subclass.
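
The point about characters from the IPA range can be illustrated with
a minimal sketch in Python. It assumes the ISO 10646 blocks IPA
Extensions (U+0250 to U+02AF) and Spacing Modifier Letters (U+02B0 to
U+02FF) as "the IPA range"; the function name ipa_share is
hypothetical, and any threshold applied to its result would be a
heuristic.

```python
def ipa_share(text):
    """Return the fraction of non-whitespace characters that fall in
    U+0250..U+02FF (IPA Extensions plus Spacing Modifier Letters)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if 0x0250 <= ord(c) <= 0x02FF)
    return hits / len(chars)
```

A stretch of text with a clearly non-zero share is likely to be
phonetic notation, without any language tag saying so.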

Also, what you propose mixes up character encoding issues (outside
of HTML) and language. Not only is phonetic notation not a (natural)
language, just as a programming language is not a natural
language; much more importantly, phonetic information is usually
tied to a language. Using the language tag the way you propose
above, you preclude much more natural and obvious uses
such as

My name is pronounced
<lang lang=en-us><!-- some ISO 10646 IPA characters --></lang>
but the French would pronounce it
<lang lang=fr><!-- some ISO 10646 IPA characters --></lang>

>Reading rfc1766, it seems to be intended to cover only human
>languages, although it does make a distinction between script
>variants, as "az-arabic" and "az-cyrillic".

I do not quite see what this distinction could be useful for,
unless there is an inherent difference (e.g. vocabulary, grammar, ...)
between Azerbaijani when written with Arabic or Cyrillic characters.
It is certainly not needed when you code the characters in the script
you want them to be (using Unicode/ISO 10646 or something like
ISO 2022).

>Is this a reasonable use
>of this attribute? If not, is there a better way to do this in HTML?

See above. Regards, Martin.

---- Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: