Re: Language Markup

lilley (lilley@afs.mcc.ac.uk)
Mon, 31 Jul 95 06:46:26 EDT

Terry Allen wrote:
>[no attributions remain, alas]>

> In draft-ietf-html-specv3-00.txt's DTD there is a LANG element referred
> to but not defined (not a formal SGML error, but surely an oversight):
>
> <!ENTITY % misc "Q | LANG | AU | DFN | PERSON | ACRONYM | ABBREV | INS | DEL">

> and a common attribute lang:
>
> lang CDATA "en.us" -- ISO language, country code --

> LANG as an element is Not a Good Idea,

I agree, because a particular piece of text may comprise several
paragraphs, headings, lists etc all in the same language. I would have
thought that opening and closing multiple lang elements would not only
be tedious but would also imply a sequence of language portions, all of
which happen to be the same language, rather than a single portion.

Would a lang attribute on RANGE or SPOT be appropriate here? We currently
have: [1]

<!ELEMENT RANGE - O EMPTY>
<!ATTLIST RANGE
id ID #IMPLIED -- for naming marked range --
class NAMES #IMPLIED -- for subclassing --
from IDREF #REQUIRED -- start of marked range --
until IDREF #REQUIRED -- end of marked range --
>

One could then mark any arbitrary chunk of a document as being in a
particular language. Further, since the relevant RANGE elements live in
the HEAD, they could be searched for: find me all documents with
passages in Breton, for example.

> I suggest then that a new common attribute LANG be added to all
> elements (is it appropriate for all the FORM elements?) as part of
> the move to 10646 as the document charset.

I think it is already an attribute of most elements. I agree it is more
consistent if it applies to all elements. Do you really mean *all*
elements, though? <HTML LANG="fr.ca"> ?

> | Information Type Elements
> | http://www.hpl.hp.co.uk/people/dsr/html/logical.html
> | which says:
> | "LANG
> | The <LANG> element is used to alter the language context when it is
> | inappropriate to do this with other character-level elements. New in 3.0. "
>
> This proposal should be assimilated to the various proposals tossed around
> for a PHRASE, SEM, etc. element. An element intended only to change
> the language, when there is a common attribute that can do the job,
> is unneeded, whereas the ability to mark up a phrase or word or letter
> is surely needed.

Yes.

> | I expect we should adopt these and the corresponding stuff from the DTD,
> | except that we want to use the definition of language:
> | http://www.w3.org/hypertext/WWW/Protocols/HTTP1.0/HTTP1.0-ID_34.html
>
> I think not for this purpose, as it allows:
>
> >en, en-US, en-cockney, i-cherokee, x-pig-latin

Is there any particular reason that the HTML and HTTP groups are
using different syntax for this?
>
> Now, for really good language markup, you need to be able to do this;
> the TEI-L folks get even more expansive (along the lines of
> en-cockney-1890s). Just the other day I remarked that I might
> want to claim I was writing in en.us.calif.

I think it would be valuable to allow this sort of markup, provided it is
hierarchical. Providing rendering hints to browsers is just one use. Being
able to search a document on the language used is another.
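
As a rough sketch of the search use, assuming the dot-separated form
used in this thread (en.gb, fr.ca) and that a search for "fr" should also
find "fr.ca": the names LangCollector and uses_language are made up
for illustration, and Python's stock HTML parser stands in for a real
SGML one.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect the value of every lang attribute seen in the markup."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == "lang" and value:
                self.langs.append(value.lower())


def uses_language(markup, wanted):
    """True if any lang value equals `wanted` or is a refinement of it."""
    collector = LangCollector()
    collector.feed(markup)
    collector.close()
    wanted = wanted.lower()
    # Match whole components only, so "en.g" does not match "en.gb".
    return any(
        lang == wanted or lang.startswith(wanted + ".")
        for lang in collector.langs
    )
```

Matching on whole components is the point of keeping the value
hierarchical: a coarse query still finds finely marked-up text.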

If someone happens to have the information handy to mark up the language
used in detail:

<p>Then she said <q lang="en.gb.manchester.openshaw.working-class.1990s">
No yer mong, give er back ers daldie</q> which, I believe, means
<q lang="en.gb.queens">No you fool, return her comforter</q></p>

I see no reason why that information should be omitted or truncated down to
en.gb in both cases. A rendering engine is surely capable of parsing as far
along the lang attribute as it needs or is able, ignoring the rest.
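
A minimal sketch of what "parsing as far along as it is able" could
look like, assuming dot-separated components as in the example above.
The function name and the SUPPORTED set are hypothetical, not taken from
any browser or draft.

```python
def best_supported_prefix(lang, supported):
    """Return the longest dot-separated prefix of `lang` in `supported`.

    Returns None when not even the first component is recognised.
    """
    parts = lang.lower().split(".")
    # Try the most specific form first, then drop trailing components.
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        if candidate in supported:
            return candidate
    return None


# Hypothetical set of languages a renderer has fonts or rules for.
SUPPORTED = {"en", "en.gb", "en.us", "fr", "fr.ca"}

print(best_supported_prefix(
    "en.gb.manchester.openshaw.working-class.1990s", SUPPORTED))  # en.gb
print(best_supported_prefix("en.gb.queens", SUPPORTED))           # en.gb
print(best_supported_prefix("br", SUPPORTED))                     # None
```

The author's extra detail costs the renderer nothing; it is simply
ignored past the point the renderer understands.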

> However, here the need is not to distinguish dialects but to distinguish
> major language divisions so as to know what font to use for one of the
> "unified" characters. So the value could well be restricted to that cited
> above, s.v. LANG attribute.

Disagree. *A* need is to select the right font. Other needs are also
served by the lang attribute, and once the Web throws off its "parts of
Western Europe" bias and becomes a World Wide Web, these will become
more important. I do not see any benefit to restricting the language
information, provided it is structured such that the required glyph
usage can also be inferred.

> [language from ...HTTP1.0-ID_34.html deleted, leaving aside the issue
> of what good en-cockney is in *HTTP* ...]

200 Awright me old sparrer?
403 No chance, squire!

;-)

[1] http://www.hpl.hp.co.uk/people/dsr/html/html3.dtd

-- 
Chris Lilley, Technical Author
+-------------------------------------------------------------------+
|       Manchester and North HPC Training & Education Centre        |
+-------------------------------------------------------------------+
| Computer Graphics Unit,             Email: Chris.Lilley@mcc.ac.uk |
| Manchester Computing Centre,        Voice: +44 161 275 6045       |
| Oxford Road, Manchester, UK.          Fax: +44 161 275 6040       |
| M13 9PL                            BioMOO: ChrisL                 |
|     URI: http://info.mcc.ac.uk/CGU/staff/lilley/lilley.html       | 
+-------------------------------------------------------------------+