Followup on I18N comments

Sandra Martin O'Donnell (odonnell@osf.org)
Tue, 29 Nov 94 12:01:18 EST

I received a number of replies to my internationalization-related
comments on the HTML spec. Since some replies made similar points,
I'm sending a consolidated follow-up message that covers most
messages I received. I've replied separately to Larry Masinter
and also will send a separate reply to Dave Raggett. In what
follows, I've identified who said what.

Several people noted that the spec is supposed to document
existing practice and that's why ISO 8859-1 (Latin-1) is
listed as the supported code set. Dan Connolly then noted
that my confusion indicates the spec is not clear on that
point and recommended clarifying this. I think that's a
good suggestion. However, I believe there still are ways
to write the spec such that it describes existing practice
while leaving the door open for future enhancements. Dan
says much the same thing, though I think he believes it is
enough to add some text saying that things will be enhanced
in the future. I agree with adding that text, but I also think
some existing text needs to change a little to make those
future changes possible.

On to some specific comments.

Terry Allen writes:
> Because there
> is no agreement on the string names for code sets (ISO 8859-1
> may be called any of
>
> ISO8859-1
> iso88591
> Latin-1
> 8859-1
> ISO-8859-1
>
> or something else on individual systems), OSF created a registry
>

This was unnecessary; ISO Latin 1 has a Formal Public Identifier:
ISO 8879:1986//ENTITIES Added Latin 1//EN

There are many more code sets than just ISO 8859-1. There also
are many "standard" names for that one code set. The problem is
they're all different. The OSF registry was designed to create
a consistent architecture for identifying *any* code set.
Although HTML is only concerned with Latin-1 now, it presumably
will expand to other sets in the future, and they don't all
have this Formal Public Identifier.
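To show the flavor of the naming problem the registry addressed,
here is a small sketch (modern Python, purely illustrative and
obviously anachronistic for this discussion) of how many spellings
resolve to the same set:

    import codecs

    # Several spellings, one code set: each alias resolves to the
    # same canonical name in the codec registry.
    for name in ("ISO8859-1", "latin-1", "8859", "iso-8859-1", "l1"):
        print(name, "->", codecs.lookup(name).name)   # all: iso8859-1

Without some canonical mapping, every system invents its own
spellings.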

Dan Connolly writes:
>But the way the spec is written makes it difficult or impossible
>to support anything other than Latin-1. That's because you've
>allowed numeric character values to be used for the Latin-1
>characters. . . .

Numeric character references are an SGML mechanism, not something the
HTML community made up. SGML numeric character references are _always_
interpreted in the context of _some_ SGML declaration which specifies
the document character set. In HTML 2.0, all documents share the same,
implicit SGML declaration, which specifies ISO Latin 1.

Hence &#224; only indicates a-grave as long as the SGML declaration says
that the document character set is ISO 8859-1.

I believe the consensus of the working group is that we should reserve
the "charset" parameter of the text/html MIME media type for future
use. We intend to specify ways to use other character sets and encodings
in HTML documents, once we have a suitable base of experience built up.

This future use might be, for example:

Content-Type: text/html; charset="UTF-8"

which would cause the user agent to assume a different SGML declaration
from the HTML 2.0 SGML declaration.

It is evident from your confusion that the current document doesn't
make this business clear. . .

As noted, I agree with your suggestion for clarifying this in
the spec. Thanks also for the info that the numeric references
come from SGML. Even though, as an I18N person, I don't like tying
things to one specific code set, at least you allow future expansion
if you say that a "charset" parameter may be added in the future,
and that in its absence, all text is assumed to be ISO 8859-1.
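To make that dependence concrete, here is a minimal sketch (modern
Python, purely illustrative) of why code position 229 only names a
character relative to a declared set:

    # The value 0xE5 (decimal 229) is a different character in
    # different 8-bit code sets, so &#229; only means a-ring
    # when the document character set is Latin-1.
    b = bytes([0xE5])
    print(b.decode("iso8859-1"))   # a-ring (Latin-1)
    print(b.decode("iso8859-7"))   # Greek small epsilon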

. . .
>What to do about this? There are three options:
>
>1. Do nothing. This means HTML will only support Latin-1.
>That may be good enough for your community of users now,
>but it is not if you want more of the world's users to be
>able to mark up documents. If the spec remains as it is, and
>you later want to add support for more of the world, HTML
>will almost certainly have to change in some probably
>incompatible way.

For the 2.0 RFC, this is what I expect we will do. Well... we should
make it clear that there will be (upward compatible) changes in the
area of character sets and encodings, but we will not actually specify
any mechanisms.

I agree with adding info that changes will occur in the
future, but I also think some existing text needs to change
to make sure those future changes can be upward compatible.
A prime example of this is Section 3.14 (Special Characters),
which currently says things like "The characters between the
tags represent text encoded according to ISO 8859-1..." Text
like this ties HTML to ISO 8859-1 only. If you reword it
to say that HTML currently assumes text is encoded in ISO
8859-1, that leaves the door open to support more code sets
in the future.

Sections 2.7, 3.2, 3.4.5, and 3.14.3 are among other sections
that similarly tie HTML to Latin-1. If they are reworded, it
will be easier to make upward compatible changes in the future.

>2. Use the universal code set ISO 10646 (basically the same
>as Unicode) for numeric character values.

A distinct possibility for HTML 2.1 and beyond...

>If I'm writing in Japanese, however, and can only refer to
>characters by their numeric values, the source is
>incomprehensible. It would look something like:
>
> /* random values for example only */
> <P>6e206e437934141
> <P>973387b4ff419932fff8</P>

I would expect to use the UTF-8 encoding of Unicode characters. Yes,
Japanese text would be incomprehensible to ASCII-based text viewers. I
don't see this as a problem.

I think we're talking about two different things here. I
was assuming that one *had* to use either numeric values
or entity names for characters beyond the basic "ASCII"
set. However, entity names don't exist for Japanese characters.
That therefore would leave numeric values only.

The problem is that I apparently made an incorrect assumption.
You say you expect users to be able to use actual characters,
so the source won't be full of numeric values only.

BTW, you might want to check with some Japanese users before
deciding that the UTF-8 encoding of Unicode characters is the
right solution for Japanese. Common encodings in Japan are
Shift-JIS and EUC, and there is a *far* greater installed
base of users of these encodings than there is for UTF-8.
Therefore, users will probably want something that works
with their existing software rather than an encoding that
has been dictated by a group that has few Asian representatives.
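The practical difference is easy to see; a small sketch (modern
Python, for illustration only) of how a single ideograph comes out
under each encoding:

    # The same Japanese character yields different byte sequences
    # under each common encoding, so text is unreadable unless the
    # charset label matches the software that produced it.
    ji = "\u65e5"   # the ideograph for "day/sun"
    for cs in ("shift-jis", "euc-jp", "utf-8"):
        print(cs, ji.encode(cs).hex())
    # shift-jis: 93fa   euc-jp: c6fc   utf-8: e697a5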

In a separate message, Dan Connolly writes:
>Section 2.6
>What units do values for attributes like MAXLENGTH and SIZE
>use? Are they numbers of bytes? The spec needs to provide that
>information. Actually, I suspect you currently assume these
>attributes are for numbers of characters, but this is incorrect
>because characters are variable (they can consume varying numbers
>of bytes), while bytes are static.
>
>Section 3.3
>The description of an SGML declaration says that names are a
>maximum of 72 characters, but that should be 72 bytes. As noted,
>characters are variable; bytes aren't.

I don't understand the conjecture "characters are variable, while
bytes are static."

Perhaps you mean, e.g., that the byte-length of the UTF-8
encoding of a string doesn't vary linearly with the number of
characters in the string. That doesn't make it any less precise
to specify lengths in characters.

No, that isn't what I mean. In code sets like Latin-1, each
character consumes one byte. In such code sets, for all practical
purposes, character == byte. This is not the case in many other
code sets or encoding methods, however. Japanese EUC is an
encoding method that allows up to four code sets to be combined
in a single document. This is a common implementation:

    Code    Bytes per
    Set     Character    Contains
    ----    ---------    --------
    CS0         1        ASCII
    CS1         2        JIS 208 (Japanese ideographs)
    CS2         2        JIS 201 (Japanese katakana)
    CS3         3        JIS 212 (more Japanese ideographs)

So some characters within a Japanese EUC data stream may
consume one byte, others consume two, and still others may
consume three. If I have a string of 72 bytes, it may contain
72 characters or it may contain 36 or 50 or something else.
Knowing the number of bytes tells me nothing about the number
of characters.
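A small sketch (modern Python, just to make the arithmetic
concrete):

    # Three ASCII letters plus three Japanese ideographs:
    text = "ABC" + "\u65e5\u672c\u8a9e"
    print(len(text))                     # 6 characters
    print(len(text.encode("euc-jp")))    # 9 bytes (3*1 + 3*2)

The byte count and the character count are simply different
quantities.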

UTF-8 is another example of an encoding that allows characters
to have varying numbers of bytes. Characters may consume anywhere
from one to three bytes given the repertoire of characters
currently assigned in ISO 10646/Unicode. However, there is
a lot of space in 10646 to which characters are not now
assigned. If they were all filled in (admittedly, a *very*
remote possibility), a single UTF-8 character could be up
to six bytes long.
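For reference, a sketch of the length rule as UTF-8 was originally
defined (the five- and six-byte forms cover the space to which no
characters are now assigned):

    # Bytes needed for one code point under the original UTF-8
    # definition, which allowed forms up to six bytes long.
    def utf8_len(cp):
        if cp < 0x80:        return 1   # ASCII
        if cp < 0x800:       return 2
        if cp < 0x10000:     return 3   # all currently assigned characters
        if cp < 0x200000:    return 4
        if cp < 0x4000000:   return 5
        return 6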

Does this help clarify why I say characters are variable
while bytes are static? And why the spec should in some
places refer to bytes or display columns rather than
characters?

. . .
The link is the assumed/missing/controversial "charset" parameter
which specifies how you take a MIME body of type text/html, that is, a
sequence of bytes, and translate it into an SGML entity, that is, a
sequence of characters.

In HTML 2.0, the charset parameter is (implicitly) "iso-latin-1" which
has a well-defined meaning in both the MIME and SGML camps.

The "HTML and MIME" and/or "HTML and SGML" sections should make this
clear, I suppose.

Given that you've decided HTML 2.0 is tied to Latin-1,
it is the case that bytes will equal characters. However,
I hope I've explained now why this assumption does not
work for other encodings. It's not a whole lot of work
to use the correct units in the spec now and avoid
incompatibilities in the future.

If I had my druthers, though, we should cite the MIME and SGML specs as
normative references, provide the DTD and the MIME Content-Type
registration info, and be done with it. These terms are defined quite
nicely in the respective documents. It's really painful to reproduce the
SGML specification and the MIME specification in this HTML document.

I agree there's no point in reproducing large documents within
your own spec. However, many existing specs use terminology
incorrectly because they are not aware of I18N. If you can
fairly painlessly get this right in your spec, is there any
reason not to?

Lee Quin writes:
Entity names are useful because
* they are humanly decipherable, e.g. I know what &egrave; means
* they are mnemonic for people who use them only rarely
* they pass through mail gateways and other 7-bit environments.

Only a small set of entity names would be humanly decipherable --
at least to most users. I assume most or all of you working on
this spec are familiar with languages that use the Latin (aka
Roman) script, so something like &egrave; does have meaning to you.
Many of you probably also know Greek letters because they are so
commonly used in mathematics. So a name like &epsilon; also would
have meaning. But how about, say, &jeem; or &zayin;? Do you know
what they mean? What about Asian ideographs? It's very common for
there to be multiple ideographs that share the same phonetic
pronunciation, so an entity name based on pronunciation is not
enough to identify a character uniquely.

My point is to beware of assumptions that entity names are
humanly decipherable and mnemonic just because the very small
set of names HTML currently supports all make sense to you.

It's true that such names can pass through 7-bit environments.
In such a case, then, do you assume that someone would use the
entity names for transport only (e.g., converting "real"
characters to entity names just before sending email, and then
converting entity names back to real characters when the document
reaches the presumably-8-bit-enabled destination)?
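If so, the conversion itself is mechanical; a hypothetical sketch
(modern Python; the function names are mine, not anything from the
spec):

    import html

    # Escape non-ASCII characters as numeric character references
    # so text survives a 7-bit channel; restore them afterwards.
    def to_7bit(s):
        return "".join(c if ord(c) < 128 else "&#%d;" % ord(c)
                       for c in s)

    def from_7bit(s):
        return html.unescape(s)   # resolves &#NNN; and named entities

    msg = "voil\u00e0"
    assert from_7bit(to_7bit(msg)) == msg   # round-trips cleanly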

Probably the thing to do for HTML is to allow a CHARSET parameter in the
HTTP protocol (MIME already allows this to some extent) and to add a
Script and Language attribute to every element, so that I can put a
Hebrew quote in a Greek document, and also specify which script to use to
display Vietnamese text -- the same Unicode characters can be displayed
differently depending on the prevailing Script.

This is very ambitious. I actually would propose something
simpler for the next rev of HTML. I suggest allowing a
CHARSET parameter on a per-document basis at first, and
over time think about expanding that support to allow
code set changes within a document.

Peter Flynn writes:
> COMMENTS ON HTML SPECIFICATION -- 2.0

Thanks for the detail on 10646...I think we've all known that this is
something we have to address, but it's not a candidate for HTML 2.0.

The neatest way to go for later versions would seem to be to do very
much what the TEI (and other DTDs?) do: have a LANG attribute which
specifies the language using some agreed coding system (ISO 3166?)
so that individual elements can be coded this way <i lang=250>comme
&ccedil;a</i> (or indeed the whole <body>).

ISO 3166 provides abbreviations for names of countries.
ISO 639 provides abbreviations for names of languages.
You actually would need both to identify a given language.
For example, if I say "en" (the abbrev for English), do I
mean American English or British English or Australian
English, or...? If I say "CA" (the abbrev for Canada), do
I mean Canadian English or Canadian French? Neither
individual part provides enough information.
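A hypothetical sketch (modern Python; the combined-tag syntax is my
assumption, modeled on common locale names, not anything in the
spec) of how pairing the two standards removes the ambiguity:

    # "en" alone cannot distinguish the English variants, and "CA"
    # alone cannot distinguish Canada's languages; together they can.
    for tag in ("en-US", "en-GB", "fr-CA", "fr-FR"):
        lang, country = tag.split("-")
        print("language=%s (ISO 639), country=%s (ISO 3166)"
              % (lang, country))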

. . .
> I have questions about several of the element/tag types.
> For the ADDRESS tag, are there assumptions about the format
> of an address? For example, would HTML assume that an address
> is of this form?

No, not at all. <address> has no formatting connotations of any kind (in
fact we get flamed for not making it sensitive to linebreaks!)

Thanks for the info. This is good.

. . .
> For the BLOCKQUOTE tag, do you make assumptions about the
> appearance of quoted material? For example, is it permissible
> to use any of
>
> "quote"
> ,,quote''
> <<quote>>
>
> or some other appearance? Or do you assume "quote" is the
> only way quotes are formatted?

Either of the first two is up to the user. The third would be
illegal in HTML (but &lt;&lt;quote&gt;&gt; would be OK), but
<blockquote> just marks the content as a quotation to be indented:
it does not automate quotation marks in any browser I have seen.

Oops, the last example shows the limitations of using only
"ASCII" characters so that things can get through email.
I don't really mean for those to be two less-than and two
greater-than signs. I meant for them to be left and right
guillemets, as are used around quotes in French text. I
just used the closest equivalent of guillemets that I could.

. . .
> Can any letter be used in a name? I'm sure the ASCII letters
> A-Z and a-z are permissible, but what about Latin-1 letters
> like a-acute and C-circumflex? Are they okay? How about other
> letters like a-ogonek and S-caron (both in Latin-2)?

No, the SGML Declaration for HTML describes the allowed characters.

Would it be difficult to list the allowed characters in the
HTML spec? I think it would be useful.

. . .
> Section 3.6.5
> Please be aware that the "typical" renderings the spec lists
> for various levels of headings is very American- and European-
> centric.

Yes. These are the countries which have contributed most to the
discussion so far, so they are the ones most represented. I for one
would be very happy for someone with a good background in
non-Latin-alphabet languages to start looking at solutions.

You seem to be assuming that this is a well-advanced project with
limitless time and resources :-)

Good heavens, I don't know how I gave you that impression. :-)
One always hears legends about projects with limitless time
and resources, but I've never worked on one. Instead, there's
always way too much to do and not enough people to do it.

I understand that the HTML spec cannot be perfectly
internationalized (or probably perfectly anything-ed) in
2.0. However, if there are sections that can be worded to
make it easier to expand support in the future, it seems
sensible to put in such wording.

. . . Most of the N.Americans who have
taken over the project don't really appear much interested in the use of
the Web for non-Latin-1 work, alas.

This is common. Most people concentrate on solving their
own problems first. Many are not even aware of the
requirements of other languages and cultures. The problem
is that in solving their own problems, developers often
write specs or code that are needlessly limited. I'm trying
to help make you aware of some I18N needs so the HTML spec
and W3 can more easily grow to support international needs.

. . .
> Section 5.1.3
> Why do you propose adding the no-break space and soft hyphen
> as additional special characters? What purpose will they serve?
> My initial reaction is to recommend against them, but I'd like
> to hear how they are intended to be used.

The no-break space prevents browsers from breaking a line at that
point. The soft hyphen permits hyphenation of a word at that point,
but the soft hyphen vanishes if it is not needed.

The problem is that these characters are not available in
all code sets. It's one thing to require support for characters
like greater-than or less-than; the graphic characters in ASCII
are available in nearly every code set and encoding method.
However, if HTML mandates support for the no-break space and
soft hyphen, it may be creating a problem in the future for
those who use code sets that don't have these characters. I
recommend making support for these characters optional, or
perhaps adding something that indicates what their purpose
would be if they existed in the code set being used.
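A minimal sketch (modern Python, purely illustrative) of the
availability problem:

    # The no-break space (U+00A0) and soft hyphen (U+00AD) exist
    # in Latin-1 but have no code points in plain ASCII.
    for cs in ("iso8859-1", "ascii"):
        try:
            "\u00a0\u00ad".encode(cs)
            print(cs, "has both characters")
        except UnicodeEncodeError:
            print(cs, "lacks them")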

Bert Bos writes:
. . .
Until we can confidently specify Unicode as the character set for
HTML, we'll have to support as many entities as possible. I'd say we
should simply include the whole ISO set. Maybe HTML 4.0 or 5.0 can
drop all entities, except for group (3).

Ummm, I'm not aware of entity names other than those for
ISO 8859-1 (Latin-1). What do you mean when you recommend
including "the whole ISO set"? Do you know of some other
names? If so, I'd like to learn more about them.

Also, I would caution you about assuming that the future
solution is specifying Unicode as the code set for HTML.
I do think it's the right solution for numeric values,
but most people have no need for a universal code set,
and will continue using the same code sets they are using
today. I think it's better to design a mechanism for
supporting multiple code sets than to force everyone to
use one. (I will freely acknowledge that some other i18n
people might argue with me on this point.)

Thank you for all your comments. I hope this has helped
clarify some points.

-- Sandra

---------------------------------------------------------------------
Sandra Martin O'Donnell       email: odonnell@osf.org
Open Software Foundation      phone: +1 (617) 621-8707
11 Cambridge Center           fax:   +1 (617) 225-2782
Cambridge, MA 02142 USA
---------------------------------------------------------------------