HTML 2.0 comments (First of two)

Sandra Martin O'Donnell (odonnell@osf.org)
Wed, 23 Nov 94 13:57:30 EST

I recently had a chance to read the HTML 2.0 specification, and
have some serious concerns about its design with respect to
internationalization (I18N) issues. I handle I18N at OSF, and
have some suggestions for ways to change the HTML spec to
accommodate international requirements better. My suggestions
fall into two categories -- an overall design issue covered in
this message, and a separate set of comments on individual
sections in the spec (emailed separately). Please let me know
if you have comments or questions.

Best regards,
-- Sandra

---------------------------------------------------------------------
Sandra Martin O'Donnell      email: odonnell@osf.org
Open Software Foundation     phone: +1 (617) 621-8707
11 Cambridge Center          fax:   +1 (617) 225-2782
Cambridge, MA 02142 USA
---------------------------------------------------------------------

COMMENTS ON HTML SPECIFICATION -- 2.0
(First of two)

After reading the HTML spec, I have one overall concern that
affects many sections. Currently, the code set ISO 8859-1
(Latin-1) is listed as the one that HTML supports. The spec
permits documents to include any Latin-1 character, and lists
the entity names and encoded values for each character in the
Latin-1 repertoire.

I assume the working group did this to address international
needs as you saw them. However, the world's languages include
far more than English and Western European languages. Under this spec, I
don't see any way for HTML to handle, say, Japanese, Chinese,
Arabic, Polish, Hungarian, Russian, or any other non-Western
European language. As WWW grows, and HTML's use as the document
mark-up format increases, it becomes ever more important to be
able to handle more than Latin-1.

But the way the spec is written makes it difficult or impossible
to support anything other than Latin-1. That's because you've
allowed numeric character values to be used for the Latin-1
characters. The problem is that many code sets use the same
numeric values for their own characters, but since HTML says
the values are Latin-1 and only Latin-1, these other code sets
can't be supported.

Here's the situation. In the beginning, there was ASCII, and
it used 128 of the possible 256 code values available in an
eight-bit byte. When other code sets came along, most added
characters in the range 128-255. This is true of Latin-1,
Latin-2, and all the other sets in the ISO 8859 series. It's
also true of many Asian encoding methods. EUC, which is used
in Japan, Korea, China, and Taiwan, requires that all bytes
of all non-ASCII characters have the high bit set.
Therefore, all Asian characters in EUC consist of bytes that
are always in the range 128-255.

So Latin-1 and nearly every other code set or encoding method
except ASCII use values in the range 128-255. But if HTML
requires values in that range to be interpreted as Latin-1,
it is not possible to support anything other than Latin-1.
Here are some differing assignments for the decimal code value
225:

Code Set          Character at 225
--------          ----------------
ISO 8859-1        a-acute
ISO 8859-5        Cyrillic es
ISO 8859-6        Arabic feh
ISO 8859-7        Greek alpha
Macintosh Roman   middle dot (punctuation)
Microsoft Roman   German Eszett
Japanese EUC      second byte of katakana me

There are many other examples, but this is probably enough
to make the point. Numeric character values are not unique,
but the HTML spec requires that each value be interpreted
uniquely.
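The collision is easy to demonstrate directly. Here is a minimal
Python sketch; the codec names are Python's own, and "cp437" is
an assumption on my part, standing in for the PC-era "Microsoft
Roman" set:

```python
# The single byte value 225 (0xE1) decodes to a different
# character under each code set. Codec names are Python's;
# "cp437" is assumed here as the PC-era "Microsoft Roman" set.
byte = bytes([225])
for codec in ["iso8859-1", "iso8859-5", "iso8859-6",
              "iso8859-7", "mac-roman", "cp437"]:
    ch = byte.decode(codec)
    print(f"{codec:>10}  U+{ord(ch):04X}  {ch}")

# In Japanese EUC, 225 is only half a character: full-width
# katakana "me" is the two-byte sequence 0xA5 0xE1.
print(b"\xa5\xe1".decode("euc-jp"))
```

Running this prints seven different characters for what HTML
currently insists is always a-acute.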

What to do about this? There are three options:

1. Do nothing. This means HTML will only support Latin-1.
That may be good enough for your community of users now,
but it is not if you want more of the world's users to be
able to mark up documents. If the spec remains as it is, and
you later want to add support for more of the world, HTML
will almost certainly have to change in some probably
incompatible way.

2. Use the universal code set ISO 10646 (basically the same
as Unicode) for numeric character values. ISO 10646 contains
nearly all characters used in nearly all languages around the
world. HTML could designate that if a character is referred to
by a numeric value, that value is the one it would have in
ISO 10646. In this case, the characters listed earlier as
having the decimal value 225 (hex 0xe1) in smaller code sets
would all have unique values in ISO 10646. Here are the values
using hex notation:

Value in ISO 10646   Character
------------------   ---------
00e1                 a-acute
0441                 Cyrillic es
0641                 Arabic feh
03b1                 Greek alpha
00b7                 middle dot (punctuation)
00df                 German Eszett
ff92                 katakana me (entire character)

The advantage of ISO 10646 values is that they are unique and
cover virtually every character in use on computers today.
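That uniqueness can be checked on any system that implements the
10646 repertoire. Modern Python's chr() happens to index characters
by their ISO 10646/Unicode values, so it can illustrate the lookup:

```python
import unicodedata

# Each ISO 10646 value identifies exactly one character, no
# matter which smaller code set an author's system uses locally.
for value, name in [
    (0x00E1, "LATIN SMALL LETTER A WITH ACUTE"),
    (0x0441, "CYRILLIC SMALL LETTER ES"),
    (0x0641, "ARABIC LETTER FEH"),
    (0x03B1, "GREEK SMALL LETTER ALPHA"),
    (0x00B7, "MIDDLE DOT"),
    (0x00DF, "LATIN SMALL LETTER SHARP S"),   # German Eszett
    (0xFF92, "HALFWIDTH KATAKANA LETTER ME"),
]:
    assert unicodedata.name(chr(value)) == name
    print(f"{value:04x}  {chr(value)}  {name}")
```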

A potential disadvantage of these values is that very few systems
currently support 10646's full repertoire of characters, and
using the 10646 values in HTML might imply that such full support
is required. That would be an unrealistic requirement for the
foreseeable future.

Further, it may send a bad cultural message to use 10646
numeric values alone. Although there are entity names for
the Latin-1 characters, I'm not aware of such names for other
characters like those in Japanese, Arabic, Korean, and so on.
Assuming there are no entity names for these characters,
that means the only way to refer to them in an HTML document
would be via the 10646 numeric values. Even though HTML is
a mark-up language that is used to produce processed output,
much of the source is human-readable. For example,

<P>This is the text of the first paragraph.
<P>This is the text of the second paragraph.</P>

If I'm writing in Japanese, however, and can only refer to
characters by their numeric values, the source is
incomprehensible. It would look something like:

/* random values for example only */
<P>6e206e437934141
<P>973387b4ff419932fff8</P>

That's not very human-friendly.

3. Remove the ability to refer to characters by their
numeric values and instead add a tag that designates the
code set for the document. MIME uses this idea in the "Charset"
field of its mail headers, although the acceptable values for
this field are ill-defined. OSF has a more completely architected
solution in its Character and Code Set Registry. Because there
is no agreement on the string names for code sets (ISO 8859-1
may be called any of

ISO8859-1
iso88591
Latin-1
8859-1
ISO-8859-1

or something else on individual systems), OSF created a registry
that assigns a unique numeric value to all registered sets and
then lets individual sites map between their local string name
and our unique value. There currently are 134 code sets in the
OSF registry, and it is in use within OSF and some X/Open groups.
I can send a full description of the registry if you're
interested.
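The naming chaos, and what a registry buys you, can be seen in
miniature with Python's codec machinery, which plays the registry's
role by collapsing every alias to one canonical name:

```python
import codecs

# Several string names that all mean ISO 8859-1; a registry's job
# is to map each of them to a single canonical identifier.
for alias in ["ISO8859-1", "Latin-1", "ISO-8859-1", "latin1"]:
    print(f"{alias:>10} -> {codecs.lookup(alias).name}")
# every alias resolves to the same canonical name, iso8859-1
```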

The advantage to using a code set tag and avoiding the use
of numeric values is that it's a crucial first step in enabling
HTML to support many of the code sets that people already are
using all over the world. With this change, there would be
nothing in HTML to limit it to the small group of languages
Latin-1 supports. If a Japanese user in Tokyo wanted to send
a document to a colleague in Osaka, he/she could tag the
document as being in (say) Japanese EUC and the colleague
would have the software to display the text correctly. On
the other hand, I don't have the software to display Japanese
text properly, so if someone sent me a Japanese document, the
information in the tag would give me a way to determine its
encoding. I could then reject the document and/or send a note
back to the originator explaining my problem.
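The receiving side of that exchange amounts to only a few lines of
logic. A sketch, under the assumption that the tag value is a code
set name the local system can look up:

```python
import codecs

def receive(document_bytes, charset_tag):
    """Decode a tagged document if the local system supports its
    code set; otherwise report the tag so the recipient can reject
    the document or notify the sender."""
    try:
        codecs.lookup(charset_tag)
    except LookupError:
        return None, f"cannot display text tagged as {charset_tag}"
    return document_bytes.decode(charset_tag), None

text, problem = receive(b"\xa5\xe1", "euc-jp")     # katakana me
text, problem = receive(b"\xa5\xe1", "klingon-8")  # rejected
```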

So, those are the three main options to consider for HTML.
Actually, there's a fourth option that's a combination of
#2 and #3 above, and it's the one I recommend. I suggest
adding a tag for designating the code set (and offer the
OSF registered values; having done the work of setting up
the registry, I can tell you this is a job you want to avoid
if at all possible :-) ), while also allowing the option of
referring to characters by the ISO 10646 numeric values.

This solves the problem of frequently reused numeric values
being incorrectly treated as Latin-1 only. It provides an
unambiguous way to refer to all characters, if that ability
is desired. It also gives senders and recipients a simple
way to designate a basic attribute of their text -- the way
it's encoded -- much as HTML itself gives a simple way to
designate basic attributes like paragraph boundaries,
bolding, titles, etc.

I further recommend that HTML require support for ASCII
(the official standard designation is ISO 646 IRV:1991,
where IRV stands for International Reference Version).
This means ASCII support would be in Level 0 of the spec.
The level to which you would assign support for other code
sets is up to you, but I caution against automatically
putting Latin-1 into Level 0. Think about random international
users. Does it make sense to require that they support Latin-1?
Virtually everybody already handles ASCII, but there's little
need for (or support of) Latin-1 characters outside Europe,
Australia, and the Americas.
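The Level 0 requirement I'm proposing is trivially checkable; a
sketch of the conformance test:

```python
def meets_level_0(text):
    """Proposed Level 0 check: every character is plain ASCII
    (ISO 646 IRV), which virtually every system already handles."""
    return all(ord(ch) < 128 for ch in text)

print(meets_level_0("plain ASCII text"))  # True
print(meets_level_0("caf\u00e9"))         # False: richer set needed
```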

In addition to the overall issue regarding Latin-1, I'd also
like to raise a somewhat related, but less pervasive issue.
It's the use of entity names (agrave, ecirc, etc.) in place
of actual characters. These are much less of a problem than
using numeric values, because the entity name applies to an
abstract character and theoretically allows it to be encoded
in many different ways (ISO 8859-1, Microsoft Roman, HP ROMAN8,
Macintosh Roman, ISO 10646 and other code sets in current use,
for example, all assign aacute a different numeric value).
However, entity names are not a good long-term design choice
if you want HTML source (and output?) to remain even somewhat
human-readable.

As I read the spec, HTML allows the use of entity names in
the event that a local system doesn't support I/O or interpretation
of all the Latin-1 characters. I don't understand how the entity
names help in this situation -- if my system doesn't let me enter
an actual a-with-grave-accent, how does using the name "agrave"
make things better? And if my system doesn't know how to interpret
such a character correctly, it certainly won't know what to do
with an abstract name like "agrave", either.

Perhaps the intention is that as a user, I can receive documents
that may have accented characters in them, and then, if my system
doesn't support such characters, I can run a converter that
changes them to entity names. Thus, a converted string might
look something like this:

Les mots de la m&ecirc;me famille ont &eacute;t&eacute;
group&eacute;s. . .

This is not really human-readable any more, so I don't see
what value the entity names provide. Am I missing something?
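For what it's worth, such a converter is easy to write. A Python
sketch using the standard entity-name table (which covers the
Latin-1 repertoire and little else, which underscores how few of
the world's characters have names at all):

```python
import html.entities

def to_entity_names(text):
    """Replace each non-ASCII character with its SGML entity name,
    falling back to a numeric reference when no name exists."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            name = html.entities.codepoint2name.get(ord(ch))
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(out)

print(to_entity_names("Les mots de la même famille ont été groupés"))
# -> Les mots de la m&ecirc;me famille ont &eacute;t&eacute;
#    group&eacute;s
```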

Also, if you do intend HTML to have worldwide use, then source
text might become all (or nearly all) entity names. This is even
less human-readable. Also, since entity names do not exist for
the vast majority of the world's characters, HTML (or someone)
would have to create them. This is a huge job.

Based on my current understanding of entity names, I can't
see what value they provide, so I recommend removing them from
the spec. Please let me know if I've misunderstood their purpose.
(BTW, I can see the value of entity names for characters that
might be incorrectly interpreted -- like the ampersand, and
greater-than/less-than signs. I think it's fine to keep them
in the spec.)