HTML 2.0 comments (Second of two)

Sandra Martin O'Donnell (odonnell@osf.org)
Wed, 23 Nov 94 14:01:24 EST

This is the second of two messages with comments on the
2.0 HTML spec. This message includes comments or questions
on individual sections of the spec.

I look forward to learning more about HTML.

-- Sandra

---------------------------------------------------------------------
Sandra Martin O'Donnell email: odonnell@osf.org
Open Software Foundation phone: +1 (617) 621-8707
11 Cambridge Center fax: +1 (617) 225-2782
Cambridge, MA 02142 USA
---------------------------------------------------------------------

COMMENTS ON HTML SPECIFICATION -- 2.0
(Second of two)

Section 2.2
I have questions about several of the element/tag types.
For the ADDRESS tag, are there assumptions about the format
of an address? For example, would HTML assume that an address
is of this form?

Addressee
Number Street-name
City, State ZIP Code

If so, this is inadequate for most of the world's users. (Please
let me know if you'd like examples of address formats from areas
other than the U.S.) If the tag is culturally neutral, and so
allows other information (postal codes, district codes, reverse
ordering, etc.), what is the purpose of this tag? What information
does it provide?

For the PRE tag, the spec lists a WIDTH field with a value of 80.
Are those static display columns such as you might use on a
dumb terminal with a monospaced font?

For the BLOCKQUOTE tag, do you make assumptions about the
appearance of quoted material? For example, is it permissible
to use any of

"quote"
,,quote''
<<quote>>

or some other appearance? Or do you assume "quote" is the
only way quotes are formatted?

Section 2.4
The elements here are very European-language centric. Font
changes (e.g., roman to bold or roman to italic) are not
commonly used for, say, Asian text. In a Japanese document,
emphasized information might be underlined or written in
katakana (a phonetic writing system) rather than in ideographic
Kanji.

I'm not really asking for anything specific here, but just making
you aware that you may need to expand this list of elements, and
that you should definitely avoid associating specific appearance
with some elements. For example, it's good that the EM tag is
defined as "provides typographic emphasis, typically italics"
rather than "provides typographic emphasis using italics".

Section 2.6
What units do values for attributes like MAXLENGTH and SIZE
use? Are they numbers of bytes? The spec needs to provide that
information. Actually, I suspect you currently assume these
attributes are for numbers of characters, but this is incorrect
because characters vary in size (a single character can occupy
one or more bytes, depending on the code set), while a byte is
a fixed-size unit.
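
A minimal Python sketch of the distinction between counting characters and counting bytes. The sample strings and the helper name byte_length are illustrative assumptions, not taken from the spec:

```python
def byte_length(text, encoding):
    """Number of bytes `text` occupies in the given code set."""
    return len(text.encode(encoding))

# Hypothetical field values; the character count and the byte count
# agree only while every character fits in a single byte.
for text, encoding in [("vanilla", "ascii"),
                       ("caf\u00e9", "latin-1"),
                       ("caf\u00e9", "utf-8"),
                       ("\u65e5\u672c\u8a9e", "euc_jp")]:
    print(encoding, len(text), "characters,",
          byte_length(text, encoding), "bytes")
```

A MAXLENGTH of 4 would therefore accept or reject the same four-character value depending on whether the limit counts characters or bytes.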

For the SELECT tag, what if someone wants to provide OPTION
values in a variety of languages? For example, if I'm running
an English application, I'd like the OPTIONS to be in English
(vanilla, strawberry,...), while if it's a French application,
I want the OPTIONS in French (vanille, fraise,...). How,
if at all, do you support this?

Section 2.7
Please see my separate long :-) email on the problems with
the ISO Latin-1 Character Representation stuff. If you accept
my recommendations, this section needs to change.

Section 3.2
As with the previous comment, I believe your design choice
for Character Sets is too limited. Please see my other email
for more details.

One specific error in this section is the NOTE. ASCII is not
equivalent to ISO 646; it is equivalent to a specific version
of ISO 646. That version is ISO 646 IRV:1991, where IRV stands
for International Reference Version. The generic ISO 646 describes
rules for encoding characters in a seven-bit space, and those
rules have been applied to produce many separate versions (e.g.,
ISO 646 Danish, ISO 646 French, ISO 646 IRV, etc.).

Section 3.3
The description of an SGML declaration says that names are a
maximum of 72 characters, but that should be 72 bytes. As noted,
characters are variable; bytes aren't.

Section 3.4.2
The description of element names says:

. . .
An element name consists of a letter followed by up to 72
letters, digits, periods, or hyphens. . .

Can any letter be used in a name? I'm sure the ASCII letters
A-Z and a-z are permissible, but what about Latin-1 letters
like a-acute and C-circumflex? Are they okay? How about other
letters like a-ogonek and S-caron (both in Latin-2)?

The spec needs to specify the set of acceptable letters. I
recommend either allowing all letters (including those in code
sets other than Latin-1) or limiting names to ASCII letters.
Allowing only ASCII and Latin-1 is not a good solution because
it forces all sites to support Latin-1. (See my other email for
more info.)
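
The two alternatives recommended above could be sketched roughly as follows in Python. The function names are hypothetical, the length limit is deliberately left out, and "letter" in the second check is whatever Python's str.isalpha accepts, which is an assumption rather than anything the spec defines:

```python
import re

# Option 1: names restricted to ASCII letters, digits, periods, hyphens.
ASCII_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")

def is_ascii_name(name):
    return ASCII_NAME.fullmatch(name) is not None

# Option 2: any letter from any code set may appear in a name.
def is_any_letter_name(name):
    return (name[:1].isalpha()
            and all(ch.isalnum() or ch in ".-" for ch in name[1:]))

for candidate in ["BLOCKQUOTE", "p\u00e1gina", "\u0160\u0105d", "1abc"]:
    print(candidate, is_ascii_name(candidate), is_any_letter_name(candidate))
```

Either rule is easy to state and to test; a rule that admits exactly ASCII plus Latin-1 would need an explicit table of code points instead.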

Section 3.4.3
Same issue for name tokens as the previous comment. The spec
says a token consists of "a sequence of letters,..." It should
spell out what letters are acceptable.

Later in this section, it says attribute values are limited
to 1024 characters. That should be 1024 bytes.

Section 3.4.5
This needs to change if you accept my recommendations regarding
Latin-1 as outlined in my other email. However, there's also
a wording problem in this section. It says "...See the
Special Characters section of this specification for more
information." But this section is itself called "Special
Characters." The name of one of these sections (3.4.5 or 3.14)
needs to change.

Section 3.5.3
This section says documents can be queried with a keyword
search. Is there a list of acceptable keywords? If so, I must
have missed it.

Section 3.6.2
Again, does HTML make assumptions about what kind of information
is in an ADDRESS element and the order in which it appears? The
example in this section also includes a phone number. Do you
make assumptions that a phone number is a certain number of
digits or is formatted a certain way? (I can provide examples
of phone numbers with more or fewer than 10 digits, and
with varying formats. Please let me know if you'd like that info.)

Section 3.6.5
Please be aware that the "typical" renderings the spec lists
for various levels of headings are very American- and European-centric.
The fonts listed wouldn't be typical for, say, Asian text, and
the headings themselves wouldn't be "flush left against a left
margin" for right-to-left languages like Arabic or Hebrew. What
is the purpose of listing "typical" renderings? Can you omit them?

Section 3.7
In this section, it says "Level 1 implementations must render
highlighted text distinctly from plain text." What if there are
font limitations? Because of the number of characters in Asian
fonts (usually between 6000 and 7000), Asian users typically
have significantly fewer fonts available to them than do American
and European users. What if there is only one font on the system
for a given set of characters?

Section 3.8.4
Typo -- "...use <EM> except eeein the case..." should be
"...use <EM> except in the case..."

Section 3.11.2
Change "20 characters" and "24 characters" in describing
directory name listings to refer either to bytes or display
columns. I'm not sure which is correct, but characters isn't.
However, why does HTML specify the typical width of a
directory list column?

Section 3.12.2
The entire paragraph that begins "The PRE element may be used
with the optional WIDTH attribute..." needs to change to
reflect that WIDTH almost certainly does not measure characters;
it measures monospaced display columns.

Later in this section, it says that a horizontal tab "must be
interpreted as the smallest positive nonzero number of spaces
which will leave the number of characters so far on the line..."
Instead of referring to characters, it should be "...number of
display columns..."
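
A rough Python sketch of tab expansion counted in display columns rather than characters. It assumes tab stops every 8 columns and a simplified width rule (East Asian wide and fullwidth characters take two columns, combining marks take none); the function names are hypothetical:

```python
import unicodedata

def display_width(ch):
    """Columns one character occupies on a fixed-width display (simplified)."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def expand_tabs(line, tab_stop=8):
    """Replace each tab with 1..tab_stop spaces, tracking display columns."""
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            pad = tab_stop - (col % tab_stop)  # always at least one space
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += display_width(ch)
    return "".join(out), col

print(expand_tabs("ab\tc"))
print(expand_tabs("\u65e5\u672c\tc"))
```

In the second call the two kanji already fill four columns, so the tab needs four spaces, not the six that a count of characters would give.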

Section 3.13.4
The description of MAXLENGTH should be changed to say it is
the maximum number of bytes, rather than characters. Same with
the sentence about the default.

The description of SIZE should be changed. From the context,
it appears that SIZE measures display columns, but it may
be measuring bytes. I'm not sure, but I know characters is
wrong.

Re: the RADIO element and VALUE field. Is it possible to have
multiple language versions of the name values? For example,
could HTML use a messaging system to allow, say, French,
English, and German (and Japanese, if you break free of the
Latin-1 limitation) versions of values?

Section 3.13.6
Same question as above. Can SELECT support multiple language
versions of the values?

Section 3.13.7
I assume ROWS and COLS measure units of fixed-width display
rows and columns rather than characters. The spec needs to be
revised to reflect this. Also, the text at the end of this
section should be changed from "...1024 characters" and
"...240 characters" to refer to display columns or bytes. I'm
not sure which is right, but characters isn't.

Section 3.14
My other email explains why I think almost this entire section
about special characters is wrong. Please see that message.

Section 3.14.2 (and related others)
Is there any particular reason that HTML uses decimal numbers
when referring to characters by numeric values? Could that be
changed to hexadecimal?

Section 3.14.3
My other email explains why I think the dependency on Latin-1
numeric values is inappropriate. Please see that message.
By the way, though, what would happen if I used, say, &#227; in
an HTML document, but did not intend for it to be interpreted
as an a-tilde (a-tilde has the decimal value 227 in Latin-1)?
Suppose I intended for it to be interpreted as an a-breve
instead (a-breve is at 227 in Latin-2)? If I have the
appropriate fonts to render the text as an a-breve, is there
anything in HTML that will detect this not-intended-usage?
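
The ambiguity can be shown with a short Python sketch that decodes the same numeric value under both code sets. The variable names are illustrative:

```python
import unicodedata

code = 227  # the value used in the example above

# The same value names a different character in each code set.
as_latin1 = bytes([code]).decode("latin-1")
as_latin2 = bytes([code]).decode("iso8859-2")

print(unicodedata.name(as_latin1))
print(unicodedata.name(as_latin2))
```

Nothing in the number itself records which code set the author had in mind.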

Section 3.15.2
See my other email for why I think this is a problem.
By the way, both the entity names and the numeric values
lists have a few errors. They list the character
small-a-with-tilde twice and omit the a-with-ring.

Using the numeric values, here's how it currently
reads:

..
&226 Small a, circumflex accent
&227 Small a, tilde
&228 Small a, tilde
&229 Small a, dieresis or umlaut mark
&230 Small ae dipthong (ligature)
..

To accurately match Latin-1, it should read:

..
&226 Small a, circumflex accent
&227 Small a, tilde
&228 Small a, dieresis or umlaut mark
&229 Small a, ring
&230 Small ae dipthong (ligature)
..

I don't think these tables should be in the spec at all,
but if they do stay, they should be accurate.
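
One way to check such a table mechanically is to generate it from the code set. A minimal Python sketch, where latin1_names is a hypothetical helper and the descriptions come from Python's character database rather than from the spec:

```python
import unicodedata

def latin1_names(start, stop):
    """Return (value, character name) pairs for Latin-1 values start..stop."""
    return [(code, unicodedata.name(bytes([code]).decode("latin-1")))
            for code in range(start, stop + 1)]

for code, name in latin1_names(226, 230):
    print(code, name)
```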

Section 5.1.3
Why do you propose adding the no-break space and soft hyphen
as additional special characters? What purpose will they serve?
My initial reaction is to recommend against them, but I'd like
to hear how they are intended to be used.

Section 5.2.3
Even though this is for an obsolete element, it really is
incorrect to equate ASCII and a MIME "text/plain" body.
"text/plain" refers to a lack of graphics, special presentation
semantics, etc. It has nothing to do with the code set ASCII.
I can have Latin-1 or Latin-2 or Japanese EUC text that is
"text/plain."

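As an illustration of the Section 5.2.3 point, here is a small Python sketch that builds a "text/plain" body in Latin-2 using the standard email package. The helper name and the sample text are assumptions made for the example:

```python
from email.message import EmailMessage

def plain_text_message(body, charset):
    """Build a text/plain message whose body uses the given code set."""
    msg = EmailMessage()
    msg.set_content(body, subtype="plain", charset=charset)
    return msg

msg = plain_text_message("a-breve: \u0103", "iso-8859-2")
print(msg.get_content_type())     # the media type
print(msg.get_content_charset())  # the code set, carried as a parameter
```

The media type and the code set are separate pieces of information, so "text/plain" alone says nothing about ASCII.
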
Section 6.1
The first BASESET listed for charset should be ISO 646 IRV:1991.
The 1991 version is exactly the same as ASCII. The earlier 1983
version substituted the currency symbol for the dollar sign.

Also, I don't understand the listing for the second BASESET.
Is that supposed to be ISO 8859-1? Why is it listed that
way? Also, as my other email indicates, I don't think this
should be the only other supported BASESET.

Section 6.2.1
See my other email for why I think these ISOlat1 definitions are
a bad idea.

That's it for now!