Re: progress on HTML 2.0 reconstruction

Francois Yergeau (yergeau@alis.ca)
Wed, 29 Mar 95 10:48:19 EST

Roy T. Fielding <fielding@avron.ICS.UCI.EDU> writes:
>
>>>...
>>>
>>>When an HTML document is encoded using US-ASCII, the mechanisms of
>>>character entity references (Section 6.3) may be used to encode
>>>additional characters from ISO-8859-1.
>>
>> I don't think the use of entities should be restricted to
>> ASCII-encoded documents. They are always legal, as long as one has
>> ASCII to mark them up (see section 6.3.1).
>
>I don't think that sentence restricts them at all -- it is only
>referring to when a document is limited to the US-ASCII encoding.

Strictly speaking, you are right, but singling out the case of ASCII
makes it *appear* that entities are only legal in that case. Clarity
never hurts: you wouldn't believe the number of people I run into who
insist that entities *must* be used, that plain and simple ISO-Latin-1
is illegal in HTML. These are not dumb people, they simply have not
read the spec rigorously enough, and the same is bound to happen with
the above sentence. *Please* change it to something like:

Irrespective of the encoding of the document, the mechanisms...

>>>6.3.2 Character octet reference
>> ...
>I believe the WG decided on the interpretation:
>
> The character octet references are not dependent on the character
> set encoding of the document. For example, "&#215;" always represents
> the ISO-8859-1 multiply sign, even when the document's declared
> character set is other than ISO-8859-1.
>
>so I have added that to the spec.

I understand that this decision was made to preserve the numeric
character references in the DTD itself (cf. section on "Character
mnemonic entities") but I think this is going too far, and actually
strays further from SGML than needed.

Consider the case where I receive a document encoded in ISO-Cyrillic.
I believe (wrongly?) that simply changing the second BASESET statement
in the SGML declaration to refer to "Right part of ISO-Cyrillic" or
some such would make the document fully conformant with respect to
that modified DTD, provided numeric character references are
interpreted according to the ISO-Cyrillic encoding, which happens to
preserve the important entities declared in "Character mnemonic
entities". So insisting on a Latin-1 interpretation appears to be
insisting on SGML non-conformance. If the charset is such that ASCII
is not preserved, then the DTD needs to be translated, and SGML says
that numeric character references are translated along with it.
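
To make the two readings concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions: the helper name
resolve_ncr is made up, "ISO-Cyrillic" is taken to mean ISO-8859-5,
and the four mnemonic entities are assumed to be quot, amp, lt and gt
with the decimal values shown.

```python
def resolve_ncr(number, charset):
    """Read a numeric character reference such as &#215; as one octet
    in `charset` and return the character it denotes.
    (Hypothetical helper, for illustration only.)"""
    if not 0 <= number <= 255:
        raise ValueError("not a single octet: %d" % number)
    return bytes([number]).decode(charset)

# Reading 1 (the WG text): &#215; is always taken from ISO-8859-1.
latin1 = resolve_ncr(215, "iso-8859-1")

# Reading 2 (the announced charset): the same reference in a document
# declared as ISO-8859-5 (assumed here to be "ISO-Cyrillic").
cyrillic = resolve_ncr(215, "iso-8859-5")

# Assumed values of the mnemonic entities; all lie in the ASCII range,
# which both charsets share, so both readings agree on them.
mnemonics = {"quot": 34, "amp": 38, "lt": 60, "gt": 62}
preserved = all(
    resolve_ncr(n, "iso-8859-1") == resolve_ncr(n, "iso-8859-5")
    for n in mnemonics.values()
)

print(repr(latin1), repr(cyrillic), preserved)
```

The two readings differ only for octets above 127, which is exactly
where the charsets themselves differ.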

If it seems too risky to say "interpret as per the announced charset",
why not simply forbid numeric character references in documents not
encoded in Latin-1, as a stopgap measure? I don't think that would
break with current practice, so it should be acceptable.
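
As a rough sketch of what such a stopgap check could look like (in
Python; the function name, the regular expression and the list of
accepted charset labels are all assumptions made for illustration,
not anything taken from the draft):

```python
import re

# Matches numeric character references such as &#215; (assumed syntax:
# "&#", decimal digits, optional ";"). Named entities are not matched.
NUMERIC_REF = re.compile(r"&#[0-9]+;?")

# Labels treated as "Latin-1" for this sketch (an assumed, partial list).
LATIN1_NAMES = {"iso-8859-1", "iso_8859-1", "latin1", "latin-1"}

def forbidden_numeric_refs(text, charset):
    """Return the numeric character references that the stopgap rule
    would reject: none for Latin-1 documents, all of them otherwise."""
    if charset.lower() in LATIN1_NAMES:
        return []
    return NUMERIC_REF.findall(text)

sample = "2 &#215; 3 &lt; 7"
print(forbidden_numeric_refs(sample, "ISO-8859-1"))
print(forbidden_numeric_refs(sample, "ISO-8859-5"))
```

Named entities such as &lt; pass through untouched under this rule;
only the numeric form is restricted.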

-- 
François Yergeau <yergeau@alis.ca>
Alis Technologies Inc., Montréal
Tél: +1 (514) 738-9171
Fax: +1 (514) 342-0318