Re: HTML 2.0 LAST CALL: Hyperlinking, Forms, Elements

lilley (lilley@afs.mcc.ac.uk)
Thu, 1 Jun 95 16:18:21 EDT

Dan said:
> Speak now or forever hold your peace. I made a good faith effort to
> incorporate all comments submitted to this point. If you've submitted
> comments, check to see if they've been addressed. If not, resubmit
> them.

Firstly, thanks Dan for the excellent efforts you have made to represent
consensus in this spec. I feel that a small item has slipped through, and
as requested am bringing it to your attention.

Summary
-------
There is a conflict between section 5.1, The ISO Latin 1 Character Repertoire,
and the set of named entities referenced from the DTD. Consensus appeared
to have been reached in October 1994 to add missing entities but the changes
do not seem to have made it into the current spec. Proposed changes are
supplied.

Reassurance
-----------
This email does not suggest adding named entities for any characters
outside the ISO Latin-1 repertoire. There is no impact on the font
resources needed by current browsers or rendering capabilities expected
of them.

Problem description
-------------------

Section 5.1 (p25 of the A4 PostScript version) states:

The HTML DTD references the Added Latin 1 entity set, to allow mnemonic
representation of Latin 1 characters using only the widely supported
ASCII character set repertoire.

However, the DTD references a collection of entities called

ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML

which only supplies named entities for a subset of the non-ASCII characters
in ISO Latin-1, namely the accented characters. The remaining characters
may only be referred to by including their 8-bit code positions or by using
numeric character references (listed in the non-normative Appendix A).
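
To make the inconsistency concrete, here is a minimal Python sketch. It
is an illustration only and not part of the spec or the DTD; the names
NAMED and reference_for are hypothetical, and NAMED holds a single
sample entry standing in for the accented-letter names that the current
entity set does supply.

```python
# Illustrative only: one entry standing in for the accented-letter
# names that the current entity set does supply (233 is e-acute).
NAMED = {233: "eacute"}


def reference_for(char, named=NAMED):
    """Return an ASCII-only way to write `char` in an HTML document."""
    code = ord(char)
    if code < 128:
        return char                  # plain ASCII is passed through unchanged
    if code > 255:
        raise ValueError("outside the ISO Latin-1 repertoire")
    name = named.get(code)
    if name is not None:
        return "&" + name + ";"      # mnemonic entity reference
    return "&#%d;" % code            # numeric character reference


print(reference_for("\u00e9"))   # e-acute: a name is available
print(reference_for("\u00a9"))   # copyright sign: only the number is available
```

With only the accented letters named, the author writes a mnemonic for
one character and a bare number for the other, although both are in the
same repertoire.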

Thus, either the text in 5.1 should be altered to read

[...] selected Latin 1 characters [...]

which leaves the inconsistency of representation, or (preferably) the
number of named entities should be expanded, as per previous perceived
consensus, to include the missing characters. This might be done by

1) referencing an expanded collection of entities with the same name
2) referencing an expanded collection of entities with a new name
3) referencing the old collection of entities, plus an additional collection
4) placing the additional collection in A.3 proposed features

There are good arguments for all alternatives; the group must decide. My
personal preference would be 2.

Evidence of consensus
---------------------

On Mon, 10 Oct 1994 10:52:16 -0500 Daniel W. Connolly
(then connolly@hal.com) said in a thread entitled "Perceived Consensus:
Murray's entity stuff goes in"

<http://www.acl.lanl.gov/HTML_WG/html-wg-94q4.messages/0048.html>:

> Agreed: if we need names for characters, and there's an ISO entity
> name for the character, we'll use it.

> I'm willing to commit to supporting mnemonic entities for characters
> that are already in the HTML character set (ISO8859-1) like &shy;,
> &nbsp;, &iexcl;, &laquo;, and such.

On Tue, 11 Oct 1994 10:18:41 -0400 (EDT) Murray Maloney (murray@sco.COM)
said in the same thread:

<http://www.acl.lanl.gov/HTML_WG/html-wg-94q4.messages/0052.html>

> By which I think you mean that if a character is already supported,
> by virtue of it being part of the supported ISO8859-1, then we
> could commit to providing "character entity" support in addition
> to the "numeric character references". This is more specific,
> for ISO8859-1, than I was expecting from the spec. But it is
> certainly an acceptable "stake in the ground" from my perspective.

Arguments for option 1)
-----------------------

The collection ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML is based on
ISO 8879-1986//ENTITIES Added Latin 1//EN but has been modified already
to support HTML, so it could be modified some more.

On Mon, 12 Dec 1994 19:55:53 +0100 Daniel W. Connolly (then connolly@hal.com)
wrote on www-html in a thread entitled "Baffling math problems [Was:
HTML 3.0 DTD ]"

<http://gummo.stanford.edu/hypermail/www-html-1994q4/0152.html>

> The Added Latin 1 entity set defines a bunch of names for Latin 1
> characters. The SGML spec appendix that defines it makes no reference
> to the Latin 1 character set (ISO-8859-1). It maps those names to
> these thingies called CDATA entities -- system dependent data
> entities. I believe the intention is that the CDATA entities are
> supposed to be replaced on a per-SGML-system basis. So you might
> see TeX version of "ISO 8879-1986//ENTITIES Added Latin 1//EN", with:

> <!ENTITY eacute CDATA "\eacute" -- for TeX -->

> Since the document character set for HTML includes all the characters
> referred to by those names, there's no need to use system-specific
> mappings. The entities can be mapped to characters within the document
> character set.

> In response to the same feedback you saw, this set of definitions is
> now called:

> "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML"

Arguments for option 2)
-----------------------

There is a precedent for using a new name for the expanded collection of
named entities. Dave Raggett's draft html3.dtd, version

Draft: Fri 24-Mar-95 09:46:33

says

<!-- The HTML list of Latin-1 entities includes the full range
of characters in widely available Latin-1 fonts, and as such
is a mixture of ISOlat1 and other ISO publishing symbols -->

<!ENTITY % HTMLlat1 PUBLIC
"-//IETF//ENTITIES Added Latin 1 for HTML//EN">
%HTMLlat1;

Arguments for option 3)
-----------------------

Minimal changes compared to previous drafts; the changes are localised in
a separate collection. What do we call it, though, and how do we explain why
the entities are split into two collections?

Arguments for option 4)
-----------------------

Not all existing browsers implement all the extra named entities. But then, not
all browsers implement everything anyway. Supporting the extra entities is
little work. Existing browsers support some of the named entities already.

The missing entities (example, for option 2)
--------------------------------------------

a) Alter the comment block to read (something like):

<!-- Portions of this text are copyright ISO:

(C) International Organization for Standardization 1986
Permission to copy in any form is granted for use with
conforming SGML systems and applications as defined in
ISO 8879, provided this notice is included in all copies.
-->
<!-- Character entity set. Typical invocation:
<!ENTITY % HTMLlat1 PUBLIC
"-//IETF//ENTITIES Latin 1 for HTML//EN">
%HTMLlat1;
-->
<!-- Modified for use in HTML
$Id: ISOlat1.sgml,v 1.1 1994/09/24 14:06:34 connolly Exp $
-->
<!-- Modified to add characters not in Added Latin 1 which are in
the ISO Latin-1 character repertoire, which could only be
referred to by numeric references.
Also added the standard lt gt amp quot entities from HTML 2.0
HTMLlat1.sgml Chris Lilley, 13 March 1995
-->

b) Add these entities:

<!--
Entities that aren't accented characters, and so not in
ISO Added Latin 1. Entity names and comments based on relevant
entities in
"ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN"

The four entities umlaut, macron, acute, cedilla
were not in ISO Numeric and Special Graphic
either; I took their names from the numeric entity list in
http://www.hpl.hp.co.uk/people/dsr/html/latin1.html
Chris Lilley, 13 March 1995
-->
<!ENTITY yuml CDATA "&#255;" -- small y, dieresis or umlaut mark -->

<!ENTITY iexcl CDATA "&#161;" -- inverted exclamation mark -->
<!ENTITY cent CDATA "&#162;" -- cent sign -->
<!ENTITY pound CDATA "&#163;" -- pound sterling sign -->
<!ENTITY curren CDATA "&#164;" -- general currency sign -->
<!ENTITY yen CDATA "&#165;" -- yen sign -->
<!ENTITY brvbar CDATA "&#166;" -- broken (vertical) bar -->
<!ENTITY sect CDATA "&#167;" -- section sign -->
<!ENTITY umlaut CDATA "&#168;" -- umlaut (dieresis) -->
<!ENTITY copy CDATA "&#169;" -- copyright sign -->
<!ENTITY ordf CDATA "&#170;" -- ordinal indicator, feminine -->
<!ENTITY laquo CDATA "&#171;" -- angle quotation mark, left -->
<!ENTITY not CDATA "&#172;" -- not sign -->
<!ENTITY shy CDATA "&#173;" -- soft hyphen -->
<!ENTITY reg CDATA "&#174;" -- registered trademark -->
<!ENTITY macron CDATA "&#175;" -- macron -->
<!ENTITY deg CDATA "&#176;" -- degree sign -->
<!ENTITY plusmn CDATA "&#177;" -- plus-or-minus sign -->
<!ENTITY sup2 CDATA "&#178;" -- superscript two -->
<!ENTITY sup3 CDATA "&#179;" -- superscript three -->
<!ENTITY acute CDATA "&#180;" -- acute accent -->
<!ENTITY micro CDATA "&#181;" -- micro sign -->
<!ENTITY para CDATA "&#182;" -- pilcrow (paragraph sign) -->
<!ENTITY middot CDATA "&#183;" -- middle dot (centred decimal point) -->
<!ENTITY cedilla CDATA "&#184;" -- cedilla accent -->
<!ENTITY sup1 CDATA "&#185;" -- superscript one -->
<!ENTITY ordm CDATA "&#186;" -- ordinal indicator, masculine -->
<!ENTITY raquo CDATA "&#187;" -- angle quotation mark, right -->
<!ENTITY frac14 CDATA "&#188;" -- fraction one-quarter -->
<!ENTITY frac12 CDATA "&#189;" -- fraction one-half -->
<!ENTITY frac34 CDATA "&#190;" -- fraction three-quarters -->
<!ENTITY iquest CDATA "&#191;" -- inverted question mark -->

<!-- the odd ones tucked in amongst the sequence of accented letters -->
<!ENTITY times CDATA "&#215;" -- multiply sign -->
<!ENTITY divide CDATA "&#247;" -- divide sign -->

<!-- perhaps these should now be here, rather than inlined? -->
<!ENTITY amp CDATA "&#38;" -- ampersand -->
<!ENTITY gt CDATA "&#62;" -- greater than -->
<!ENTITY lt CDATA "&#60;" -- less than -->
<!ENTITY quot CDATA "&#34;" -- double quote -->
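
For anyone who wants to check or regenerate the list mechanically, the
non-accented additions above amount to a name-to-code-position table.
The following Python sketch is an illustration under that assumption,
not a proposed implementation; PROPOSED, declaration and expand_named
are hypothetical names, and the generated declarations omit the
descriptive comments.

```python
import re

# Names proposed above for code positions 161..191, in order.
NAMES_161_191 = (
    "iexcl cent pound curren yen brvbar sect umlaut copy ordf laquo "
    "not shy reg macron deg plusmn sup2 sup3 acute micro para middot "
    "cedilla sup1 ordm raquo frac14 frac12 frac34 iquest"
).split()

# Name -> ISO Latin-1 code position for the proposed additions.
PROPOSED = {name: 161 + i for i, name in enumerate(NAMES_161_191)}
PROPOSED.update(times=215, divide=247)  # the two among the accented letters


def declaration(name, code):
    """Format one entity declaration in the style used above."""
    return '<!ENTITY %s CDATA "&#%d;">' % (name, code)


def expand_named(text, table=PROPOSED):
    """Rewrite known named references as numeric character references."""
    def replace(match):
        code = table.get(match.group(1))
        if code is None:
            return match.group(0)    # unknown name: leave untouched
        return "&#%d;" % code
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


print(declaration("copy", PROPOSED["copy"]))
print(expand_named("&copy; 1995, &frac12; price, &unknown;"))
```

A browser that already handles numeric character references could
support the extra names with a lookup of this kind, which is the sense
in which supporting them is little work.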

Dan said:
> So in the interest of time, please keep your comments focused. And
> remember: for bonus points: please suggest replacement text! (and
> always excerpt the original, citing the revision date and preferably a
> URL).

I hope I have satisfied these requirements.

-- 
Chris Lilley, Technical Author
+-------------------------------------------------------------------+
|       Manchester and North HPC Training & Education Centre        |
+-------------------------------------------------------------------+
| Computer Graphics Unit,             Email: Chris.Lilley@mcc.ac.uk |
| Manchester Computing Centre,        Voice: +44 161 275 6045       |
| Oxford Road, Manchester, UK.          Fax: +44 161 275 6040       |
| M13 9PL                            BioMOO: ChrisL                 |
|     URI: http://info.mcc.ac.uk/CGU/staff/lilley/lilley.html       | 
+-------------------------------------------------------------------+
|     "The first W in WWW will not wait."   François Yergeau        |
+-------------------------------------------------------------------+