Re: Entities

Murray Maloney (murray@sco.COM)
Thu, 22 Sep 94 10:32:49 EDT

I'm really pleased that this issue has come up just now.
I have, finally, come back to doing the work that I had
promised to do for the HTML 2.0 spec and a big part of it
pertains to entities and numeric character references.

I will be posting several mail messages today containing
the HTML-encoded versions of my rewrites of various
parts of the spec pertaining to "text".

The four pieces are:

Character Sets (Charsets.html)

Character Data (Text.html)
Special Characters
- Space
- Hyphen
Control Characters
- Horizontal Tab (HT - 9 dec)
- Line Feed (LF - 10 dec)
- Carriage Return (CR - 13 dec)
Numeric Character References
Character Entities
NOTE: Markup Characters
NOTE: CDATA, RCDATA
Comments
Note: Tags in Comments

Character Entity References (Entities.html)
Numeric Character References (NumCharRef.html)

Please install these files on a WWW server and review them.
I'm sorry, but I don't have access to a WWW server outside
of our corporate firewall or I would have published on the Web.
I'm also sorry that I don't know how to do a multi-part MIME
mail message, or I would have done that instead of several messages.

You might also want to install the following fragment:

<UL>
<LI> <A HREF="Charsets.html"> Character Sets </A>
<LI> <A HREF="Text.html"> Character Data </A>
<LI> <A HREF="Entities.html"> Entities </A>
<LI> <A HREF="NumCharRef.html"> Numeric Character References </A>
</UL>

I think that I have captured all of the comments that I had received
up until yesterday, and I think that these documents now accurately
reflect the HTML 2.0 specification.

Please comment. Also, please advise me what I need to do to have
this material incorporated into the spec.

Murray

P.S. I have noticed one curious thing... When I use &#11; I get a space.
ASCII 11 is supposed to be a Vertical Tab (VT), so I find it a bit odd.
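
As a rough way to reproduce the P.S. outside a browser, here is a minimal Python sketch of how a decimal numeric character reference maps to a code position. The name `decode_numeric_refs` and the 0-255 limit are assumptions made for illustration; nothing here is taken from the spec text.

```python
import re

# Matches decimal numeric character references such as &#11; or &#233;
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def decode_numeric_refs(text):
    """Replace each &#ddd; with the character at that code position (0-255)."""
    def replace(match):
        code = int(match.group(1))
        if code > 255:
            # Outside the 8-bit range discussed here: leave untouched.
            return match.group(0)
        # Python code points 0-255 coincide with ISO 8859-1 positions.
        return chr(code)
    return NUMERIC_REF.sub(replace, text)


decoded = decode_numeric_refs("a&#11;b")
print(repr(decoded))          # the middle character is VT (0x0B)
print(decoded[1].isspace())   # whether Python classes VT as whitespace
```

A browser that folds every whitespace-like character into a single space would show this as "a b", which may be what is happening with &#11;.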

Also, see my comments in reply to Dan and Lee below...

>
> In message <9409211753.AA10388@sqrex.sq.com>, lee@sq.com writes:
> >The enclosed list of entities comprises the ones that are in the draft plus
> >the ones that define the other ISO 8859-1 characters.
> >Since it's stated that the character set is ISO 8859-1, shouldn't the full
> >set of entities be available?

The full set of characters in ISO 8859-1 is available through
numeric character references, except for nbsp.
The entities supported are those of ISO 8879 Added Latin 1.
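
To illustrate the distinction, here is a small Python sketch in which a named entity and a numeric character reference reach the same character. The `ENTITY_CODES` table is only an excerpt copied from the table later in this message, and `expand` is a hypothetical helper, not part of any spec or library.

```python
import re

# Excerpt only: a few Added Latin 1 names and their ISO 8859-1 code positions,
# copied from the table in this message. The real set is larger.
ENTITY_CODES = {"Agrave": 192, "eacute": 233, "szlig": 223, "yuml": 255}

# Either a decimal numeric reference (&#233;) or a named one (&eacute;).
REFERENCE = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def expand(text):
    """Expand numeric references and the named entities listed above."""
    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            code = int(body[1:])
            return chr(code) if code <= 255 else match.group(0)
        code = ENTITY_CODES.get(body)
        # Unknown names are left as written rather than guessed at.
        return chr(code) if code is not None else match.group(0)
    return REFERENCE.sub(replace, text)


# Both spellings reach the same character; 161 is written numerically here.
print(expand("caf&eacute; / caf&#233; / &#161;Hola!"))
```

Names are matched case-sensitively, so `Agrave` and `agrave` would be different entries.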

> >The additions are at the end.
> >
> >Mosaic 2.4 doesn't seem to support these, so you could argue that this is not
> >current practice. It does seem odd to support these chars with &#ddd; and
> >not with named references, though.
>
> Only a little. I suspect this will be one of the issues addressed in
> HTML 2.1.
>
> Here's a table that summarizes the HTML character set
> as I understand it. Note that characters 27, 127-159, 172,
> 215, and 247 (decimal) still represent outstanding issues:

None of the control characters are supported except for
09 (HT), 10 (LF), and 11 (VT).

That means that 00-08, 12-31, 127-160, and 215 are outstanding issues.
The multiply sign currently at #215 is not legitimately part of 8859-1.
However, the division sign at #247 is part of 8859-1.
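
The ranges just listed can be written down as a small classifier. This is only a sketch of the statement above, not a normative table; `classify` and its three labels are names invented for the example.

```python
# Ranges as stated in the reply above; this is not a normative table.
SUPPORTED_CONTROLS = {9, 10, 11}          # HT, LF, VT


def classify(code):
    """Return 'control', 'outstanding', or 'graphic' for a code 0-255."""
    if not 0 <= code <= 255:
        raise ValueError("code position must be in 0-255")
    if code in SUPPORTED_CONTROLS:
        return "control"
    if code <= 31 or 127 <= code <= 160 or code == 215:
        return "outstanding"
    return "graphic"


for code in (9, 13, 65, 160, 215, 247):
    print(code, classify(code))
```

Under these ranges CR (13) lands in the outstanding group, even though the outline earlier in this message lists it among the control characters, so that point may deserve a second look.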

>
> 27: an escape character for ISO2022 escape sequences?
> (the multi-lingual document issue again...)

We have not declared support for ISO2022 in HTML 2.0, have we?
>
> 127-159: is there any defined use for these?

Yes, ISO-6429 defines the codes from 128-159. Seven are undefined.
The remainder have potential uses in browsers, retrieval engines,
HTTP, and editors.

127  del  No useful purpose in HTML
132  ind  index
133  nel  next line
134  ssa  start of selected area
135  esa  end of selected area
136  hts  horiz tab set
137  htj  horiz tab with justification
138  vts  vert tab set
139  pld  partial line down
140  plu  partial line up
141  ri   reverse index
142  ss2  single shift 2
143  ss3  single shift 3
144  dcs  device control string
145  pu1  private use 1
146  pu2  private use 2
147  sts  set transmit state
148  cch  cancel character
149  mw   message waiting
150  spa  start of guarded protected area
151  epa  end of guarded protected area
155  csi  control sequence introducer
156  st   string terminator
157  osc  operating system command
158  pm   privacy message
159  apc  application program command
173  shy  soft hyphen
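
The count of seven undefined codes can be checked against the list above with a few lines of Python. The set `LISTED` simply records which positions in 128-159 appear in the list; it is a transcription for this example, not a copy of ISO 6429.

```python
# Code positions within 128-159 that appear in the list above.
LISTED = set(range(132, 152)) | set(range(155, 160))

undefined = sorted(set(range(128, 160)) - LISTED)
print(undefined)        # [128, 129, 130, 131, 152, 153, 154]
print(len(undefined))   # 7
```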
>
> 172: in the X fonts, it's a "logical not" character.
> Is this part of the ISO8859-1 standard?

Yes, it is. See my list.

> What's the SGML entity name (&lnot; ?)

Irrelevant until HTML 2.1 or later.
The entire list of supported entity names
is presented in my list.
>
> 215: in X fonts, it's a "times" character...

But not in 8859-1.

> 247: in X fonts, it's a "divide" character...

And in 8859-1.
>
> There's also the question of whether PRE lines end in CR, LF, CRLF,
> or any of the above.
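
If the answer turns out to be "any of the above", a reader of PRE text would have to accept all three conventions. A minimal Python sketch of that reading follows; `split_pre_lines` is a hypothetical name, and this shows one possible behaviour, not what the spec requires.

```python
import re

# CRLF must be tried first so it is not split into two separate breaks.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_pre_lines(text):
    """Split preformatted text on CRLF, lone CR, or lone LF."""
    return LINE_BREAK.split(text)


print(split_pre_lines("one\r\ntwo\rthree\nfour"))
# ['one', 'two', 'three', 'four']
```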
>
> Number Entity Glyph Description
> 0(00): --UNUSED--
> 1(01): --UNUSED--
> 2(02): --UNUSED--
> 3(03): --UNUSED--
> 4(04): --UNUSED--
> 5(05): --UNUSED--
> 6(06): --UNUSED--
> 7(07): --UNUSED--
> 8(08): --UNUSED--
> 9(09): TAB (just like space, except in pre)
> 10(0A): LF (just like space, except in pre)
> 11(0B): --UNUSED--

Hmmm! Not what I discovered.

> 12(0C): --UNUSED--
> 13(0D): CR (just like space, except in pre)
> 14(0E): --UNUSED--
> 15(0F): --UNUSED--
> 16(10): --UNUSED--
> 17(11): --UNUSED--
> 18(12): --UNUSED--
> 19(13): --UNUSED--
> 20(14): --UNUSED--
> 21(15): --UNUSED--
> 22(16): --UNUSED--
> 23(17): --UNUSED--
> 24(18): --UNUSED--
> 25(19): --UNUSED--
> 26(1A): --UNUSED--
> 27(1B): ESC ???

UNUSED

> 28(1C): --UNUSED--
> 29(1D): --UNUSED--
> 30(1E): --UNUSED--
> 31(1F): --UNUSED--
> 32(20): ala ISO646-IRV (ASCII)
> 33(21): ! ala ISO646-IRV (ASCII)
> 34(22): " ala ISO646-IRV (ASCII)
> 35(23): # ala ISO646-IRV (ASCII)
> 36(24): $ ala ISO646-IRV (ASCII)
> 37(25): % ala ISO646-IRV (ASCII)
> 38(26): & ala ISO646-IRV (ASCII)
> 39(27): ' ala ISO646-IRV (ASCII)
> 40(28): ( ala ISO646-IRV (ASCII)
> 41(29): ) ala ISO646-IRV (ASCII)
> 42(2A): * ala ISO646-IRV (ASCII)
> 43(2B): + ala ISO646-IRV (ASCII)
> 44(2C): , ala ISO646-IRV (ASCII)
> 45(2D): - ala ISO646-IRV (ASCII)
> 46(2E): . ala ISO646-IRV (ASCII)
> 47(2F): / ala ISO646-IRV (ASCII)
> 48(30): 0 ala ISO646-IRV (ASCII)
> 49(31): 1 ala ISO646-IRV (ASCII)
> 50(32): 2 ala ISO646-IRV (ASCII)
> 51(33): 3 ala ISO646-IRV (ASCII)
> 52(34): 4 ala ISO646-IRV (ASCII)
> 53(35): 5 ala ISO646-IRV (ASCII)
> 54(36): 6 ala ISO646-IRV (ASCII)
> 55(37): 7 ala ISO646-IRV (ASCII)
> 56(38): 8 ala ISO646-IRV (ASCII)
> 57(39): 9 ala ISO646-IRV (ASCII)
> 58(3A): : ala ISO646-IRV (ASCII)
> 59(3B): ; ala ISO646-IRV (ASCII)
> 60(3C): < ala ISO646-IRV (ASCII)
> 61(3D): = ala ISO646-IRV (ASCII)
> 62(3E): > ala ISO646-IRV (ASCII)
> 63(3F): ? ala ISO646-IRV (ASCII)
> 64(40): @ ala ISO646-IRV (ASCII)
> 65(41): A ala ISO646-IRV (ASCII)
> 66(42): B ala ISO646-IRV (ASCII)
> 67(43): C ala ISO646-IRV (ASCII)
> 68(44): D ala ISO646-IRV (ASCII)
> 69(45): E ala ISO646-IRV (ASCII)
> 70(46): F ala ISO646-IRV (ASCII)
> 71(47): G ala ISO646-IRV (ASCII)
> 72(48): H ala ISO646-IRV (ASCII)
> 73(49): I ala ISO646-IRV (ASCII)
> 74(4A): J ala ISO646-IRV (ASCII)
> 75(4B): K ala ISO646-IRV (ASCII)
> 76(4C): L ala ISO646-IRV (ASCII)
> 77(4D): M ala ISO646-IRV (ASCII)
> 78(4E): N ala ISO646-IRV (ASCII)
> 79(4F): O ala ISO646-IRV (ASCII)
> 80(50): P ala ISO646-IRV (ASCII)
> 81(51): Q ala ISO646-IRV (ASCII)
> 82(52): R ala ISO646-IRV (ASCII)
> 83(53): S ala ISO646-IRV (ASCII)
> 84(54): T ala ISO646-IRV (ASCII)
> 85(55): U ala ISO646-IRV (ASCII)
> 86(56): V ala ISO646-IRV (ASCII)
> 87(57): W ala ISO646-IRV (ASCII)
> 88(58): X ala ISO646-IRV (ASCII)
> 89(59): Y ala ISO646-IRV (ASCII)
> 90(5A): Z ala ISO646-IRV (ASCII)
> 91(5B): [ ala ISO646-IRV (ASCII)
> 92(5C): \ ala ISO646-IRV (ASCII)
> 93(5D): ] ala ISO646-IRV (ASCII)
> 94(5E): ^ ala ISO646-IRV (ASCII)
> 95(5F): _ ala ISO646-IRV (ASCII)
> 96(60): ` ala ISO646-IRV (ASCII)
> 97(61): a ala ISO646-IRV (ASCII)
> 98(62): b ala ISO646-IRV (ASCII)
> 99(63): c ala ISO646-IRV (ASCII)
> 100(64): d ala ISO646-IRV (ASCII)
> 101(65): e ala ISO646-IRV (ASCII)
> 102(66): f ala ISO646-IRV (ASCII)
> 103(67): g ala ISO646-IRV (ASCII)
> 104(68): h ala ISO646-IRV (ASCII)
> 105(69): i ala ISO646-IRV (ASCII)
> 106(6A): j ala ISO646-IRV (ASCII)
> 107(6B): k ala ISO646-IRV (ASCII)
> 108(6C): l ala ISO646-IRV (ASCII)
> 109(6D): m ala ISO646-IRV (ASCII)
> 110(6E): n ala ISO646-IRV (ASCII)
> 111(6F): o ala ISO646-IRV (ASCII)
> 112(70): p ala ISO646-IRV (ASCII)
> 113(71): q ala ISO646-IRV (ASCII)
> 114(72): r ala ISO646-IRV (ASCII)
> 115(73): s ala ISO646-IRV (ASCII)
> 116(74): t ala ISO646-IRV (ASCII)
> 117(75): u ala ISO646-IRV (ASCII)
> 118(76): v ala ISO646-IRV (ASCII)
> 119(77): w ala ISO646-IRV (ASCII)
> 120(78): x ala ISO646-IRV (ASCII)
> 121(79): y ala ISO646-IRV (ASCII)
> 122(7A): z ala ISO646-IRV (ASCII)
> 123(7B): { ala ISO646-IRV (ASCII)
> 124(7C): | ala ISO646-IRV (ASCII)
> 125(7D): } ala ISO646-IRV (ASCII)
> 126(7E): ~ ala ISO646-IRV (ASCII)
> 127(7F): DEL ???

UNUSED

128(80): UNUSED
129(81): UNUSED
130(82): UNUSED
131(83): UNUSED
132(84): UNUSED
133(85): UNUSED
134(86): UNUSED
135(87): UNUSED
136(88): UNUSED
137(89): UNUSED
138(8A): UNUSED
139(8B): UNUSED
140(8C): UNUSED
141(8D): UNUSED
142(8E): UNUSED
143(8F): UNUSED
144(90): UNUSED
145(91): UNUSED
146(92): UNUSED
147(93): UNUSED
148(94): UNUSED
149(95): UNUSED
150(96): UNUSED
151(97): UNUSED
152(98): UNUSED
153(99): UNUSED
154(9A): UNUSED
155(9B): UNUSED
156(9C): UNUSED
157(9D): UNUSED
158(9E): UNUSED
159(9F): UNUSED

From 160-191, the names listed are not usable as character entity names.
These characters can only be used as coded characters or numeric
character references.
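
In practice this means that anything generating references has to fall back to the numeric form for 160-191. A minimal Python sketch of that rule follows, with `NAMES` as a short excerpt of the 192-255 rows below and `to_reference` as a hypothetical helper.

```python
# Excerpt of names from the 192-255 part of the table; hypothetical helper.
NAMES = {192: "Agrave", 233: "eacute", 252: "uuml"}


def to_reference(char):
    """Return a reference for one ISO 8859-1 character in 160-255."""
    code = ord(char)
    if not 160 <= code <= 255:
        return char                      # other characters pass through
    if code >= 192 and code in NAMES:
        return "&%s;" % NAMES[code]      # a named entity is available
    # Numeric reference otherwise, which is always the case for 160-191.
    return "&#%d;" % code


print("".join(to_reference(c) for c in "\xa9 1994 \xfcber"))
```

Here the copyright sign (169) comes out as a numeric reference while the u-umlaut (252) gets its name.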

> 160(A0): nbsp Non breaking space
> 161(A1): iexcl = inverted exclamation mark
> 162(A2): cent = cent sign
> 163(A3): pound = pound sign
> 164(A4): curren = general currency sign
> 165(A5): yen = /yen =yen sign
> 166(A6): brvbar = broken (vertical) bar
> 167(A7): sect = section sign
> 168(A8): uml =umlaut mark
> 169(A9): copy = copyright sign
> 170(AA): ordf = ordinal indicator, feminine
> 171(AB): laquo = angle quotation mark, left
> 172(AC): = not sign
> 173(AD): shy Soft Hyphen
> 174(AE): reg = /circledR =registered sign
> 175(AF): macr =macron
> 176(B0): ring =ring
> 177(B1): plusmn = /pm B: =plus-or-minus sign
> 178(B2): sup2 = superscript two
> 179(B3): sup3 = superscript three
> 180(B4): acute =acute accent
> 181(B5): micro = micro sign
> 182(B6): para = pilcrow (paragraph sign)
> 183(B7): middot = /centerdot B: =middle dot
> 184(B8): cedil =cedilla
> 185(B9): sup1 = superscript one
> 186(BA): ordm = ordinal indicator, masculine
> 187(BB): raquo = angle quotation mark, right
> 188(BC): frac14 = fraction one-quarter
> 189(BD): half = fraction one-half
> 190(BE): frac34 = fraction three-quarters
> 191(BF): iquest = inverted question mark
> 192(C0): Agrave capital A, grave accent
> 193(C1): Aacute capital A, acute accent
> 194(C2): Acirc capital A, circumflex accent
> 195(C3): Atilde capital A, tilde
> 196(C4): Auml capital A, dieresis or umlaut mark
> 197(C5): Aring capital A, ring
> 198(C6): AElig capital AE diphthong (ligature)
> 199(C7): Ccedil capital C, cedilla
> 200(C8): Egrave capital E, grave accent
> 201(C9): Eacute capital E, acute accent
> 202(CA): Ecirc capital E, circumflex accent
> 203(CB): Euml capital E, dieresis or umlaut mark
> 204(CC): Igrave capital I, grave accent
> 205(CD): Iacute capital I, acute accent
> 206(CE): Icirc capital I, circumflex accent
> 207(CF): Iuml capital I, dieresis or umlaut mark
> 208(D0): ETH capital Eth, Icelandic
> 209(D1): Ntilde capital N, tilde
> 210(D2): Ograve capital O, grave accent
> 211(D3): Oacute capital O, acute accent
> 212(D4): Ocirc capital O, circumflex accent
> 213(D5): Otilde capital O, tilde
> 214(D6): Ouml capital O, dieresis or umlaut mark
> 215(D7): multiply sign
> 216(D8): Oslash capital O, slash
> 217(D9): Ugrave capital U, grave accent
> 218(DA): Uacute capital U, acute accent
> 219(DB): Ucirc capital U, circumflex accent
> 220(DC): Uuml capital U, dieresis or umlaut mark
> 221(DD): Yacute capital Y, acute accent
> 222(DE): THORN capital THORN, Icelandic
> 223(DF): szlig small sharp s, German (sz ligature)
> 224(E0): agrave small a, grave accent
> 225(E1): aacute small a, acute accent
> 226(E2): acirc small a, circumflex accent
> 227(E3): atilde small a, tilde
> 228(E4): auml small a, dieresis or umlaut mark
> 229(E5): aring small a, ring
> 230(E6): aelig small ae diphthong (ligature)
> 231(E7): ccedil small c, cedilla
> 232(E8): egrave small e, grave accent
> 233(E9): eacute small e, acute accent
> 234(EA): ecirc small e, circumflex accent
> 235(EB): euml small e, dieresis or umlaut mark
> 236(EC): igrave small i, grave accent
> 237(ED): iacute small i, acute accent
> 238(EE): icirc small i, circumflex accent
> 239(EF): iuml small i, dieresis or umlaut mark
> 240(F0): eth small eth, Icelandic
> 241(F1): ntilde small n, tilde
> 242(F2): ograve small o, grave accent
> 243(F3): oacute small o, acute accent
> 244(F4): ocirc small o, circumflex accent
> 245(F5): otilde small o, tilde
> 246(F6): ouml small o, dieresis or umlaut mark
> 247(F7): divide sign
> 248(F8): oslash small o, slash
> 249(F9): ugrave small u, grave accent
> 250(FA): uacute small u, acute accent
> 251(FB): ucirc small u, circumflex accent
> 252(FC): uuml small u, dieresis or umlaut mark
> 253(FD): yacute small y, acute accent
> 254(FE): thorn small thorn, Icelandic
> 255(FF): yuml small y, dieresis or umlaut mark
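
For anyone regenerating or checking the table, the Number column is just the decimal code position followed by the same value as two hexadecimal digits. A short Python sketch, with `row_label` as a name made up for this example:

```python
def row_label(code):
    """Format a code position the way the table does, e.g. 233(E9)."""
    return "%d(%02X)" % (code, code)


for code in (9, 65, 233, 255):
    # Decoding the single byte as Latin-1 gives the ISO 8859-1 character.
    glyph = bytes([code]).decode("latin-1")
    print(row_label(code), repr(glyph))
```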