Re: Entities

Murray Maloney <murray@sco.COM>
Date: Thu, 22 Sep 94 10:32:49 EDT
Message-id: <9409221024.aa00606@dali.scocan.sco.COM>
Reply-To: murray@sco.COM
Originator: html-wg@oclc.org
Sender: html-wg@oclc.org
Precedence: bulk
From: Murray Maloney <murray@sco.COM>
To: Multiple recipients of list <html-wg@oclc.org>
Subject: Re: Entities
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
X-Comment: HTML Working Group (Private)

I'm really pleased that this issue has come up just now.
I have, finally, come back to doing the work that I had
promised to do for the HTML 2.0 spec and a big part of it
pertains to entities and numeric character references.

I will be posting several mail messages today containing
the HTML-encoded versions of my rewrites of various
parts of the spec pertaining to "text".  

The four pieces are:

	Character Sets (Charsets.html)

	Character Data (Text.html)
	    Special Characters
	      - Space  
	      - Hyphen 
	    Control Characters
	      - Horizontal Tab (HT - 9 dec) 
	      - Line Feed  (LF - 10 dec) 
	      - Carriage Return (CR - 13 dec) 
	    Numeric Character References 
	    Character Entities 
	    NOTE: Markup Characters
	    NOTE: CDATA, RCDATA
	    Comments
	    Note: Tags in Comments

	Character Entity References (Entities.html)
	Numeric Character References (NumCharRef.html)

Please install these files on a WWW server and review them.
I'm sorry, but I don't have access to a WWW server outside
of our corporate firewall or I would have published on the Web.
I'm also sorry that I don't know how to do a multi-part MIME
mail message, or I would have done that instead of several messages.

You might also want to install the following fragment:

	<UL>
	<LI> <A HREF="Charsets.html"> Character Sets </A>
	<LI> <A HREF="Text.html"> Character Data </A>
	<LI> <A HREF="Entities.html"> Entities </A>
	<LI> <A HREF="NumCharRef.html"> Numeric Character References </A>
	</UL>

I think that I have captured all of the comments that I had received
up until yesterday, and I think that these documents now accurately
reflect the HTML 2.0 specification.

Please comment.  Also, please advise me what I need to do to have
this material incorporated into the spec.

Murray

P.S.  I have noticed one curious thing...  When I use &#11; I get a space.
ASCII 11 is supposed to be a Vertical Tab (VT), so I find it a bit odd.

Also, see my comments in reply to Dan and Lee below...

> 
> In message <9409211753.AA10388@sqrex.sq.com>, lee@sq.com writes:
> >The enclosed list of entities comprises the ones that are in the draft plus
> >the ones that define the other ISO 8859-1 characters.
> >Since it's stated that the character set is ISO 8859-1, shouldn't the full
> >set of entities be available?

The full set of characters in 8859/1 is available through 
numeric character references, except for nbsp.
The entities supported are those of the ISO 8879 Added Latin 1 set.
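
To make the distinction concrete, here is a rough sketch in Python (not
part of the spec; the function name resolve_numeric_refs is made up for
this example). It assumes a numeric character reference simply names a
code position in 8859-1, and it leaves anything above 255 untouched:

```python
import re

def resolve_numeric_refs(text):
    """Replace each &#ddd; in the 0-255 range with the 8859-1 character."""
    def repl(match):
        code = int(match.group(1))
        if code > 255:
            return match.group(0)  # outside 8859-1: leave the reference as-is
        return bytes([code]).decode("latin-1")
    return re.sub(r"&#(\d+);", repl, text)

print(resolve_numeric_refs("caf&#233; &#169; 1994"))  # prints: café © 1994
```

Named entities would need a separate lookup table; this sketch covers
only the numeric form.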

> >The additions are at the end.
> >
> >Mosaic 2.4 doesn't seem to support these, so you could argue that this is not
> >current practice.  It does seem odd to support these chars with &#ddd; and
> >not with named references, though.
> 
> Only a little. I suspect this will be one of the issues addressed in
> HTML 2.1.
> 
> Here's a table that summarizes the HTML character set
> as I understand it. Note that characters 27, 127-159, 172,
> 215, and 247 (decimal) still represent outstanding issues:

None of the control characters are supported except for 
09 (HT), 10 (LF), and 11 (VT).

That means that 00-08, 12-31, 127-160, and 215 are outstanding issues.
The multiply sign currently at #215 is not legitimately part of 8859-1.
However, the division sign at #247 is part of 8859-1.
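
For anyone who wants to check a given code position against the ranges
I just listed, a small sketch in Python follows. The names
SUPPORTED_CONTROLS and is_outstanding are invented for illustration, and
the ranges are copied from the text above rather than from any spec:

```python
# Control characters stated above as supported: HT, LF, VT.
SUPPORTED_CONTROLS = {9, 10, 11}

def is_outstanding(code):
    """True if the code position falls in a range listed above as outstanding."""
    return (
        0 <= code <= 8
        or 12 <= code <= 31
        or 127 <= code <= 160
        or code == 215
    )

# Every position below 32 is either a supported control or outstanding.
print(all(c in SUPPORTED_CONTROLS or is_outstanding(c) for c in range(32)))  # True
```

Note that under these ranges 13 (CR) counts as outstanding, even though
CR appears in the Text.html outline.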

> 
> 	27: an escape character for ISO2022 escape sequences?
> 		(the multi-lingual document issue again...)

We have not declared support for ISO2022 in HTML 2.0, have we?
> 
> 	127-159: is there any defined use for these?

Yes, ISO-6429 defines the codes from 128-159.  Seven are undefined.
The remainder have potential uses in browsers, retrieval engines,
HTTP, and editors.

127	del	No useful purpose in HTML
132	ind	index
133	nel	next line
134	ssa	start of selected area
135	esa	end of selected area
136	hts	horiz tab set
137	htj	horiz tab with justification
138	vts	vert tab set
139	pld	partial line down
140	plu	partial line up
141	ri	reverse index
142	ss2	single shift 2
143	ss3	single shift 3
144	dcs	device control string
145	pu1	private use 1
146	pu2	private use 2
147	sts	set transmit state
148	cch	cancel character
149	mw	message waiting
150	spa	start of guarded protected area
151	epa	end of guarded protected area
155	csi	control sequence indicator
156	st	string terminator
157	osc	operating system command
158	pm	privacy message
159	apc	application program command
173	shy	soft hyphen
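
The "seven are undefined" count can be checked against the table with a
few lines of Python. This is only a sanity check on the list above (the
variable names are mine), not a statement about ISO-6429 itself:

```python
# Code positions in 128-159 that appear in the table above.
listed = set(range(132, 152)) | set(range(155, 160))

# Positions in 128-159 that the table does not mention.
undefined = sorted(set(range(128, 160)) - listed)

print(undefined)       # [128, 129, 130, 131, 152, 153, 154]
print(len(undefined))  # 7
```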
> 
> 	172: in the X fonts, it's a "logical not" character.
> 		Is this part of the ISO8859-1 standard?

		Yes, it is.  See my list.

> 		What's the SGML entity name (&lnot; ?)
	
		Irrelevant until HTML 2.1 or later.
		The entire list of supported entity names
		is presented in my list.
> 
> 	215: in X fonts, it's a "times" character...

	But not in 8859-1.

> 	247: in X fonts, it's a "divide" character...

	And in 8859-1.
> 
> There's also the question of whether PRE lines end in CR, LF, CRLF,
> or any of the above.
> 
>   Number  Entity Glyph Description
>    0(00):              --UNUSED--
>    1(01):              --UNUSED--
>    2(02):              --UNUSED--
>    3(03):              --UNUSED--
>    4(04):              --UNUSED--
>    5(05):              --UNUSED--
>    6(06):              --UNUSED--
>    7(07):              --UNUSED--
>    8(08):              --UNUSED--
>    9(09):              TAB (just like space, except in pre)
>   10(0A):              LF  (just like space, except in pre)
>   11(0B):              --UNUSED--

	Hmmm!  Not what I discovered.

>   12(0C):              --UNUSED--
>   13(0D):              CR  (just like space, except in pre)
>   14(0E):              --UNUSED--
>   15(0F):              --UNUSED--
>   16(10):              --UNUSED--
>   17(11):              --UNUSED--
>   18(12):              --UNUSED--
>   19(13):              --UNUSED--
>   20(14):              --UNUSED--
>   21(15):              --UNUSED--
>   22(16):              --UNUSED--
>   23(17):              --UNUSED--
>   24(18):              --UNUSED--
>   25(19):              --UNUSED--
>   26(1A):              --UNUSED--
>   27(1B):              ESC ???
	
	UNUSED

>   28(1C):              --UNUSED--
>   29(1D):              --UNUSED--
>   30(1E):              --UNUSED--
>   31(1F):              --UNUSED--
>   32(20):               ala ISO646-IRV (ASCII)
>   33(21):            !  ala ISO646-IRV (ASCII)
>   34(22):            "  ala ISO646-IRV (ASCII)
>   35(23):            #  ala ISO646-IRV (ASCII)
>   36(24):            $  ala ISO646-IRV (ASCII)
>   37(25):            %  ala ISO646-IRV (ASCII)
>   38(26):            &  ala ISO646-IRV (ASCII)
>   39(27):            '  ala ISO646-IRV (ASCII)
>   40(28):            (  ala ISO646-IRV (ASCII)
>   41(29):            )  ala ISO646-IRV (ASCII)
>   42(2A):            *  ala ISO646-IRV (ASCII)
>   43(2B):            +  ala ISO646-IRV (ASCII)
>   44(2C):            ,  ala ISO646-IRV (ASCII)
>   45(2D):            -  ala ISO646-IRV (ASCII)
>   46(2E):            .  ala ISO646-IRV (ASCII)
>   47(2F):            /  ala ISO646-IRV (ASCII)
>   48(30):            0  ala ISO646-IRV (ASCII)
>   49(31):            1  ala ISO646-IRV (ASCII)
>   50(32):            2  ala ISO646-IRV (ASCII)
>   51(33):            3  ala ISO646-IRV (ASCII)
>   52(34):            4  ala ISO646-IRV (ASCII)
>   53(35):            5  ala ISO646-IRV (ASCII)
>   54(36):            6  ala ISO646-IRV (ASCII)
>   55(37):            7  ala ISO646-IRV (ASCII)
>   56(38):            8  ala ISO646-IRV (ASCII)
>   57(39):            9  ala ISO646-IRV (ASCII)
>   58(3A):            :  ala ISO646-IRV (ASCII)
>   59(3B):            ;  ala ISO646-IRV (ASCII)
>   60(3C):            <  ala ISO646-IRV (ASCII)
>   61(3D):            =  ala ISO646-IRV (ASCII)
>   62(3E):            >  ala ISO646-IRV (ASCII)
>   63(3F):            ?  ala ISO646-IRV (ASCII)
>   64(40):            @  ala ISO646-IRV (ASCII)
>   65(41):            A  ala ISO646-IRV (ASCII)
>   66(42):            B  ala ISO646-IRV (ASCII)
>   67(43):            C  ala ISO646-IRV (ASCII)
>   68(44):            D  ala ISO646-IRV (ASCII)
>   69(45):            E  ala ISO646-IRV (ASCII)
>   70(46):            F  ala ISO646-IRV (ASCII)
>   71(47):            G  ala ISO646-IRV (ASCII)
>   72(48):            H  ala ISO646-IRV (ASCII)
>   73(49):            I  ala ISO646-IRV (ASCII)
>   74(4A):            J  ala ISO646-IRV (ASCII)
>   75(4B):            K  ala ISO646-IRV (ASCII)
>   76(4C):            L  ala ISO646-IRV (ASCII)
>   77(4D):            M  ala ISO646-IRV (ASCII)
>   78(4E):            N  ala ISO646-IRV (ASCII)
>   79(4F):            O  ala ISO646-IRV (ASCII)
>   80(50):            P  ala ISO646-IRV (ASCII)
>   81(51):            Q  ala ISO646-IRV (ASCII)
>   82(52):            R  ala ISO646-IRV (ASCII)
>   83(53):            S  ala ISO646-IRV (ASCII)
>   84(54):            T  ala ISO646-IRV (ASCII)
>   85(55):            U  ala ISO646-IRV (ASCII)
>   86(56):            V  ala ISO646-IRV (ASCII)
>   87(57):            W  ala ISO646-IRV (ASCII)
>   88(58):            X  ala ISO646-IRV (ASCII)
>   89(59):            Y  ala ISO646-IRV (ASCII)
>   90(5A):            Z  ala ISO646-IRV (ASCII)
>   91(5B):            [  ala ISO646-IRV (ASCII)
>   92(5C):            \  ala ISO646-IRV (ASCII)
>   93(5D):            ]  ala ISO646-IRV (ASCII)
>   94(5E):            ^  ala ISO646-IRV (ASCII)
>   95(5F):            _  ala ISO646-IRV (ASCII)
>   96(60):            `  ala ISO646-IRV (ASCII)
>   97(61):            a  ala ISO646-IRV (ASCII)
>   98(62):            b  ala ISO646-IRV (ASCII)
>   99(63):            c  ala ISO646-IRV (ASCII)
>  100(64):            d  ala ISO646-IRV (ASCII)
>  101(65):            e  ala ISO646-IRV (ASCII)
>  102(66):            f  ala ISO646-IRV (ASCII)
>  103(67):            g  ala ISO646-IRV (ASCII)
>  104(68):            h  ala ISO646-IRV (ASCII)
>  105(69):            i  ala ISO646-IRV (ASCII)
>  106(6A):            j  ala ISO646-IRV (ASCII)
>  107(6B):            k  ala ISO646-IRV (ASCII)
>  108(6C):            l  ala ISO646-IRV (ASCII)
>  109(6D):            m  ala ISO646-IRV (ASCII)
>  110(6E):            n  ala ISO646-IRV (ASCII)
>  111(6F):            o  ala ISO646-IRV (ASCII)
>  112(70):            p  ala ISO646-IRV (ASCII)
>  113(71):            q  ala ISO646-IRV (ASCII)
>  114(72):            r  ala ISO646-IRV (ASCII)
>  115(73):            s  ala ISO646-IRV (ASCII)
>  116(74):            t  ala ISO646-IRV (ASCII)
>  117(75):            u  ala ISO646-IRV (ASCII)
>  118(76):            v  ala ISO646-IRV (ASCII)
>  119(77):            w  ala ISO646-IRV (ASCII)
>  120(78):            x  ala ISO646-IRV (ASCII)
>  121(79):            y  ala ISO646-IRV (ASCII)
>  122(7A):            z  ala ISO646-IRV (ASCII)
>  123(7B):            {  ala ISO646-IRV (ASCII)
>  124(7C):            |  ala ISO646-IRV (ASCII)
>  125(7D):            }  ala ISO646-IRV (ASCII)
>  126(7E):            ~  ala ISO646-IRV (ASCII)
>  127(7F):             ???

	UNUSED

   128(80):             UNUSED
   129(81):             UNUSED
   130(82):             UNUSED
   131(83):             UNUSED
   132(84):             UNUSED
   133(85):             UNUSED
   134(86):             UNUSED
   135(87):             UNUSED
   136(88):             UNUSED
   137(89):             UNUSED
   138(8A):             UNUSED
   139(8B):             UNUSED
   140(8C):             UNUSED
   141(8D):             UNUSED
   142(8E):             UNUSED
   143(8F):             UNUSED
   144(90):             UNUSED
   145(91):             UNUSED
   146(92):             UNUSED
   147(93):             UNUSED
   148(94):             UNUSED
   149(95):             UNUSED
   150(96):             UNUSED
   151(97):             UNUSED
   152(98):             UNUSED
   153(99):             UNUSED
   154(9A):             UNUSED
   155(9B):             UNUSED
   156(9C):             UNUSED
   157(9D):             UNUSED
   158(9E):             UNUSED
   159(9F):             UNUSED

From 160-191, the names listed are not usable as character entity names.
These characters can only be used as coded characters or numeric 
character references.
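
As a sketch of what that rule means for a tool that writes HTML, the
Python below picks a named entity only for 192-255 and falls back to a
numeric character reference everywhere else. The function name
reference_for is hypothetical, and I am assuming that the names in
Python's html.entities table match the names in the list for 192-255:

```python
from html.entities import codepoint2name

# Positions in 192-255 that carry no entity name in the list (multiply, divide).
UNNAMED = {215, 247}

def reference_for(ch):
    """Return a reference for one ISO 8859-1 character, per the ranges above."""
    code = ord(ch)
    if code > 255:
        raise ValueError("not an ISO 8859-1 character")
    if 192 <= code <= 255 and code not in UNNAMED:
        return "&" + codepoint2name[code] + ";"
    return "&#%d;" % code  # numeric character reference

print(reference_for("\u00e9"))  # &eacute;
print(reference_for("\u00a9"))  # &#169;
```

Plain ASCII characters also come back as numeric references here, which
is harmless but verbose; a real tool would pass them through unchanged.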

>  160(A0):       nbsp  Non breaking space
>  161(A1):      iexcl  = inverted exclamation mark 
>  162(A2):       cent  = cent sign 
>  163(A3):      pound  = pound sign 
>  164(A4):     curren  = general currency sign 
>  165(A5):        yen  = /yen =yen sign 
>  166(A6):     brvbar  = broken (vertical) bar 
>  167(A7):       sect  = section sign 
>  168(A8):        uml  =umlaut mark
>  169(A9):       copy  = copyright sign 
>  170(AA):       ordf  = ordinal indicator, feminine 
>  171(AB):      laquo  = angle quotation mark, left 
>  172(AC):             = not sign
>  173(AD):        shy  Soft Hyphen
>  174(AE):        reg  = /circledR =registered sign 
>  175(AF):       macr  =macron
>  176(B0):       ring  =ring
>  177(B1):     plusmn  = /pm B: =plus-or-minus sign 
>  178(B2):       sup2  = superscript two 
>  179(B3):       sup3  = superscript three 
>  180(B4):      acute  =acute accent
>  181(B5):      micro  = micro sign 
>  182(B6):       para  = pilcrow (paragraph sign) 
>  183(B7):     middot  = /centerdot B: =middle dot 
>  184(B8):      cedil  =cedilla
>  185(B9):       sup1  = superscript one 
>  186(BA):       ordm  = ordinal indicator, masculine 
>  187(BB):      raquo  = angle quotation mark, right 
>  188(BC):     frac14  = fraction one-quarter 
>  189(BD):       half  = fraction one-half 
>  190(BE):     frac34  = fraction three-quarters 
>  191(BF):     iquest  = inverted question mark 
>  192(C0):     Agrave  capital A, grave accent 
>  193(C1):     Aacute  capital A, acute accent 
>  194(C2):      Acirc  capital A, circumflex accent 
>  195(C3):     Atilde  capital A, tilde 
>  196(C4):       Auml  capital A, dieresis or umlaut mark 
>  197(C5):      Aring  capital A, ring 
>  198(C6):      AElig  capital AE diphthong (ligature) 
>  199(C7):     Ccedil  capital C, cedilla 
>  200(C8):     Egrave  capital E, grave accent 
>  201(C9):     Eacute  capital E, acute accent 
>  202(CA):      Ecirc  capital E, circumflex accent 
>  203(CB):       Euml  capital E, dieresis or umlaut mark 
>  204(CC):     Igrave  capital I, grave accent 
>  205(CD):     Iacute  capital I, acute accent 
>  206(CE):      Icirc  capital I, circumflex accent 
>  207(CF):       Iuml  capital I, dieresis or umlaut mark 
>  208(D0):        ETH  capital Eth, Icelandic 
>  209(D1):     Ntilde  capital N, tilde 
>  210(D2):     Ograve  capital O, grave accent 
>  211(D3):     Oacute  capital O, acute accent 
>  212(D4):      Ocirc  capital O, circumflex accent 
>  213(D5):     Otilde  capital O, tilde 
>  214(D6):       Ouml  capital O, dieresis or umlaut mark 
>  215(D7):             multiply sign
>  216(D8):     Oslash  capital O, slash 
>  217(D9):     Ugrave  capital U, grave accent 
>  218(DA):     Uacute  capital U, acute accent 
>  219(DB):      Ucirc  capital U, circumflex accent 
>  220(DC):       Uuml  capital U, dieresis or umlaut mark 
>  221(DD):     Yacute  capital Y, acute accent 
>  222(DE):      THORN  capital THORN, Icelandic 
>  223(DF):      szlig  small sharp s, German (sz ligature) 
>  224(E0):     agrave  small a, grave accent 
>  225(E1):     aacute  small a, acute accent 
>  226(E2):      acirc  small a, circumflex accent 
>  227(E3):     atilde  small a, tilde 
>  228(E4):       auml  small a, dieresis or umlaut mark 
>  229(E5):      aring  small a, ring 
>  230(E6):      aelig  small ae diphthong (ligature) 
>  231(E7):     ccedil  small c, cedilla 
>  232(E8):     egrave  small e, grave accent 
>  233(E9):     eacute  small e, acute accent 
>  234(EA):      ecirc  small e, circumflex accent 
>  235(EB):       euml  small e, dieresis or umlaut mark 
>  236(EC):     igrave  small i, grave accent 
>  237(ED):     iacute  small i, acute accent 
>  238(EE):      icirc  small i, circumflex accent 
>  239(EF):       iuml  small i, dieresis or umlaut mark 
>  240(F0):        eth  small eth, Icelandic 
>  241(F1):     ntilde  small n, tilde 
>  242(F2):     ograve  small o, grave accent 
>  243(F3):     oacute  small o, acute accent 
>  244(F4):      ocirc  small o, circumflex accent 
>  245(F5):     otilde  small o, tilde 
>  246(F6):       ouml  small o, dieresis or umlaut mark 
>  247(F7):             divide sign
>  248(F8):     oslash  small o, slash 
>  249(F9):     ugrave  small u, grave accent 
>  250(FA):     uacute  small u, acute accent 
>  251(FB):      ucirc  small u, circumflex accent 
>  252(FC):       uuml  small u, dieresis or umlaut mark 
>  253(FD):     yacute  small y, acute accent 
>  254(FE):      thorn  small thorn, Icelandic 
>  255(FF):       yuml  small y, dieresis or umlaut mark