Proposal: No RE character in HTML

"Daniel W. Connolly" <connolly@hal.com>
Message-id: <9406101738.AA07940@ulua.hal.com>
To: html-ig@oclc.org
Subject: Proposal: No RE character in HTML
Date: Fri, 10 Jun 1994 12:38:51 -0500
From: "Daniel W. Connolly" <connolly@hal.com>
Content-Length: 3790

I've already made this change to the spec, but it's subtle and folks
might not have noticed it. I want to be sure everyone's aware of this
detail. Perhaps I need to explain it more in the spec...

There are some complex and subtle rules in the SGML standard that say
that in some cases, record end characters (kinda like newlines) get
"ignored," i.e. they're not reported to the application as part of the
ESIS.

The proposal is to eliminate this messing around with newlines by
changing the SGML declaration for HTML so that there is no character
that plays the role of RE. LF and CR characters act like tabs: they
are treated as whitespace in the relevant ways (e.g. in ELEMENT
content), but they are never "ignored."

from $Id: html.decl,v 1.6 1994/05/18 17:23:34 connolly Exp $:

         FUNCTION
              --  SPACE       32
                  TAB SEPCHAR  9
                  LF  SEPCHAR 10
                  FF  SEPCHAR 12
                  CR  SEPCHAR 13 --

        -- The above is an accurate description of the usage of FUNCTION --
        -- characters in HTML implementations; that is, there is no      --
        -- Record Start or Record End character, and no occurrences of   --
        -- character 10 or 13 are "ignored" by the parser.               --
        -- But because few SGML implementations support this concrete    --
        -- syntax, we include the one below.                             --

        -- Note that in order to get correct behaviour w.r.t. newline    --
        -- processing, you will have to play some tricks in constructing --
        -- the document entity for parsing in order to keep the parser   --
        -- from ignoring newlines in surprising ways                     --

                  RE          13
                  RS          10
                  SPACE       32
                  TAB SEPCHAR  9


Here's an example. The same sentence appears twice in this file.
The only difference is that, in the second repetition, there's a TAB
character at the end of a couple of lines (after <em> and after "case").

With the default SGML declaration, the result is that there is no
space around the word "case" in the first repetition, but there is in
the second:


% cat implementors-guide/newlines.html 
<title>newline test</title>
<p>
Here's a strange<em>
case
</em>that might surpirse you.
<p>
Here's a strange<em>	
case	
</em>that might surpirse you.

% sgmls html.decl doctype.sgml implementors-guide/newlines.html 
(HTML
(HEAD
(TITLE
-newline test
)TITLE
)HEAD
(BODY
(P
-Here's a strange
(EM
-case
)EM
-that might surpirse you.
)P
(P
-Here's a strange
(EM
-\011\ncase\011
)EM
-that might surpirse you.
)P
)BODY
)HTML
C
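
To make the difference between the two repetitions concrete, here is a
rough Python model of just the two rules at work in this example: an RE
right after a start tag is dropped, and an RE right before an end tag is
dropped. It is a simplification for illustration only (the real rules
in the standard cover more cases), and the function name
drop_boundary_res is made up here.

```python
# Simplified model of two of the SGML RE rules, for illustration only.
# RE is written as "\n", the way sgmls shows it in its output.

def drop_boundary_res(content: str) -> str:
    """Drop an RE that immediately follows the start tag or immediately
    precedes the end tag of an element whose content is plain text."""
    if content.startswith("\n"):
        content = content[1:]
    if content.endswith("\n"):
        content = content[:-1]
    return content

# First repetition: <em>, RE, "case", RE, </em>
print(repr(drop_boundary_res("\ncase\n")))      # 'case'
# Second repetition: a TAB sits in front of each RE
print(repr(drop_boundary_res("\t\ncase\t\n")))  # '\t\ncase\t'
```

In the second call the TAB counts as data, so the RE after it is no
longer "right after the start tag" and survives, which matches the
\011\ncase\011 line in the sgmls output.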



sgmls doesn't support the modified SGML declaration, but I can
simulate the effect by changing all newline characters to tabs before
sending them in to sgmls:

% perl -pe 's/\n/\t/' implementors-guide/newlines.html |
	SGML_PATH="%N.dtd:%N.sgml" sgmls html.decl doctype.sgml -
(HTML
(HEAD
(TITLE
-newline test
)TITLE
)HEAD
(BODY
(P
-\011Here's a strange
(EM
-\011case\011
)EM
-that might surpirse you.\011
)P
(P
-\011Here's a strange
(EM
-\011\011case\011\011
)EM
-that might surpirse you.\011
)P
)BODY
)HTML
C
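
For anyone without perl handy, the same preprocessing step can be
sketched in Python. The function name newlines_to_tabs is made up for
this example; like the one-liner, it only touches LF and leaves any CR
characters alone.

```python
def newlines_to_tabs(text: str) -> str:
    """Replace every LF with a TAB so the parser never sees a record
    boundary it could decide to ignore."""
    return text.replace("\n", "\t")

sample = "Here's a strange<em>\ncase\n</em>that might surprise you.\n"
print(repr(newlines_to_tabs(sample)))
```

Each former newline then reaches the application as a \011 in the data,
as in the output above.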


I believe it will be very confusing for authors to have whitespace
vanish under these subtle conditions. Folks won't generally run
into it, but it makes a significant difference as to how PRE works.
For example, with the default SGML declaration, the following
are equivalent:

<pre>twelve chars</pre>

and:

<pre>
twelve chars
</pre>

This might be useful in some cases, but in those cases, I think the
newline munging should be done on the application side, and not on the
parsing side.
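
As a sketch only of what "on the application side" could look like:
assume the parser hands over every newline in the PRE content as data,
and the application itself drops one line break right after the start
tag and one right before the end tag. The name trim_pre_data and the
choice to accept CR, LF, or CRLF are assumptions for illustration, not
something taken from the spec or from any browser.

```python
LINE_BREAKS = ("\r\n", "\n", "\r")  # check CRLF first so it is removed whole

def trim_pre_data(data: str) -> str:
    """Drop at most one leading and one trailing line break from the
    character data of a PRE element; everything else is kept as is."""
    for brk in LINE_BREAKS:
        if data.startswith(brk):
            data = data[len(brk):]
            break
    for brk in LINE_BREAKS:
        if data.endswith(brk):
            data = data[:-len(brk)]
            break
    return data

# Both forms of the PRE example come out the same.
print(repr(trim_pre_data("twelve chars")))      # 'twelve chars'
print(repr(trim_pre_data("\ntwelve chars\n")))  # 'twelve chars'
```

The point of doing it here is that the policy is visible and can be
changed per element, instead of being buried in the parser.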

And getting the rules for ignoring RE's _just right_ in the WWW Library
common code is really much more trouble than it's worth.

Dan