Re: Interpretation of RE

Dan Connolly (connolly@w3.org)
Sun, 16 Apr 95 07:51:52 EDT

Grrrr... Record delimiters again! Curses!

Mr. Van Hoff is quite right. Common practice is pretty much
irreconcilable with the SGML standard in the example he cites.

For a while, this was in the SGML declaration for HTML:

SPACE 32
TAB SEPCHAR 9
LF SEPCHAR 10
FF SEPCHAR 12
CR SEPCHAR 13

-- The above is an accurate description of the usage of FUNCTION --
-- characters in HTML implementations; that is, there is no --
-- Record Start or Record End character, and no occurrences of --
-- character 10 or 13 are "ignored" by the parser. --
-- But because few SGML implementations support this concrete --
-- syntax, we include the one below. --

-- RE 13
RS 10
SPACE 32
TAB SEPCHAR 9 --

While the SEPCHAR-only version "fixes" this problem, it makes life very difficult for sgmls users.

The current draft says this:

3.2.1 Conventional Representation of Newlines and Record Delimiter Characters

SGML specifies that a text entity is a sequence of records, each
beginning with a record start character and ending with a record
end character (character numbers 10 and 13 respectively).

MIME specifies that a body of type text/* is a sequence of lines,
each terminated by CRLF, that is, octets 13, 10.

NOTE: In practice, HTML documents are frequently represented and
transmitted using an end of line convention that depends on the
conventions of the source of the document; frequently, that
representation consists of CR only, LF only, or a CR LF
combination. Hence the decoding of the octets will often result in
a text entity with some missing record start and record end
characters.

Since there is no ambiguity, HTML user agents are encouraged to
infer the missing record start and end characters.

An HTML user agent should treat end of line in any of its
variations as a word space in all contexts except
preformatted text. Within preformatted text, an HTML user agent
should expect to treat any of the three common representations of
end-of-line as starting a new line.
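
To make the rule in the last paragraph concrete, here is a minimal sketch in Python of how a user agent might fold the three end-of-line variants. The function name normalize_eol and the preformatted flag are made up for illustration; neither comes from the draft.

```python
import re

# Try CR LF first so the pair counts as one end of line, not two.
_EOL = re.compile(r"\r\n|\r|\n")


def normalize_eol(text: str, preformatted: bool = False) -> str:
    """Map CR, LF and CR LF to a line break (inside PRE) or a word space."""
    # Hypothetical helper: the draft describes the behaviour, not this API.
    return _EOL.sub("\n" if preformatted else " ", text)
```

Called on "a\r\nb\rc\nd", this gives "a b c d" in running text, and the same four letters on four separate lines when preformatted is true.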

This doesn't address the case below.

Frankly, I'm not sure what to do about this.

Suggestions?

Arthur van Hoff writes:
> > On Mon, 10 Apr 1995, Arthur van Hoff wrote:
> >
> > > element. If that is true, what is the correct interpretation of RE
> > > inside PRE content? For example:
> > >
> > > <pre>
> > > This is <b>
> > > bold
> > > </b> text.
> > > </pre>
> > >
> > > Should this be interpreted as:
> > >
> > > <pre>
> > > This is <b>bold</b> text.
> > > </pre>
> >
> > No, that is not correct. <pre> means preformatted ... that is use a
> > fixed pitch font and break the lines where the user did.
> > Section 10.2 of the March 29 HTML 2 draft is quite clear. Your example
> > should show as:
> >
> > This is
> > BOLD
> > text.
>
> I was afraid you would say that. Section 7.6.1 "Record Boundaries" of
> the SGML specification states:
>
> The first RE in an element is ignored if no RS, data, or
> proper subelement preceded it.
>
> The last RE in an element is ignored if no RS, data, or
> proper subelement follows it.
>
> Does this mean that HTML cannot be parsed by a strict SGML parser?
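
For comparison, a rough Python sketch of what the two quoted Record Boundaries rules do to the PRE example. It is a simplification, not an SGML parser: it assumes plain start and end tags only, and it ignores EMPTY elements, comments, and the other record-boundary rules. The name strict_sgml_view is made up for illustration.

```python
import re


def strict_sgml_view(markup: str) -> str:
    """Drop an end of line right after a start tag or right before an end tag."""
    # Fold CR LF and bare CR to LF so one character stands for the record end.
    markup = re.sub(r"\r\n|\r", "\n", markup)
    # First RE in an element: only the start tag precedes it.
    markup = re.sub(r"(<[A-Za-z][^<>]*>)\n", r"\1", markup)
    # Last RE in an element: only the end tag follows it.
    markup = re.sub(r"\n(</[A-Za-z][^<>]*>)", r"\1", markup)
    return markup


example = "<pre>\nThis is <b>\nbold\n</b> text.\n</pre>"
```

Under those assumptions, strict_sgml_view(example) leaves "<pre>This is <b>bold</b> text.</pre>", which is the single-line reading, while common practice keeps the line breaks around "bold".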