Re: Correct syntax of <LI> tags

Daniel W. Connolly (connolly@beach.w3.org)
Thu, 22 Jun 95 10:20:31 EDT

In message <9506221321.AA10318@mailer.oclc.org>, Ian Graham writes:
>
>Murray made the suggestion:
>
>> .....
>>
>> Whitespace (spaces and tabs) between tags and data characters
>> should be discarded and multiple spaces collapsed
>> into a single space during this process.
>
>If the above were applied, what would happen with things like:
>
> data <tag> data <tag>....
>
>Would this be compressed to data<tag>data...., and completely
>remove the whitespace separating the text data?

That's why the spec says what it says the way it says it:

http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_6.html#SEC63

|... each block structuring element is regarded as a paragraph by
|taking the data characters in its content and the content of its
|descendant elements, concatenating them, and splitting the result into
|words, separated by space, tab, or record end characters (and perhaps
|hyphen characters). The sequence of words is typeset as a paragraph by
|breaking it into lines.
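
To make the difference concrete, here is a minimal sketch (not from the spec) that runs Ian's "data <tag> data <tag>" case through both readings. The input string, the sub names, and the simplified tag pattern are all assumptions made for illustration; a real SGML parser would not find tags with a regex.

```perl
use strict;
use warnings;

# Hypothetical input modelled on Ian's "data <tag> data <tag>" example.
my $src = "data <em> data </em> more";

# Reading A (one assumed interpretation of the proposal): delete any
# whitespace that touches a tag, then remove the tags.
sub discard_around_tags {
    my ($text) = @_;
    $text =~ s/\s*(<\/?\w+>)\s*/$1/g;   # drop whitespace adjacent to tags
    $text =~ s/<\/?\w+>//g;             # remove the tags themselves
    return $text;
}

# Reading B (the spec paragraph): concatenate the data characters,
# then split the result into words on whitespace.
sub words_from_data {
    my ($text) = @_;
    $text =~ s/<\/?\w+>//g;                    # keep data characters only
    return grep { length } split /\s+/, $text; # skip any empty leading field
}

print "discard around tags: ", discard_around_tags($src), "\n";
# prints "discard around tags: datadatamore"
print "split into words:    ", join(',', words_from_data($src)), "\n";
# prints "split into words:    data,data,more"
```

Under reading A the three words fuse into one; under reading B they stay separate, because the whitespace is only consumed after the markup has been taken out.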

Firstly, that paragraph of the spec says "should," so it's not binding.

But here's the way I'd like to see it done:
1. Break the body into block-structuring elements (and headings).

2. Take the data characters in the content of a block
structuring element (or heading) and its descendants, and break
it into words, delimited by spaces.

3. Typeset the words into paragraphs. Put as much space
between words as necessary to make it look nice, independent
of where the spaces were in the source.

This little Perl ditty may help illustrate:

$html = <<EOF;
<li> w1
w2
w3 <em>w4 w5 </em>

w6 <li>w7 w8
w9 w10
<li>w11
EOF

@paras = split(/<li>/, $html);  # split body into paragraphs; a real
                                # parser would do this as per SGML

shift(@paras);                  # perl's split operator creates an empty
                                # para before the first <li>

grep(s-<(/)?\w+>--g, @paras);   # get rid of markup inside para: we're
                                # only interested in data chars here.

print "start with this:\n$html";

foreach $p (@paras){
    $p =~ s/^\s+//;             # get rid of leading space
    $p =~ s/\s+$//;             # and trailing space.
    @words = split(/\s+/, $p);

    print "\ntypeset these words into a para: ", join(',', @words), "\n";
}

Its output is this:

start with this:
<li> w1
w2
w3 <em>w4 w5 </em>

w6 <li>w7 w8
w9 w10
<li>w11

typeset these words into a para: w1,w2,w3,w4,w5,w6

typeset these words into a para: w7,w8,w9,w10

typeset these words into a para: w11
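
The script above stops at the word lists (steps 1 and 2). Step 3 could be sketched as a simple greedy line filler, assuming a fixed-width character display. The sub name fill_para and the width of 8 are hypothetical, chosen only for illustration; a real formatter is free to justify or space the words however looks best.

```perl
use strict;
use warnings;

# Greedy line filling: a hypothetical stand-in for step 3.
# $width is an assumed maximum line width in characters.
sub fill_para {
    my ($width, @words) = @_;
    my @lines;
    my $line = '';
    foreach my $w (@words) {
        if ($line eq '') {
            $line = $w;                   # first word on the line
        } elsif (length($line) + 1 + length($w) <= $width) {
            $line .= " $w";               # fits: one space, then the word
        } else {
            push @lines, $line;           # full: start a new line
            $line = $w;
        }
    }
    push @lines, $line if $line ne '';    # flush the last partial line
    return @lines;
}

my @words = qw(w1 w2 w3 w4 w5 w6);        # first word list from the output above
print "$_\n" foreach fill_para(8, @words);
# prints:
# w1 w2 w3
# w4 w5 w6
```

The line breaks depend only on the word list and the width, never on where the spaces or newlines fell in the source.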

Dan