That's why the spec says what it says the way it says it:
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_6.html#SEC63
|... each block structuring element is regarded as a paragraph by
|taking the data characters in its content and the content of its
|descendant elements, concatenating them, and splitting the result into
|words, separated by space, tab, or record end characters (and perhaps
|hyphen characters). The sequence of words is typeset as a paragraph by
|breaking it into lines.
First of all, this paragraph says "should," so it's not binding.
But here's the way I'd like to see it done:
1. Break the body into block-structuring elements (and headings).
2. Take the data characters in the content of a block
structuring element (or heading) and its descendants, and break
them into words, delimited by spaces.
3. Typeset the words into paragraphs. Put as much space
between words as necessary to make it look nice, independent
of where the spaces were in the source.
This little Perl ditty may help illustrate:
$html = <<EOF;
<li> w1
w2
w3 <em>w4 w5 </em>
w6 <li>w7 w8
w9 w10
<li>w11
EOF
@paras = split(/<li>/, $html);  # split body into paragraphs
                                # a real parser would do this
                                # as per SGML
shift(@paras);                  # perl's split operator creates an empty para
                                # before the first <li>
foreach (@paras) { s-</?\w+>--g; }  # get rid of markup inside para -- we're
                                    # only interested in data chars here.
print "start with this:\n$html";
foreach $p (@paras) {
    $p =~ s/^\s+//;             # get rid of leading space
    $p =~ s/\s+$//;             # and trailing space.
    @words = split(/\s+/, $p);
    print "\ntypeset these words into a para: ", join(',', @words), "\n";
}
Its output is this:
start with this:
<li> w1
w2
w3 <em>w4 w5 </em>
w6 <li>w7 w8
w9 w10
<li>w11
typeset these words into a para: w1,w2,w3,w4,w5,w6
typeset these words into a para: w7,w8,w9,w10
typeset these words into a para: w11
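Step 3 can be sketched in the same spirit. What follows is only an illustration of breaking a word list into lines: the greedy fill, the name fill_words, and the 11-column width are assumptions made for the example, not anything the spec prescribes. A real formatter would likely justify or hyphenate as well.

```perl
use strict;
use warnings;

# Greedy line filling: keep adding words to the current line until the
# next one would push it past $width, then start a new line.
# fill_words and $width are hypothetical names for this sketch.
sub fill_words {
    my ($width, @words) = @_;
    my @lines;
    my $line = '';
    foreach my $w (@words) {
        if ($line eq '') {
            $line = $w;              # first word on a line is always placed
        } elsif (length($line) + 1 + length($w) <= $width) {
            $line .= " $w";          # one space, whatever the source had
        } else {
            push @lines, $line;      # line is full; start the next one
            $line = $w;
        }
    }
    push @lines, $line if $line ne '';
    return @lines;
}

# Data characters of the first <li> above, markup already removed.
# split ' ' skips leading white space and splits on runs of white space.
my @words = split ' ', " w1\nw2\n   w3 w4 w5 \nw6 ";
print "$_\n" foreach fill_words(11, @words);
```

With an 11-column width this prints "w1 w2 w3 w4" and then "w5 w6". The line breaks and extra spaces in the source have no effect on the result.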
Dan