Dan explains it all very well, and I take back my earlier comments.
If I didn't want to see 2.0 ratified soon, I might suggest that
at least a part of Dan's explanation should be included in the spec,
but I don't think that we can afford the time. On a related note,
the description of DT/DD still does not match the DTD with respect
to multiple <DD>s, but I don't think that it is worth delaying the
spec any longer if it can be avoided.
>
> Firstly, this paragraph says "should," so it's not binding.
>
> But here's the way I'd like to see it done:
> 1. Break the body into block-structuring elements (and headings).
>
> 2. Take the data characters in the content of a block
> structuring element (or heading) and its descendants, and break
> them into words, delimited by spaces.
>
> 3. Typeset the words into paragraphs. Put as much space
> between words as necessary to make it look nice, independent
> of where the spaces were in the source.
>
> This little perl ditty may help illustrate:
>
> $html = <<EOF;
> <li> w1
> w2
> w3 <em>w4 w5 </em>
>
> w6 <li>w7 w8
> w9 w10
> <li>w11
> EOF
>
> @paras = split(/<li>/, $html); # split body into paragraphs
> # a real parser would do this
> # as per SGML
>
> shift(@paras); # perl's split operator creates an empty para
> # before the first <li>
>
> grep(s-<(/)?\w+>--g, @paras); # get rid of markup inside para -- we're
> # only interested in data chars here.
>
> print "start with this:\n$html";
>
> foreach $p (@paras){
>     $p =~ s/^\s+//; # get rid of leading space
>     $p =~ s/\s+$//; # and trailing space.
>     @words = split(/\s+/, $p);
>
>     print "\ntypeset these words into a para: ", join(',', @words), "\n";
> }
>
> Its output is this:
>
> start with this:
> <li> w1
> w2
> w3 <em>w4 w5 </em>
>
> w6 <li>w7 w8
> w9 w10
> <li>w11
>
> typeset these words into a para: w1,w2,w3,w4,w5,w6
>
> typeset these words into a para: w7,w8,w9,w10
>
> typeset these words into a para: w11
>
>
> Dan
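
For what it's worth, step 3 of Dan's outline (typesetting the words) could be sketched along the same lines. This is only a rough illustration using greedy line filling: the name fill_words and its width argument are hypothetical, not from Dan's post or the spec, and a real browser would measure with font metrics rather than counting characters.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not part of Dan's
# script): greedily pack words into lines of at most $width characters,
# joining words with a single space regardless of the source spacing.
sub fill_words {
    my ($width, @words) = @_;
    my @lines;
    my $line = '';
    foreach my $w (@words) {
        if ($line eq '') {
            $line = $w;                 # first word on a line is always placed
        } elsif (length($line) + 1 + length($w) <= $width) {
            $line .= " $w";             # word fits after one space
        } else {
            push @lines, $line;         # line is full; start a new one
            $line = $w;
        }
    }
    push @lines, $line if $line ne '';  # flush the last partial line
    return @lines;
}

# The word list of the first paragraph from the script's output.
my @words = qw(w1 w2 w3 w4 w5 w6);
print "$_\n" foreach fill_words(8, @words);
```

With a width of 8 this prints "w1 w2 w3" and then "w4 w5 w6" on separate lines. The point is that the line breaks depend only on the word list and the width, not on where the spaces and newlines fell in the source.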