Re: Correct syntax of <LI> tags

Murray Maloney (murray@sco.COM)
Thu, 22 Jun 95 11:12:08 EDT

Daniel W. Connolly writes:
> In message <9506221321.AA10318@mailer.oclc.org>, Ian Graham writes:
> >Murray made the suggestion:
> >> .....
> >> Whitespace (spaces and tabs) between tags and data characters
> >> should be discarded and multiple spaces collapsed
> >> into a single space during this process.
> >If the above were applied, what would happen with things like:
> > data <tag> data <tag>....
> >Would this be compressed to data<tag>data...., and completely
> >remove the whitespace separating the text data?
>
> That's why the spec says what it says the way it says it:
> http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_6.html#SEC63
>
> |... each block structuring element is regarded as a paragraph by
> |taking the data characters in its content and the content of its
> |descendant elements, concatenating them, and splitting the result into
> |words, separated by space, tab, or record end characters (and perhaps
> |hyphen characters). The sequence of words is typeset as a paragraph by
> |breaking it into lines.

Dan explains it all very well, and I take back my earlier comments.
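
To check Ian's "data <tag> data" case against the spec's wording, here is
a minimal sketch in the same perl style. The <em> tag stands in for Ian's
unnamed <tag>, and the sub names words and fused_words are made up for
this illustration; none of this comes from the spec itself.

```perl
# Spec reading: take the data characters (whitespace included),
# then split them into words on space, tab, or record end.
sub words {
    my($text) = @_;
    $text =~ s-</?\w+>--g;              # delete the tags only
    return split(' ', $text);           # any run of whitespace delimits
}

# My earlier suggestion: discard whitespace next to tags first.
sub fused_words {
    my($text) = @_;
    $text =~ s-\s*(</?\w+>)\s*-$1-g;    # drop whitespace around each tag
    return words($text);
}

$para = "data <em> data </em> more\tdata\n";

print join(',', words($para)), "\n";        # data,data,more,data
print join(',', fused_words($para)), "\n";  # datadatamore,data
```

The first line of output keeps the words apart; the second shows the
fusing that Ian was worried about.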

If I didn't want to see 2.0 ratified soon, I might suggest that
at least a part of Dan's explanation should be included in the spec,
but I don't think that we can afford the time. On a related note,
the description of DT/DD still does not match the DTD with respect
to multiple <DD>s, but I don't think that it is worth delaying the
spec any longer if it can be avoided.

>
> Firstly, this paragraph says "should," so it's not binding.
>
> But here's the way I'd like to see it done:
> 1. break the body into block-structuring elements (and headings).
>
> 2. Take the data characters in the content of a block
> structuring element (or heading) and its descendants, and break
> it into words, delimited by spaces.
>
> 3. Typeset the words into paragraphs. Put as much space
> between words as necessary to make it look nice, independent
> of where the spaces were in the source.
>
> This little perl ditty may help illustrate:
>
> $html = <<EOF;
> <li> w1
> w2
> w3 <em>w4 w5 </em>
>
> w6 <li>w7 w8
> w9 w10
> <li>w11
> EOF
>
> @paras = split(/<li>/, $html); # split body into paragraphs
> # a real parser would do this
> # as per SGML
>
> shift(@paras); # perl's split operator creates an empty para
> # before the first <li>
>
> grep(s-<(/)?\w+>--g, @paras); # get rid of markup inside para -- we're
> # only interested in data chars here.
>
> print "start with this:\n$html";
>
> foreach $p (@paras){
>     $p =~ s/^\s+//; # get rid of leading space
>     $p =~ s/\s+$//; # and trailing space.
>     @words = split(/\s+/, $p);
>
>     print "\ntypeset these words into a para: ", join(',', @words), "\n";
> }
>
> its output is this:
>
> start with this:
> <li> w1
> w2
> w3 <em>w4 w5 </em>
>
> w6 <li>w7 w8
> w9 w10
> <li>w11
>
> typeset these words into a para: w1,w2,w3,w4,w5,w6
>
> typeset these words into a para: w7,w8,w9,w10
>
> typeset these words into a para: w11
>
>
> Dan
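
Dan's script stops at the word list and leaves step 3 to the typesetter.
A rough sketch of one way to do that step is a greedy fill to a fixed
column width. The sub name fill and the width of 12 are invented for this
illustration, and a real browser would presumably measure font widths
rather than count characters.

```perl
# Greedy fill: pack words into lines no wider than $width columns,
# with a single space between words, wherever the spaces were in the source.
sub fill {
    my($width, @words) = @_;
    my(@lines);
    my($line) = '';
    foreach my $w (@words) {
        if ($line eq '') {
            $line = $w;                     # first word on the line
        } elsif (length($line) + 1 + length($w) <= $width) {
            $line .= " $w";                 # the word still fits
        } else {
            push(@lines, $line);            # line is full, start another
            $line = $w;
        }
    }
    push(@lines, $line) if $line ne '';
    return @lines;
}

print join("\n", fill(12, qw(w1 w2 w3 w4 w5 w6))), "\n";
# prints:
# w1 w2 w3 w4
# w5 w6
```

A word longer than the width simply gets a line to itself in this sketch.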