Re: dealing with new-lines

Thomas A. Fine <fine@cis.ohio-state.edu>
Date: Fri, 8 Jan 93 15:38:20 -0500
From: Thomas A. Fine <fine@cis.ohio-state.edu>
Message-id: <9301082038.AA14229@soccer.cis.ohio-state.edu>
To: connolly@pixel.convex.com, @cis.ohio-state.edu@cis.ohio-state.edu
Subject: Re: dealing with new-lines 
Cc: www-talk@nxoc01.cern.ch
X-Mailer: Perl Mail System v1.1

>Darn good question. Your approach appears to have the correct
>results, but I'm not sure it's practical for many implementations
>(global search-and-replace operations are inconvenient for
>sequential processing models), and it certainly isn't a healthy
>way to think about SGML documents.

But most browsers seem to cache the document anyway, which means they can
do a global search/replace.  Even so, you can do it more or less
sequentially: just buffer each run of new-lines until you know what follows
it, and then deal with it.  No method you can propose is both correct and
free of storing something somewhere.
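
A minimal sketch of that buffering idea, in Python. The tokenizer, the
function names, and the KEEP set are assumptions made for illustration
(KEEP is taken from rule 1 quoted later in this message); it is not code
from this thread. It ignores comments, quoted attribute values containing
">", and the special status of text after <PLAINTEXT>.

```python
import re

# Tags whose surrounding new-lines are left alone (assumed from rule 1).
KEEP = {"PRE", "/PRE", "A", "/A", "PLAINTEXT"}

# Naive tag splitter, good enough for a demonstration only.
TOKEN = re.compile(r"(<[^>]*>)")

def tag_name(tok):
    """Upper-cased name of a '<...>' token, keeping any leading '/'."""
    parts = tok[1:-1].split()
    return parts[0].upper() if parts else ""

def strip_newlines(html):
    out = []
    pending = ""         # trailing new-lines held until the next token is seen
    after_eater = False  # True right after a tag that removes new-lines
    for tok in TOKEN.split(html):
        if not tok:
            continue
        if tok.startswith("<"):
            eats = tag_name(tok) not in KEEP
            if not eats:
                out.append(pending)   # held new-lines survive before <A>, <PRE>
            pending = ""              # otherwise they are dropped
            out.append(tok)
            after_eater = eats
        else:
            if after_eater:
                tok = tok.lstrip("\n")    # new-lines right after the tag
            body = tok.rstrip("\n")
            out.append(body)
            pending = tok[len(body):]     # decide their fate at the next token
            after_eater = False
    out.append(pending)  # nothing follows, so keep whatever is still held
    return "".join(out)
```

The only state carried between tokens is one string of held new-lines and
one flag, which is the "storing something somewhere" referred to above.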

>The way to think about SGML documents, IMHO, is this: the sequence
>of characters comprising an SGML document are presented to an
>SGML parser, which parses the markup from the data and passes
>the "results" to the processing application.

This is another alternative I considered.  But I figured that I have to
deal with various parsing things when I read the HTML anyway.  I was
just going to take each chunk of data (with anchors pre-processed out)
and remove all whitespace at the beginning and end (except for PRE sections
and such).  But if someone put in whitespace, why should I muck with it?
Who knows, they might have even wanted it there.
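
For comparison, the chunk-trimming alternative described above might look
like the following Python sketch. The (text, in_pre) pairing and the name
trim_chunk are hypothetical; how chunks are produced is left out.

```python
def trim_chunk(text, in_pre=False):
    """Strip leading/trailing whitespace unless the chunk is PRE content."""
    return text if in_pre else text.strip()

# Each entry is (chunk text, whether it came from a PRE section).
chunks = [("  Some text \n", False), ("  col1   col2\n", True)]
trimmed = [trim_chunk(text, in_pre) for text, in_pre in chunks]
```

The first chunk loses the spaces its writer put around it, which is exactly
the objection raised above; the PRE chunk is passed through untouched.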

>>1. For each tag NOT in
>>     <PRE> </PRE> <A> </A> <PLAINTEXT>
>>   remove ALL surrounding new-lines.
>
>First, let's get one thing straight: the PLAINTEXT element as
>described by the original HTML documentation is not representable
>in SGML. For my purposes, I consider the HTML document to
>end at the <PLAINTEXT> tag, and I consider the rest of the
>data stream to be an RFC-822 message body or a MIME text/plain body,
>and not SGML at all.

I hadn't meant otherwise.  But you have to read it in anyway, and since
my method deals with things prior to any other parsing, you treat it
all as one clump.

>Next, let's keep in mind that you can't do things like the following
>global substitution,
>s/\n+(<(H1|H2|ADDRESS...)>)/$1/g;
>because it might find things that look like tags but aren't,
>for example
>
><foo bar="
><H1>this is a little kooky, but nonetheless legal and possible.">
>
>But even if you're using a proper SGML parser, consider:
>
><H1>Here we go!
><a href="#xyz">click here</a>
>There we went!
></H1>
>
>The parser will return an H1 start tag, and then the
>string "Here we go!\n". At this point, your rule doesn't
>tell me what to do with the newline. I have to get
>the next object before I decide.

As I said before, you have to do some sort of storage at some point
anyway.
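
The quoted objection to a blind global substitution is easy to reproduce.
The Python sketch below is an assumed rendering of that substitution with
the tag list cut down to three names; the pattern name NAIVE and the sample
string are made up for the demonstration.

```python
import re

# Remove new-lines that come immediately before certain start tags.
NAIVE = re.compile(r"\n+(<(?:H1|H2|ADDRESS)>)")

# An attribute value that happens to contain something tag-shaped.
doc = '<foo bar="\n<H1>this is legal.">\n'
result = NAIVE.sub(r"\1", doc)

# The new-line inside the attribute value has been removed, so the
# substitution altered data that was never markup.
changed = result != doc
```

Any approach that works on raw characters rather than parsed tokens has
this problem, whether it runs globally or sequentially.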

>Hmm... I guess that's reasonable. But I'd rather just pass all the

As I said before, you have to do some sort of storage at some point
anyway.

>My point is: don't use whitespace to represent significant
>information except in the PRE element. Use the tags that
>are defined to have significance.

I suppose I agree with this more or less, at least from the point of view
of generating my own code.  But we have to make something clear: can
a browser keep all the whitespace if it wants to?  In other words, can
an HTML generator assume that whitespace will be collapsed, or must it
merely be aware that it might be?

	 tom