Re: Line break canonicalization in HTTP servers

Marc VanHeyningen (mvanheyn@cs.indiana.edu)
Thu, 13 Oct 1994 06:31:21 +0100

By the way, I should start off by mentioning that, if you really want
to reach the community of WWW developers involved in this, you
probably should be posting to the www-talk mailing list instead of
here. I'm cc'ing this to that list.

Thus said eps@cs.sfsu.edu:
>In article <13350.781886473@hound.cs.indiana.edu>
> Marc VanHeyningen <mvanheyn@cs.indiana.edu> writes:
>>I don't know what "anything with the character of text" means,
>>frankly... could you define that a little more rigorously?
>
>Content-Type: text/*
>Content-Type: image/x-xbitmap
>Content-Type: image/x-xpixmap
>+ Also, a number of application subtypes
>
>(For the UNIX weenies out there: if it makes sense to edit it
>with "vi," it has the character of text.)
>
>Does that clarify things, or do you need something more rigorous?

Well, what you need is a precise algorithm by which a server can
decide whether a particular object is textual or not, so it knows
whether or not to canonicalize line breaks. Is application/postscript
a textual type, for instance? application/pdf? application/rtf?
application/mac-binhex40?

These issues are not totally settled in the MIME community. The
suggested way to represent this information is a "textualnewlines"
flag in mailcap files. I think it'd be great for HTTP servers to
understand this flag (and, more generally, for mailcap mechanisms to
be extended so that file typing information can be associated with
content-types, letting HTTP servers use mailcap files as a general
mechanism for labeling outgoing files), but I don't expect this to
happen real soon.
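
To make the mailcap idea concrete, here is a minimal sketch of how a
server might read a "textualnewlines" setting from a single mailcap
entry. Everything in it is an assumption for illustration: the function
names are hypothetical, the example entries are made up, escaped
semicolons are not handled, and the sketch accepts both a bare flag and
a "textualnewlines=1" style value because the exact spelling is not
pinned down here. No existing server is claimed to work this way.

```python
def parse_mailcap_entry(line):
    """Split one mailcap entry into (content_type, view_command, fields).

    Assumption: the entry contains no escaped semicolons ("\\;");
    a complete parser would have to handle them.
    """
    parts = [part.strip() for part in line.split(";")]
    content_type = parts[0].lower()
    view_command = parts[1] if len(parts) > 1 else ""
    fields = {}
    for part in parts[2:]:
        if not part:
            continue
        name, _, value = part.partition("=")
        fields[name.strip().lower()] = value.strip()
    return content_type, view_command, fields


def has_textual_newlines(fields):
    """True if the entry marks its type as line-oriented text.

    A bare "textualnewlines" flag or any value other than "0"
    counts as set; a missing field counts as not set.
    """
    value = fields.get("textualnewlines")
    if value is None:
        return False
    return value != "0"


# Hypothetical entries, for illustration only.
_, _, ps_fields = parse_mailcap_entry(
    "application/postscript; lpr %s; textualnewlines=1")
_, _, gif_fields = parse_mailcap_entry("image/gif; xv %s")
print(has_textual_newlines(ps_fields), has_textual_newlines(gif_fields))
```

A server using something like this would still need a policy for types
that have no mailcap entry at all, which is exactly the unsettled part.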

>>HTTP doesn't use 7bit transfer encoding.
>
>Let me make sure I understand what's going on: there's no HTTP
>RFC that says this, there's no current Draft RFC that says this
>(only a note that one was deleted September 1), and although
>HTTP uses the same headers with the same meanings as other
>protocols that all behave in a consistent (and well-understood)
>manner, HTTP has to be gratuitously incompatible. And we're all
>supposed to be psychic and know this already, and then look the
>other way. Right. Which planet did you say you were from?

There's no need to be rude; it wasn't my decision (nor the decision of
any one individual). The spec could be made clearer on this point
(and on lots of other points too). Mostly, though, server authors
didn't worry about the issue, and since it didn't break anything,
nobody else worried about it either.

>>In an HTTP context, "binary" is the default CTE. Such a header would
>>be redundant (indeed, I think some clients would barf on it, alas.)
>
>That's a very poor choice; it makes HTTP useless as a transport
>for text/plain and text/html (not to mention those MIME types
>that are explicitly 7bit-only). I don't think that's what you
>intended. Nor do I feel good about making 8bit the default,
>however tempting that may seem. You could propose a new "unix"
>CTE default, and try to get that wired into the HTTP spec. But
>that's not going to go over well with the standards track folks.

Whom exactly are you addressing as "you"? I didn't write the spec for
HTTP, though I did help write one particular implementation of it.
The people who did are mostly not here. Bashing the authors of free
software, in a forum which is not the forum in which discussion of the
HTTP protocol normally takes place, is unlikely to elicit good
results.

Your complaint doesn't make much sense. "binary" is a superset of
"8bit" and "7bit"; the reason the latter two exist is so that content
that may need encoding to get through hostile transport agents can be
handled by gateways. That distinction doesn't make sense in a native
binary-clean environment, which HTTP is. Other than data possibly
passing through an HTTP-to-mail gateway, what purpose is served by the
labeling you suggest?
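
The superset relationship can be sketched as a function that reports
the narrowest CTE label a body would qualify for. This is only an
illustration under stated assumptions: the function name is
hypothetical, the 998-octet line limit (the 1000-octet SMTP line limit
minus the CRLF) is an assumption about where "short lines" end, and
bare CR or LF and NUL octets are treated as disqualifying under a
strict reading.

```python
MAX_LINE_OCTETS = 998  # assumption: 1000-octet line limit minus CRLF


def narrowest_cte(body: bytes, max_line: int = MAX_LINE_OCTETS) -> str:
    """Return "7bit", "8bit", or "binary" for the given body."""
    lines = body.split(b"\r\n")
    # Every CRLF-delimited line must be short enough.
    short_lines = all(len(line) <= max_line for line in lines)
    # After splitting on CRLF, any remaining CR or LF is a bare one.
    clean = not any(
        b"\r" in line or b"\n" in line or b"\0" in line for line in lines
    )
    if not (short_lines and clean):
        return "binary"
    # Only octets below 128 means the data fits in 7bit.
    if all(octet < 128 for octet in body):
        return "7bit"
    return "8bit"


print(narrowest_cte(b"hello\r\nworld\r\n"))   # short ASCII lines
print(narrowest_cte(b"caf\xe9\r\n"))          # one high-bit octet
print(narrowest_cte(b"x" * 2000))             # one over-long line
```

Anything that qualifies as "7bit" also qualifies as "8bit", and anything
at all qualifies as "binary", so on a binary-clean transport the
narrower labels add no information the receiver needs.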

>It all comes down to this: why should a server explicitly
>claiming MIME-Version: 1.0 in response to a GET *knowingly*
>violate RFC 1521?

I am inclined to agree with you that the MIME version header
could/should be dropped. HTTP uses MIME content-types, but they've
since been renamed "Internet Media Types" and are used in plenty of
contexts other than email.

>Server view: (conservative)
>
> text/plain should be transmitted with <CR><LF> as the line
> separator. If the content contains only us-ascii characters,
> Content-Transfer-Encoding: need not be present, but if used,
> shall not be 8bit. If the content contains any characters
> with (128-255) octet values, Content-Transfer-Encoding: is
> required, and 8bit is appropriate. If the character set is not
> iso-8859-1 (or us-ascii), charset= must be specified.
> Documents in other character sets, which can be expressed in
> us-ascii or iso-8859-1 without conversion loss, should be
> converted.

This differs from RFC 1521 in several important ways. MIME doesn't
allow iso-8859-1 to be the default charset, nor does MIME forbid 8bit
CTE for content with no 8bit data (though such is obviously a bad idea
in the context of email from the standpoint of maximizing
interoperability.) I thought your point was that 1521 should be
followed???

The only substantive point you seem to have is that text should have
its line breaks canonicalized into CRLF instead of being sent out with
UNIX-centric newlines. On this I agree with you.
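
For what it's worth, the canonicalization itself is the easy part. A
minimal sketch, assuming the server has already decided the object is
textual (which is the hard part discussed above); the function name is
hypothetical:

```python
import re

# Match CRLF first so an existing CRLF is not split into two breaks.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def canonicalize_line_breaks(body: bytes) -> bytes:
    """Rewrite every LF, CR, or CRLF line break as CRLF."""
    return _LINE_BREAK.sub(b"\r\n", body)


print(canonicalize_line_breaks(b"one\ntwo\r\nthree\rfour"))
```

Because already-canonical CRLF sequences are left alone, applying the
function twice gives the same result as applying it once. Note that the
body length changes, so any Content-Length has to be computed after
canonicalization.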

> text/html should be transmitted with <CR><LF> as the line
> separator. Content-Transfer-Encoding: 8bit may be used if
> the content contains any characters with (128-255) octet
> values, but since such characters can be expressed as
> numeric entities, HTML authors should be strongly
> encouraged to keep their documents us-ascii conformant.

Does text/html even allow a "charset" parameter? Where is this
defined?

> Content-Transfer-Encoding: binary should not be used in
> conjunction with text types.

So how should text files that have lines longer than those permitted
by 8bit CTE be sent? Quoted-printable? :-) Where in RFC 1521 is
sending text with a CTE of binary through a binary-clean transport
forbidden?

Look, you're right that there are several areas where the HTTP
community has wandered away from the way the MIME community does
things. Many existing clients' integration with MIME borders on
laughable, and some real issues, like line break canonicalization and
the handling of content-type parameters, haven't been appropriately
addressed. In other cases, the divergence is quite reasonable: MIME's
CTE family is intended to work around problems which simply don't
exist in HTTP.

--
Marc VanHeyningen  <http://www.cs.indiana.edu/hyplan/mvanheyn.html>