CR not stripped properly on FTP transfers

Bill Janssen <janssen@parc.xerox.com>
Message-id: <0g6Fu48B0KGW5geexr@holmes.parc.xerox.com>
Date: 	Fri, 11 Jun 1993 17:08:36 PDT
Sender: Bill Janssen <janssen@parc.xerox.com>
From: Bill Janssen <janssen@parc.xerox.com>
To: timbl@www3.cern.ch
Subject: CR not stripped properly on FTP transfers
Cc: www-talk@nxoc01.cern.ch

OK, I think I see the problem.

1)  HTFTP.c always uses ASCII mode FTP transfers.

2)  In ASCII mode transfers of UNIX source documents, most UNIX ftp
servers always put a CR before every LF.

[Note:  It's not clear under what conditions this *should* be done.  On
UNIX, every file is in some sense a binary file, interpretable only by
an "understanding" program.  On other systems, say VMS, the notion of
"line" is in the file system for line-oriented files, and VMS FTP
servers might choose, for non-line-oriented files being transmitted in
ASCII mode, not to put CR before every line feed, but just at the end of
every block, or not at all (after all, there are no "lines").  The RFC
for FTP says:

>         3.1.1.1.  ASCII TYPE

>             This is the default type and must be accepted by all FTP
>             implementations.  It is intended primarily for the transfer
>             of text files, except when both hosts would find the EBCDIC
>             type more convenient.

>             The sender converts the data from an internal character
>             representation to the standard 8-bit NVT-ASCII
>             representation (see the Telnet specification).  The receiver
>             will convert the data from the standard form to his own
>             internal form.

>             In accordance with the NVT standard, the <CRLF> sequence
>             should be used where necessary to denote the end of a line
>             of text.  (See the discussion of file structure at the end
>             of the Section on Data Representation and Storage.)

Since HTFTP never seems to transmit the STRU command, we assume that the
file is transferred with structure "file-structure", which means,
according to the RFC, that

>          3.1.2.1.  FILE STRUCTURE

>             File structure is the default to be assumed if the STRUcture
>             command has not been used.

>             In file-structure there is no internal structure and the
>             file is considered to be a continuous sequence of data
>             bytes.

The TELNET RFC (which I believe is still 854; does something later
obsolete it?) says about CR and LF,

>       Therefore, the sequence "CR LF" must be treated as a single "new
>       line" character and used whenever their combined action is
>       intended; the sequence "CR NUL" must be used where a carriage
>       return alone is actually desired; and the CR character must be
>       avoided in other contexts.

So my reading would be that for UNIX files being transmitted with type =
"ASCII" and structure = "file-structure", the sender would have to put a
NUL character after every CR, and should *not* put a CR in front of any
LF, since there really are no "lines of text" in the UNIX file system.

This is probably why I'm not a net.wizard.

end Note]

3)  WWW should then change CR-LF pairs to simple LF, but there is no
routine which does this.  HTCopyNoCR discards *all* CRs, which is
usually OK for text documents, but not for binary documents like tar
files.  There's also no easy way in HTFormat.c to "think" about this,
because it's a characteristic of the input stream, not of the input
format (which is something more like "www/mime").

4)  Because of this, "binary" documents are copied to the output with
HTCopy, which does not strip any characters, even the ones which
*should* be stripped.
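
To make point 3 concrete, here is a minimal sketch of the missing
routine.  It assumes the server did nothing more than insert a CR before
each LF.  The names crlf_state, crlf_to_lf and crlf_flush are
hypothetical and are not taken from the WWW library.  The routine
collapses each CR LF pair to LF and passes every other byte through,
including a lone CR, which is what HTCopyNoCR would wrongly discard.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the WWW library.
 * pending_cr remembers a CR seen at the end of the previous buffer, so a
 * CR LF pair split across two reads is still recognized. */
typedef struct { int pending_cr; } crlf_state;

/* Copies n bytes from in to out, collapsing each CR LF pair to LF.
 * Returns the number of bytes written; out needs room for n + 1 bytes. */
size_t crlf_to_lf(crlf_state *st, const char *in, size_t n, char *out)
{
    size_t i, o = 0;

    assert(st != NULL);
    for (i = 0; i < n; i++) {
        char c = in[i];
        if (st->pending_cr) {
            st->pending_cr = 0;
            if (c != '\n')
                out[o++] = '\r';   /* the held CR was not part of a pair */
        }
        if (c == '\r')
            st->pending_cr = 1;    /* hold it until the next byte is seen */
        else
            out[o++] = c;
    }
    return o;
}

/* Call once at end of stream: emits a CR that is still being held. */
size_t crlf_flush(crlf_state *st, char *out)
{
    assert(st != NULL);
    if (!st->pending_cr)
        return 0;
    st->pending_cr = 0;
    out[0] = '\r';
    return 1;
}
```

If the server really does turn every LF into CR LF, then a CR LF that
was already in a tar file arrives as CR CR LF and comes back out of this
routine as CR LF, so the conversion is reversible under that assumption.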

So the problem is that the actions of the FTP protocol are not being
reversed anywhere.  This particular trait is a characteristic of the
transfer protocol being used, not of the document's input format or of
the transfer-encoding.  It seems like the right thing to do is to remove
the simple file_number model used in HTFormat.c, and replace it with an
input stream, which could then be specialized as necessary to undo the
actions of the transfer protocol.  Another possibility would be to
always use BINARY mode transfers in the FTP module (since that's what
HTFormat.c seems to expect), and add CR if necessary in the WWW
library.  Perhaps both of these should be implemented.
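
A rough sketch of what such an input stream might look like follows.
Every name in it (in_stream, mem_stream, ftp_ascii_stream) is invented
for illustration; nothing here is existing WWW library code.  The memory
buffer stands in for the data connection, and the FTP ASCII wrapper is
one possible specialization that undoes the CR insertion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interface: get() returns the next byte (0..255), or -1
 * at end of stream. */
typedef struct in_stream in_stream;
struct in_stream {
    int (*get)(in_stream *self);
};

/* Raw source reading from memory; a real one would read a socket. */
typedef struct {
    in_stream base;              /* must be first so the casts are valid */
    const unsigned char *data;
    size_t len, pos;
} mem_stream;

static int mem_get(in_stream *self)
{
    mem_stream *m = (mem_stream *)self;
    return m->pos < m->len ? m->data[m->pos++] : -1;
}

void mem_stream_init(mem_stream *m, const void *data, size_t len)
{
    m->base.get = mem_get;
    m->data = data;
    m->len = len;
    m->pos = 0;
}

/* Specialization for FTP ASCII transfers: wraps another stream and
 * turns each CR LF pair back into LF, leaving a lone CR alone. */
typedef struct {
    in_stream base;
    in_stream *src;
    int have_peek;               /* nonzero if peek holds a read-ahead byte */
    int peek;
} ftp_ascii_stream;

static int ftp_ascii_get(in_stream *self)
{
    ftp_ascii_stream *f = (ftp_ascii_stream *)self;
    int c;

    if (f->have_peek) {
        f->have_peek = 0;
        c = f->peek;
    } else {
        c = f->src->get(f->src);
    }
    if (c == '\r') {
        int next = f->src->get(f->src);
        if (next == '\n')
            return '\n';         /* CR LF pair: deliver a single LF */
        f->peek = next;          /* lone CR: keep the byte we read ahead */
        f->have_peek = 1;
    }
    return c;
}

void ftp_ascii_stream_init(ftp_ascii_stream *f, in_stream *src)
{
    assert(src != NULL);
    f->base.get = ftp_ascii_get;
    f->src = src;
    f->have_peek = 0;
    f->peek = -1;
}
```

With something like this, the format code would only ever see an
in_stream.  The FTP module would hand back the wrapped stream for ASCII
transfers and the raw one for BINARY transfers, so the format code would
not need to know which protocol delivered the bytes.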

Bill