Re: [www-mling,00154] charset parameter (long)

Bob Jung (bobj@mcom.com)
Tue, 10 Jan 95 00:50:23 EST

The goals of my proposal are to
(1) Provide a means for new servers & browsers to correctly
handle existing (unmodified) Web data in various character
set encodings.
(2) Not break the current servers and browsers (any more than
they are already broken) with regard to handling these code sets.

The proposal does not try to fix things that are broken in existing
clients/servers.

I agree with Larry Masinter <masinter@parc.xerox.com>
that in this proposal we should replace the

Accept-charset=xxx

request header with

accept-parameter charset=xxx

Larry, thanks for the update.

Here are my replies to the thoughtful comments of:
Daniel W. Connolly <connolly@hal.com>
Ken Itakura <itakura@jrdv04.enet.dec-j.co.jp>

Daniel>|7.1.1. The charset parameter
Daniel>|
Daniel>| [...]
Daniel>|
Daniel>| The default character set, which
Daniel>| must be assumed in the absence of a charset parameter, is US-ASCII.
Daniel>
Daniel>This conflicts somewhat with your proposal. However, the RFC goes on
Daniel>to say...
Daniel>
Daniel>| The specification for any future subtypes of "text" must specify
Daniel>| whether or not they will also utilize a "charset" parameter, and may
Daniel>| possibly restrict its values as well.
Daniel>
Daniel>I wonder if changing the default from "US-ASCII" to
Daniel>"implementation-dependent" can be considered "restricting the values"
Daniel>of the charset parameter.

I agree that if the charset parameter is not specified, the default
***should*** be US-ASCII (or ISO8859-1, if it's been changed).

Unfortunately, since charset was reserved for future use, Japanese servers
had no choice but to serve non-Latin files without a charset parameter!!

Why don't we enforce the default for servers using a future version of the
HTTP protocol, and let current versions be "implementation dependent" in
order to preserve backwards compatibility?

Daniel>I suppose the relevant scenario is where an info provider serves up
Daniel>an ISO2022-JP document with a plain old:
Daniel> Content-Type: text/plain
Daniel>header. I gather that this is current practice.

Yes, this is the current practice (for text/html too). Additionally, some
files are sent in SJIS and EUC code set encodings with the same headers.

Daniel>That intent is already mucked up somewhat by the fact that normal html
Daniel>documents are allowed to have bytes>127, which are normally
Daniel>interpreted as per ISO8859-1. So we already have the situation where a
Daniel>conforming HTTP client, say on a DOS box might retrieve a text/html
Daniel>document and pass it over to a conforming MIME user agent, which would
Daniel>then blast it to the screen. The user would lose, cuz the bytes>127
Daniel>would get all fouled up.

Yes, this situation is broken for current browsers/servers and I do not
propose to fix it. Using my proposal, a new DOS browser would send:

accept-parameter: charset=x-pc850

and a new server would send back:

Content-Type: text/html; charset=ISO8859-1

and the new DOS browser should convert it to PC 850 for rendering.
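
The conversion step itself is mechanical. A minimal sketch in Python
(the sample bytes are invented; "latin-1" and "cp850" are the codec
names for ISO8859-1 and PC code page 850):

# Sketch: a charset-aware DOS client re-encoding a Latin1 response
# body for its local display. The sample bytes are hypothetical.
latin1_body = b"Expos\xe9"               # as sent: charset=ISO8859-1

text = latin1_body.decode("latin-1")     # interpret per the charset tag
pc850_body = text.encode("cp850")        # re-encode for the DOS screen

print(pc850_body)                        # b'Expos\x82' (0x82 is e-acute in PC 850)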

Daniel>But... back to the case of ISO2022-JP encoded data tagged as plain
Daniel>"text/html". The business of slapping "charset=ISO8559-1" on the end
Daniel>would muck things up. So where do we assign fault?
Daniel>My vote is to assign fault at the information provider for serving up
Daniel>-JP encoded data without tagging it as such.

We are not trying to fix existing browsers/servers.

If a new charset-enabled server sends the wrong charset header (or fails
to send one for non-Latin1 data) to a new charset-enabled browser, it
is the server's fault.

Daniel>So all those information providers serving up ISO2022-JP data without
Daniel>tagging it as such are violating the protocol. This doesn't prevent
Daniel>NetScape and other vendors from hacking in some heuristic way to
Daniel>handle such a protocol violation. But the spec shouldn't condone this
Daniel>behaviour.

Unfortunately, the spec is lagging behind the implementations. The spec
did not provide a means for the existing servers to resolve this problem.
Pragmatically, I cannot introduce a server or client product that breaks
established conventions.

As mentioned above, can't this be handled with HTTP versioning?

HTTP V1.0 && no charset parameter == implementation defined
HTTP V3.0(?) && no charset parameter == ISO8859-1
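
A sketch of that rule as a server or client might apply it ("3.0" is
only a stand-in for whichever future HTTP version makes the default
mandatory):

# Sketch of the versioning rule above. "(3, 0)" is a placeholder for
# whatever future HTTP version makes the default mandatory.
def effective_charset(http_version, charset_param, local_convention):
    if charset_param:                  # an explicit tag always wins
        return charset_param
    if http_version >= (3, 0):         # future protocol: fixed default
        return "ISO8859-1"
    return local_convention            # HTTP 1.0: implementation defined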

Daniel>Ok... so now let's suppose all the information providers agree to
Daniel>clean up their act. Somehow, they have to get their HTTP servers to
Daniel>tag -JP documents as such.
Daniel>
Daniel>How do they do this? File extension mappings? It's not relevant to the
Daniel>HTML or HTTP specs,
Daniel>but I think the overall proposal is incomplete
Daniel>until we have a workable proposal for how to enhance the major httpd
Daniel>implementations to correctly label non-ISO8859-1 documents.

Yes, I explicitly left this out of my proposal, but you're right, we need
to discuss the implications.

Ken> - Before encouraging servers to label non-ISO8859-1 data correctly, we
Ken> must give servers a way to know what they should label it. Otherwise,
Ken> nobody can blame a server that distributes mislabeled information.
Ken>
Ken>The third one is the difficult problem. The situation for mail may be
Ken>simple, since the user knows what encoding he is using, so he can
Ken>specify the correct label before sending. (A user who doesn't know
Ken>about encodings at all must not rely on the default encoding.) But the
Ken>situation for Web documents is difficult. I think neither file extension
Ken>mapping nor classification by directory structure is suitable.
Ken>My current idea is 'server default' + 'directory default' + 'mapping file'.
Ken>But I myself don't like my idea. Does anyone have a more elegant idea?

Initially, I assume most Web data will be configured on a directory or file
basis. I imagine most files will be configured by what directory they live
in. This should be a relatively easy extension to how existing servers
parse their config files.

Files like the Japanese fj newsgroup archives (in ISO2022-JP) are already
organized by directory. So are a lot of Japanese Web pages.
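
As a sketch of what I mean, a server could keep a per-directory table
like the one below (the paths and charset names are invented examples)
and pick the longest matching prefix:

# Sketch: per-directory charset table, as loaded from a server config
# file. The directories and charset names are invented examples.
CHARSET_BY_DIR = {
    "/archive/fj/":  "ISO2022-JP",
    "/docs/sjis/":   "x-sjis",
    "/docs/euc/":    "x-euc-jp",
}

def charset_for(path, server_default="ISO8859-1"):
    best, charset = "", server_default
    for prefix, cs in CHARSET_BY_DIR.items():
        # longest matching directory prefix wins
        if path.startswith(prefix) and len(prefix) > len(best):
            best, charset = prefix, cs
    return charset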

A web site with versions of the same files in different encodings
(e.g., SJIS, EUC and JIS) or languages (e.g., English and Japanese) could
create separately rooted trees with the equivalent files in each tree.
The top page could say "click here for SJIS/EUC/JIS" or "English/Japanese".

Configuration on a file-by-file basis would be supported too, but I'd
expect it to be used infrequently. Besides, maintaining that configuration
database would be a Web server administrator's nightmare.

I don't like the idea of new file extensions, although current server
software probably could support them. I think the data should really
identify itself and not rely upon extensions. Also, we don't want
to make people rename their files. For example, how are you going
to rename the news archives?

<TANGENT= warning not relevant to current proposal>

Ultimately, I'd like the content itself to specify the encoding.

One idea is an HTML <charset> tag that would take precedence over the
MIME header:

<html>
<charset=xxx>
<head> <title> DOCUMENT TITLE GOES HERE </title> </head>
<body>
<h1> MAJOR HEADING GOES HERE </h1>

THE REST OF THE DOCUMENT GOES HERE

</body>
</html>
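
One wrinkle is the bootstrap problem: the client must find the tag
before it knows the encoding. That should work out, because the tag's
plain-ASCII bytes stay recognizable near the top of the file in all the
encodings under discussion (Latin1, SJIS, EUC, ISO2022-JP). A sketch of
such a pre-scan (remember, <charset=xxx> is only the hypothetical tag
above, not real HTML):

import re

# Sketch: pre-scan the raw bytes for the hypothetical <charset=xxx>
# tag proposed above, before decoding the document proper.
CHARSET_TAG = re.compile(rb"<charset=([A-Za-z0-9._-]+)>")

def sniff_charset(raw_bytes, fallback="ISO8859-1"):
    m = CHARSET_TAG.search(raw_bytes[:1024])   # look only at the head
    return m.group(1).decode("ascii") if m else fallback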

</TANGENT>

Daniel>Then web clients will start seeing:
Daniel>
Daniel> Content-Type: text/html; charset="ISO2022-JP"

Only new charset-enabled clients will see this.

Daniel>Many of them will balk at this and punt to "save to file?" mode.
Daniel>
Daniel>Is that a bad thing? For standard NCSA Mosaic 2.4, no, because
Daniel>it can't do any reasonable rendering of these documents anyway.
Daniel>
Daniel>But what about the multi-localized version of Mosaic? Does it handle
Daniel>charset=... reasonably? What's the cost of enhancing it to do so and
Daniel>deploying the enhanced version?
Daniel>
Daniel>The proposal says that the server should not give the charset=
Daniel>parameter unless the client advertises support for it. I think that
Daniel>will cause more trouble than it's worth (see the above scenario of
Daniel>untagged -JP documents being passed from HTTP clients to MIME user
Daniel>agents on a DOS box.)

Why is this more trouble? It's broken now and it remains broken. In either
case the old client would ignore the charset information and guess at the
encoding (for most clients the guess would be 8859-1).

But the purpose of NOT returning the charset parameter is to avoid
breaking current clients' parsing of the MIME Content-Type. If the server
always slapped charset on, current clients would parse the header:

Content-Type: text/html; charset=ISO8859-1

and think the content type was the entire 'text/html; charset=ISO8859-1'
string, not just 'text/html', and would fail to read Latin1 files!!!!!

To be backwards compatible, the servers should not send the charset
parameter to old browsers.
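
To make the failure concrete, here is a sketch of the broken comparison
next to a correct one:

# Sketch of the failure mode: an old client that compares the whole
# header value instead of splitting off MIME parameters first.
header_value = "text/html; charset=ISO8859-1"

# Broken (old) client: exact string match fails, so a perfectly
# good Latin1 page gets rejected or punted to "save to file".
print(header_value == "text/html")                 # False

# Correct client: the media type is everything before the ';'.
media_type = header_value.split(";", 1)[0].strip()
print(media_type == "text/html")                   # True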

Daniel>One outstanding question is: does text/html include all charset=
Daniel>variations or just latin1? That is, when a client says:
Daniel>
Daniel> Accept: text/html
Daniel>
Daniel>is it implying acceptance of all variations of html, or just latin1?
Daniel>
Daniel>To be precise, if a client only groks latin1, and it says accept:
Daniel>text/html, and the server sends ISO2022-JP encoded text, and the user
Daniel>loses, is the fault in the client for not supporting ISO2022-JP, or at
Daniel>the server for giving something the client didn't ask for?
Daniel>
Daniel>First, "text/html" is just shorthand for "text/html; charset=ISO8859-1"
Daniel>so the client didn't advertise support for -JP data.
Daniel>
Daniel>But "giving somethign the client didn't ask for" is _not_ an HTTP
Daniel>protocol viloation (at least not if you ask me; the ink still isn't
Daniel>dry on the HTTP 1.0 RFC though...). It's something that the client
Daniel>should be prepared for.

As you put it "It's something that the client should be prepared for."

I'm still assuming that

accept-parameter: charset=xxx

dictates whether the server sends back the charset parameter. An old browser
should continue to get the 2022-JP data untagged. A new charset-enabled
browser should get tagged 2022-JP data even if it only advertised 8859-1.
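
In other words, the server-side rule is just this (a sketch, not a
spec; header names follow the accept-parameter form above):

# Sketch of the rule above: tag the response only if the client
# advertised the charset parameter in its request headers.
def content_type_line(doc_charset, client_sent_charset_param):
    if client_sent_charset_param:          # new, charset-enabled client
        return "Content-Type: text/html; charset=%s" % doc_charset
    return "Content-Type: text/html"       # old client: leave untagged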

Daniel>As above, the server is bound to give "charset=ISO2022-JP" if it is
Daniel>not returning latin1 data. So the client does know that it's not
Daniel>getting latin1 data. It has the responsibility to interpret the
Daniel>charset correctly, or save the data to a file or report "sorry, I
Daniel>don't grok this data" to the user. If it blindly blasts ISO2022-JP
Daniel>tagged data to an ASCII/Latin1 context, then it's broken.

I agree. I've purposely read EUC and JIS pages on my Mac (SJIS), so that I
could save the source and grok it later. (Not a typical user...)

I'm glad you bring up this point, so we can consider the implications.
But what the client does in this situation should be implementation
dependent and not part of this proposal.

Daniel>Does this mean that charset negotiation is completely unnecessary?
Daniel>No. It's not necessary in any of the above scenarios, but it would be
Daniel>necessary in the case where information can be provided in, for
Daniel>example, unicode UCS-2, UTF-8, UTF-7, or ISO2022-JP, but the client
Daniel>only groks UTF-8.
Daniel>
Daniel>In that case, something like:
Daniel>
Daniel> Accept-Charset: ISO8859-1, ISO2022-JP
Daniel>
Daniel>or perhaps
Daniel>
Daniel> Accept-Parameter: charset=ISO8859-1, charset=ISO2022-JP
Daniel>
Daniel>I'm not convinced of the need for the generality of the latter syntax.
Daniel>Besides: we ought to allow preferences to be specified ala:
Daniel>
Daniel> Accept-Charset: ISO8859-1; q=1
Daniel> Accept-Charset: Unicode-UCS-2; q=1
Daniel> Accept-Charset: Unicode-UTF-8; q=0.5
Daniel> Accept-Charset: Unicode-UTF-7; q=0.4
Daniel> Accept-Charset: ISO2022-JP; q=0.2
Daniel>
Daniel>which says "if you've got latin1 or UCS2, I like that just fine. If
Daniel>you have UTF-8, UTF-7, or -JP, I'll take it, but I won't like it as
Daniel>much."
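
For what it's worth, the q-value form Daniel shows is easy to act on.
A sketch of how a server might resolve those preferences against the
encodings it has on hand (parsing deliberately simplified):

# Sketch: resolve q-value preferences against available encodings.
# Header parsing is simplified; a real server would be stricter.
def pick_charset(accepted, available):
    prefs = {}
    for item in accepted:                      # e.g. "Unicode-UTF-8; q=0.5"
        name, _, q = item.partition(";")
        prefs[name.strip()] = float(q.split("=")[1]) if q else 1.0
    candidates = [cs for cs in available if cs in prefs]
    return max(candidates, key=lambda cs: prefs[cs]) if candidates else None

pick_charset(["ISO8859-1; q=1", "Unicode-UTF-8; q=0.5"],
             ["Unicode-UTF-8", "ISO2022-JP"])  # -> "Unicode-UTF-8"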

Ken>I want to add one more thing about this issue. We could have documents
Ken>which use multiple charsets in the future. We must define a way to label
Ken>such a document.
Ken>It could be something like ...
Ken> Content-Type: text/html; charset="ISO2022-JP", charset="ISO8859-6"
Ken>Is this OK?

I'd rather leave this as a possible future direction. Multilingual support
has generated a lot of heated discussion. If we can agree on a means to
support the existing mono-lingual, mono-encoded Web data, that will allow
us to create products that fill an immediate need. Can we phrase something
that leaves this open and discuss it in another thread?

Regards,
Bob

Bob Jung +1 415 528-2688 fax +1 415 254-2601
Netscape Communications Corp. 501 E. Middlefield Mtn View, CA 94041