Re: [www-mling,00154] charset parameter (long)

Daniel W. Connolly (connolly@hal.com)
Tue, 10 Jan 95 02:54:12 EST

In message <199501100548.VAA17487@neon.mcom.com>, Bob Jung writes:
>Unfortunately, the spec is lagging behind the implementations. The spec
>did not provide a means for the existing servers to resolve this problem.
>Pragmatically, I cannot introduce a server or client product that breaks
>established conventions.

OK... let's agree on a perspective before we get to the details:

Yes, the specs lag behind the implementations. This is The Internet
Way! The alternative is The ISO Way. The ISO Way is Bad.

The IETF/Internet way goes like this:
1. you propose what you think is a good idea, knowing that
Murphy's law rules, and it's incomplete.

2. a bunch of folks implement it and find the issues that
you forgot.

3. At this point (where we are now) you just go back and
clarify the original description. You decide which of the
existing practices are to become "standard," which ones
are violations of the standard, and which ones are
outside the scope of the standard. Now you have a very
clear model of the original concepts from step 1, and
you write them up with extensibility and compatibility
with things coming down the pipe in mind.

4. Folks fix up the minor discrepancies with the refined
standard, and they go like mad implementing the hot, new,
unstandardized features. After a while, you revisit
step 3...

Now... back to the details...

>Daniel>|7.1.1. The charset parameter
>Daniel>|
>Daniel>| [...]
>Daniel>|
>Daniel>| The default character set, which
>Daniel>| must be assumed in the absence of a charset parameter, is US-ASCII.
>Daniel>
>Daniel>This conflicts somewhat with your proposal.
>
>I agree that if the charset parameter is not specified, the default
>***should*** be US-ASCII (or ISO8859-1, if it's been changed).
>
>Unfortunately, since charset was reserved for future use, Japanese servers
>had no choice but to serve non-Latin files without a charset parameter!!

Right. They found an issue not covered in the original spec. The question
is whether we should call this legal, illegal, or outside of scope.

>Why don't we enforce the default for servers using a future version of the
>HTTP protocol? ...and let current versions be "implementation dependent"
>in order to preserve backwards compatibility?

Well... we can call it illegal and put a note in the spec that says
"look out for violations" or we can say that it's undefined -- a
client can't count on seeing latin1 when there's no charset= param.

I think that calling it illegal is more consistent with current
practice: lots of clients _do_ expect to see latin1. And it's
consistent with the MIME spec.

In my book, if you have a conforming client talking to a conforming
server, you shouldn't be able to observe any faults. What you get
with, say, doslynx trying to display a japanese file counts as a fault
in my book. I think the spec should say that somebody was doing
something wrong there.

>Daniel>But... back to the case of ISO2022-JP encoded data tagged as plain
>Daniel>"text/html". The business of slapping "charset=ISO8859-1" on the end
>Daniel>would muck things up. So where do we assign fault?
>Daniel>My vote is to assign fault at the information provider for serving up
>Daniel>-JP encoded data without tagging it as such.
>
>We are not trying to fix existing browsers/servers.

But we are trying to classify their behaviour as
conforming/illegal/out-of-scope.

>If a new charset-enabled server slaps the wrong charset header (or fails
>to slap on a header for non-Latin1) to a new charset-enabled browser, it
>is the server's fault.

Agreed.

>Daniel>So all those information providers serving up ISO2022-JP data without
>Daniel>tagging it as such are violating the protocol. This doesn't prevent
>Daniel>NetScape and other vendors from hacking in some heuristic way to
>Daniel>handle such a protocol violation. But the spec shouldn't condone this
>Daniel>behaviour.
>
>Unfortunately, the spec is lagging behind the implementations. The spec
>did not provide a means for the existing servers to resolve this problem.
>Pragmatically, I cannot introduce a server or client product that breaks
>established conventions.

You _can_ deploy a client that goes above and beyond the call of
duty in supporting broken servers. It's done all the time, and
it makes the users smile.

>As mentioned above, can't this be handled with HTTP versioning?
>
> HTTP V1.0 && no charset parameter == implementation defined
> HTTP V3.0(?) && no charset parameter == ISO8859-1

I don't think this special case is motivated...

>Daniel>Then web clients will start seeing:
>Daniel>
>Daniel> Content-Type: text/html; charset="ISO2022-JP"
>
>Only new charset-enabled clients will see this.

All clients will see this. Why not?

The alternative is that information providers are motivated to write
"you need a japanese-happy browser to follow this link..." Sending the
charset= parameter to old clients causes them to think -- rightly --
that they've got data that they can't reliably display, and that they
should offer to save to a file or whatever.

>Daniel>The proposal says that the server should not give the charset=
>Daniel>parameter unless the client advertises support for it. I think that
>Daniel>will cause more trouble than it's worth (see the above scenario of
>Daniel>untagged -JP documents being passed from HTTP clients to MIME user
>Daniel>agents on a DOS box.)
>
>Why is this more trouble?

A server has to say "hmmm... did the client advertise support for
charset parameter? No... leave it off and hope." That's (1) certainly
more code than not doing the check, and (2) hoping against terrible
odds.

> It's broken now and remains broken.

The case of sending untagged japanese text is broken. Tagging it with
an appropriate charset will result in _more_ reliable behaviour, if
anything (consider UCS-2 data with nulls in it: more browsers would
reliably save it to a file if they didn't consider it text.)

> In either
>case it would ignore the charset information and guess at the
>encoding (for most clients the guess would be 8859-1).

I'm suggesting that there is no ignoring nor guessing: it's 8859-1, or
it's tagged otherwise.
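
To make that rule concrete, here is a minimal Python sketch of how a
client might apply it. The helper names are hypothetical, and the
latin1 default is an assumption taken from this thread (the MIME text
quoted earlier says US-ASCII). Note that the media type is split from
its parameters before anything else is looked at.

```python
# Hypothetical helper names; not taken from any spec or product.
DEFAULT_CHARSET = "ISO-8859-1"  # assumption: latin1 is the specified default

def parse_content_type(header):
    """Split a Content-Type value into (media_type, params)."""
    media_type, _, rest = header.partition(";")
    params = {}
    for part in rest.split(";"):
        name, sep, value = part.partition("=")
        if sep:  # skip empty or malformed fragments
            params[name.strip().lower()] = value.strip().strip('"')
    return media_type.strip().lower(), params

def effective_charset(header):
    """Return the tagged charset, or the default when none is given."""
    _, params = parse_content_type(header)
    return params.get("charset", DEFAULT_CHARSET)
```

With this sketch, a bare "text/html" yields the default, and
'text/html; charset="ISO2022-JP"' yields "ISO2022-JP". There is no
guessing step anywhere.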

>But the purpose of NOT returning the charset parameter, has to do with
>not breaking the client parsing of the MIME Content-Type. If the server
>always slapped charset on, current clients would parse the header:
>
> Content-Type: text/html; charset=ISO8859-1
>
>and think the content type was the entire 'text/html; charset=ISO8859-1'
>not just 'text/html' string and would fail to read Latin1 files!!!!!
>
>To be backwards compatible, the servers should not send the charset
>parameter to old browsers.

In the case of latin1 text, I agree. Don't send the charset parameter
explicitly. Rely on the specified default. But for japanese text,
why not send the charset= parameter?
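
As a rough sketch of what I mean on the server side (Python, with a
hypothetical function name and an assumed list of spellings for
latin1): leave the parameter off for latin1, send it for everything
else, and never look at what the client advertised.

```python
# Assumption: these spellings all mean latin1; the list is illustrative only.
LATIN1_NAMES = {"iso-8859-1", "iso8859-1", "latin1"}

def content_type_for(media_type, charset):
    """Build a Content-Type value, relying on the default for latin1."""
    if charset.lower() in LATIN1_NAMES:
        # Old clients keep seeing the bare media type they already parse.
        return media_type
    # Anything else is tagged, whether or not the client asked for it.
    return f'{media_type}; charset="{charset}"'
```

So content_type_for("text/html", "ISO8859-1") gives plain "text/html",
while content_type_for("text/html", "ISO2022-JP") gives
'text/html; charset="ISO2022-JP"'.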

>I'm still assuming that
>
> accept-parameter: charset=xxx
>
>dictates if the server sends back the charset parameter. An old browser
>should continue to get the 2022-JP data untagged.

Why?

>Daniel>As above, the server is bound to give "charset=ISO2022-JP" if it is
>Daniel>not returning latin1 data. So the client does know that it's not
>Daniel>getting latin1 data. It has the responsibility to interpret the
>Daniel>charset correctly, or save the data to a file or report "sorry, I
>Daniel>don't grok this data" to the user. If it blindly blasts ISO2022-JP
>Daniel>tagged data to an ASCII/Latin1 context, then it's broken.
>
>[...]
>But what the client does in this situation should be implementation
>dependent and not part of this proposal.

Right. I should have said "it has the responsibility to display
the data correctly, or save to a file, or whatever..."
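
For illustration only, a Latin-1-only client could be sketched like
this in Python. The names and the set of supported charsets are
assumptions, and "save" stands in for whatever fallback the
implementation prefers.

```python
# Assumption: this client can render US-ASCII and latin1, nothing else.
SUPPORTED = {"us-ascii", "iso-8859-1"}

def handle_body(body, charset):
    """Return ("display", text) or ("save", raw_bytes)."""
    if charset.lower() in SUPPORTED:
        # latin-1 decoding also covers US-ASCII, which is a subset of it.
        return ("display", body.decode("latin-1"))
    # Unknown charset: do not blast the bytes at the screen.
    return ("save", body)
```

The point of the sketch is only that the tagged charset is checked
before any bytes reach a latin1 display.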

>Ken>I want to add one more thing about this issue. We could have the document
>Ken>which uses multiple charset in future. We must define the way to label
>Ken>such a document.
>Ken>It can be like ...
>Ken> Content-Type: text/html; charset="ISO2022-JP", charset="ISO8859-6"
>Ken>Is this OK?

No. If you use characters from different character sets, there is
still one character encoding, that is, a mapping of octets to characters,
that applies to the whole document. You have to specify that
mapping. If it doesn't have a registered name, you're out of luck.
The alternative is an unholy mess like the SGML declaration. No thanks.
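
A small Python sketch of the point, assuming the standard library's
"iso2022_jp" codec as a stand-in for the registered ISO-2022-JP
encoding: a document mixing ASCII letters and kanji is still covered
by one encoding name, because that single encoding switches character
sets internally with escape sequences.

```python
# Sample text is made up for illustration: ASCII letters plus three kanji.
text = "HTML \u65e5\u672c\u8a9e"

# One encoding name maps the whole document to octets and back.
data = text.encode("iso2022_jp")
round_trip = data.decode("iso2022_jp")
```

Here round_trip equals text, so a single charset= value is enough to
label the document.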

>I'd rather leave this as a possible future direction.

Ok... we can stick this issue on the shelf rather than in the trash.
But I'm pessimistic :-}

Dan