Re: changes to HTML draft re: character sets

Daniel W. Connolly (connolly@hal.com)
Mon, 16 Jan 95 12:46:15 EST

In message <95Jan16.005738pst.2760@golden.parc.xerox.com>, Larry Masinter writes:
>I had a few comments back on the proposed changes. I'm still fuzzy,
>though, on what the process is for actually getting the draft edited.
>Can someone remind me as to who is editing what?

The document is maintained at Spyglass. Submissions are processed
by hand, right Eric?

> Rather than restricting its value, we can avoid limiting
>the value of the charset parameter but note that the allowable values
>for "charset" may be limited by the context. This pushes the WWW
>problem to HTTP-WG, and makes HTML 2.0 neutral on the issue. (That
>is, the spec is compatible with current practice but also describes a
>legal extension mechanism for going beyond current practice.)

This sounds like the way to go, to me.

The critical distinction is that existing implementations are largely
_compatible_ with this specification, even though they don't implement
all of it. For example:

Existing clients that see:

Content-Type: text/html

blah blah bla in US-ASCII

handle it consistently with the MIME spec: they display the right
US-ASCII characters.

The fact that HTTP servers currently send:

Content-Type: text/html

blah blah in ISO8859-1, with accented characters

isn't quite conforming to the MIME spec, but it's a reasonable
extension, given that the default Content-Transfer-Encoding in HTTP is
binary.

Existing clients that see:

Content-Type: text/html; charset="ISO-2022-JP"

blah blah blah with escape sequences and Japanese data

behave in a conforming manner: they treat it as an unrecognized
content type and handle it like application/octet-stream (e.g.
they offer to save it to a file).
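
As a rough sketch of the three cases above (not taken from any existing
client; KNOWN_CHARSETS and handle() are made-up names, and the set of
charsets a client can render is an assumption), the dispatch might look
something like this in Python:

```python
from email.message import Message

# Assumption: the charsets this hypothetical client is able to render.
KNOWN_CHARSETS = {"us-ascii", "iso-8859-1"}

def handle(content_type):
    """Return "display" or "save" for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")  # None when no charset parameter is present
    if msg.get_content_type() != "text/html":
        return "save"
    if charset is None or str(charset).lower() in KNOWN_CHARSETS:
        return "display"
    # Unrecognized charset: fall back to application/octet-stream handling.
    return "save"
```

Here handle("text/html") gives "display", while
handle('text/html; charset="ISO-2022-JP"') gives "save".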

The specification should include some sort of NOTE: to cover the
two current practices that are broken:

1. Most existing clients misbehave when seeing:

Content-Type: text/html; charset=US-ASCII

blah blah bla in US-ASCII

or

Content-Type: text/html; charset=ISO-8859-1

blah blah in ISO8859-1, with accented characters

They are required to handle it just like they handle unadorned "text/html",
but they do not.

2. Some existing servers send:

Content-Type: text/html

blah blah ISO2022-style escape sequences etc.

This has no specified meaning for conforming clients.
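
To make the two broken cases concrete, here is a small Python sketch of
checks one might write for them. The function names are hypothetical, and
treating a bare ESC byte as evidence of ISO 2022 style data is a
simplifying assumption rather than anything the spec says:

```python
from email.message import Message

ESC = 0x1B  # ISO 2022 escape sequences begin with this byte

def declared_charset(content_type):
    """Return the lowercased charset parameter, or None if there is none."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")
    return None if value is None else str(value).lower()

def same_as_unadorned(content_type):
    """Case 1: an explicit US-ASCII or ISO-8859-1 label should not change handling."""
    return declared_charset(content_type) in (None, "us-ascii", "iso-8859-1")

def unlabelled_escapes(content_type, body):
    """Case 2: escape sequences in a body (bytes) that carries no charset label."""
    return declared_charset(content_type) is None and ESC in body
```

A client could use the first check to route explicitly labelled US-ASCII
or ISO-8859-1 documents through its ordinary text/html path, and the
second to notice data whose meaning the spec leaves undefined.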

Daniel W. Connolly "We believe in the interconnectedness of all things"
Software Engineer, Hal Software Systems, OLIAS project (512) 834-9962 x5010
<connolly@hal.com> http://www.hal.com/%7Econnolly