The pragmatic proposal below is driven by an existing (and pressing)
need and by a desire not to derail long-term multilingual solutions.
My assumptions:
o There are lots of Japanese Web pages in ISO-2022-JP that we need to
be able to browse today.
o Lots more non-Latin1 text files will be (or already are) created
on the Web.
o Any solution must not require changes to the contents of Web files.
o Unicode in one form or another will be used for future Web pages.
o New clients should not break existing servers and new servers
should not break existing clients (backwards-compatibility).
Comments, please!
-bob
============================================================================
"Accept-Charset" and "charset" Support for Web Browsers
In order to render single language (actually single character set encoding)
text files on the Web correctly, a mechanism is needed to identify the
character set encoding per text file. For example, files encoded in
ISO-2022-JP should be rendered as Japanese and files encoded in ISO 8859-1
should be rendered as Latin characters. Currently, there is no
deterministic way to know the character set encoding of a text file.
The MIME content type header provides a mechanism for this by means of the
"charset=xxx" parameter.
For example:
Content-Type: text/plain; charset=ISO_8859-1:1987
or
Content-Type: text/html; charset=ISO-2022-JP
The problem is that many browsers today do not parse for parameters and
will be confused by the above examples. Some browsers will take the
entire string "text/plain; charset=ISO_8859-1:1987" instead of "text/plain"
as the content type.
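The difference between the two behaviors can be sketched in a few lines.
This is only an illustration, not any particular browser's code: the
function names parse_content_type and naive_content_type are made up for
this example, and the parser ignores corner cases such as quoted values
that contain a semicolon.

```python
def parse_content_type(value):
    """Split a Content-Type value into (media_type, params).

    Simplified sketch: splits on ';' and '=' only, so quoted values
    that themselves contain ';' are not handled.
    """
    parts = value.split(";")
    media_type = parts[0].strip().lower()
    params = {}
    for part in parts[1:]:
        name, sep, val = part.partition("=")
        if sep:
            # Parameter names are case-insensitive; keep the value as sent.
            params[name.strip().lower()] = val.strip().strip('"')
    return media_type, params


def naive_content_type(value):
    """What a parameter-unaware browser effectively does: keep everything."""
    return value.strip()


header = "text/plain; charset=ISO_8859-1:1987"
print(parse_content_type(header))
print(naive_content_type(header))
```

A browser behaving like naive_content_type looks up the whole string as a
content type, fails to find it, and cannot pick a viewer for the file.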
Therefore, I suggest that charset-parameter-savvy browsers send servers a
new accept header, "Accept-Charset". This would look like:
Accept-Charset: ISO_8859-1:1987
Accept-Charset: ISO-2022-JP
The "Accept-Charset" header was proposed by Gavin Nichol in a document
sent to several mailing lists (html.wg, http.wg, www.unicode),
"Handling Multilingual Documents in the WWW". See
http://www10.w3.org/hypertext/WWW/Administration/Mailing/Outside_mailing.html
I propose that servers send the MIME charset parameter only if they have
received an "Accept-Charset" header from the browser. This convention will
prevent compatibility problems with current browsers.
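On the server side, this convention amounts to one extra check when the
response headers are built. The sketch below is an assumption about how a
server might do it; the function name content_type_header and the
representation of request headers as (name, value) pairs are invented for
the example.

```python
def content_type_header(media_type, charset, request_headers):
    """Build the Content-Type line for a response.

    request_headers is a list of (name, value) pairs from the request.
    The charset parameter is added only when the browser has announced,
    by sending any Accept-Charset header, that it can parse parameters.
    """
    savvy = any(name.lower() == "accept-charset"
                for name, _value in request_headers)
    if savvy and charset:
        return "Content-Type: %s; charset=%s" % (media_type, charset)
    # Old browser (or unknown charset): send the bare type, as servers do today.
    return "Content-Type: %s" % media_type


old_browser = [("Accept", "text/html")]
new_browser = [("Accept", "text/html"), ("Accept-Charset", "ISO-2022-JP")]
print(content_type_header("text/html", "ISO-2022-JP", old_browser))
print(content_type_header("text/html", "ISO-2022-JP", new_browser))
```

Note that the check is only for the presence of the header, not for a
match against the file's charset, so the server remains free to send a
charset the browser did not list.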
The charset-parameter-savvy browsers should send "Accept-Charset" headers
for the charsets they recognize.
The "Accept-Charset" header should NOT restrict servers from sending text
files in other charsets. It is the browsers' responsibility to handle
unsupported charsets gracefully.
If the browser receives text files without charset information, then the
behavior will be implementation dependent. In this case, I suggest that
the browser use a per-window default. This allows knowledgeable users to
read Japanese newsgroups in one browser window and French newsgroups in
another one, even if the charset is not specified in the headers. Ideally,
all text files will provide charset headers, but the per-window default
would provide users with a means to deal with unidentified text data.
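A per-window default could work roughly as follows. This is a sketch
under assumptions: the class name BrowserWindow, the method charset_for,
and the choice of ISO_8859-1:1987 as the initial default are all
hypothetical and not taken from any existing browser.

```python
class BrowserWindow:
    """One browser window with its own user-settable default charset."""

    def __init__(self, default_charset="ISO_8859-1:1987"):
        self.default_charset = default_charset

    def charset_for(self, content_type):
        """Return the charset to use for a document.

        An explicit charset parameter in the Content-Type value wins;
        otherwise the window's default is used.
        """
        for part in content_type.split(";")[1:]:
            name, sep, val = part.partition("=")
            if sep and name.strip().lower() == "charset":
                return val.strip()
        return self.default_charset


japanese_news = BrowserWindow(default_charset="ISO-2022-JP")
french_news = BrowserWindow()  # keeps the Latin-1 default

# Unlabeled documents follow the window they are opened in.
print(japanese_news.charset_for("text/plain"))
print(french_news.charset_for("text/plain"))
# A labeled document overrides the window default.
print(french_news.charset_for("text/html; charset=ISO-2022-JP"))
```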
It is the browser's responsibility to know how to render the text file
correctly. This may require converting from the character set encoding of
the file to another internal character set encoding. Whether this internal
encoding is Unicode or some other encoding is implementation dependent.
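As one illustration of such a conversion, the sketch below assumes the
browser uses Unicode strings internally. The CODECS table and the
function to_internal are hypothetical; the table maps the charset names
used in this note to Python's built-in codecs, and unknown charsets fall
back to Latin-1 as one possible way of degrading gracefully.

```python
# Hypothetical mapping from MIME charset names to Python codec names.
CODECS = {
    "iso-2022-jp": "iso2022_jp",
    "iso_8859-1:1987": "latin_1",
}


def to_internal(data, charset):
    """Convert raw document bytes to the internal (Unicode) form."""
    codec = CODECS.get(charset.lower())
    if codec is None:
        # Unsupported charset: show something rather than fail outright.
        codec = "latin_1"
    # errors="replace" substitutes a marker for undecodable byte sequences.
    return data.decode(codec, errors="replace")


latin_bytes = b"caf\xe9"
japanese_bytes = "\u65e5\u672c\u8a9e".encode("iso2022_jp")

print(to_internal(latin_bytes, "ISO_8859-1:1987"))
print(to_internal(japanese_bytes, "ISO-2022-JP"))
```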
In the future, there may be HTML tags that specify the character set
encoding at a finer granularity (e.g., per string rather than per URL).
Such tags may be required to implement multilingual HTML documents. When
(or if) these tags exist, they would take precedence over the MIME charset
header information.
However, the MIME charset header information will remain useful for new and
(especially) existing "single" language documents. The MIME charset
information will allow existing documents to be rendered correctly without
modifying their contents by adding new HTML tags.