Re: SGML/MIME charsets, ERCS/Unicode [was: New DTD (final version?) ]

Roy T. Fielding (fielding@avron.ICS.UCI.EDU)
Sat, 11 Feb 95 01:23:37 EST

> (This message is probably counterproductive but...)

This entire thread is counterproductive. To quote from RFC 1603
(IETF Working Group Guidelines):

   A working group is typically created to address a specific problem or
   produce a deliverable (a guideline, standards specification, etc.)
   and is expected to be short-lived in nature. Upon completion of its
   goals and achievement of its objectives, the working group as a unit
   is terminated. Alternatively at the discretion of the IESG, Area
   Director, the WG Chair and the WG participants, the objectives or
   assignment of the working group may be extended by enhancing or
   modifying the working group's charter.

The charter for this working group is at
<http://www.ietf.cnri.reston.va.us/html.charters/html-charter.html>.

Setting a standard for the internal characteristics of WWW browsers and
servers is not within the scope of our charter.

>>Finally, regarding character set issues.... they don't belong here.
>>HTML should be defined independently of the document character set to
>>whatever extent is possible under SGML.
>
> That is what ERCS is all about.

Fine. What specific changes to the HTML 2.0 specification are required
to enable HTML to be defined as an SGML application that uses ERCS,
such that an HTML document encoded in ISO-8859-1 remains legal (as
defined by the standard)? What impact will this change have on existing
practice?

That's all you have to say -- there is no need to convince the WG that
Unicode is good, that ERCS is good, that balkanization is bad, or that
the sky is falling! Just give us something that can be put in a spec
without violating the basic principles of our charter.

>>Under no circumstances will this group ever require that Web clients
>>and/or browsers use a specific character set other than ISO-8859-1 --
>>making it easy to use other character sets is desirable, but defining
>>a lingua franca is absolutely out of the question for this working
>>group.
>
> Why? Why choose ISO-8859-1? Why ignore the fact that we live in a
> multi-lingual world and bury our heads in the sands of ignorance and
> denial?

Why say such silly things on a public list? As stated below, ISO-8859-1
was chosen by the team at CERN when they implemented HTML for the WWW.
If you had bothered to read what I wrote, you would have noticed that it
means the same as "this WG will not require any specific character set,
unless we are forced to require one, in which case it will be ISO-8859-1."

This is a historical fact -- we can grunt and scream and grumble 'til
hell freezes over, and it still won't change that fact.

If you think the 2.0 specification unnecessarily states that ISO-8859-1
is required, then please indicate the specific section of the specification
which is in error and provide corrective text which will solve the problem.
The same goes for places where you think the wording can be eased to allow
any given character set without affecting currently valid implementations.

Our only requirements regarding character sets are that
a) the result be legal SGML, and
b) a document properly marked up in ISO-8859-1 text be legal HTML.

>>The same goes for HTTP -- it should be possible to transmit documents
>>in any character set using HTTP. Under no circumstances will the
>>http-wg ever require that Web clients and/or browsers use a specific
>>character set other than ISO-8859-1.
>
> Why not? Because *you* say so? Do you have the authority, or the
> audacity to make an arbitrary decision which could potentially cripple
> the interoperability of the WWW for years to come? To make a decision
> that could affect hundreds of thousands of people?

Yes, in the case of HTTP 1.x, I have both the authority and the audacity
to do so, unless the HTTP WG comes to consensus on a differing solution.
However, since that is beyond the scope of the HTTP charter, it seems
unlikely that the design decision of the HTTP/1.x group will be changed.
Nor is there any need for such a change. The protocol has been designed
to allow ANY character set to be used. There is absolutely no reason for
the protocol to require any character set except within the protocol
elements of the headers and commands (which are defined as US-ASCII for
maximum interoperability as an Internet protocol).
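
A minimal sketch of that split, in Python. The function name
build_response and the sample header values are hypothetical and are
not taken from any specification text; the only point illustrated is
that the header lines are encoded as US-ASCII while the body is encoded
in whatever character set the sender picks and labels.

```python
def build_response(text: str, charset: str) -> bytes:
    """Build a raw HTTP/1.0-style response (illustrative sketch only)."""
    # The entity body may use any character set the sender chooses.
    body = text.encode(charset)
    # Protocol elements (status line and headers) stay within US-ASCII.
    headers = (
        "HTTP/1.0 200 OK\r\n"
        f"Content-Type: text/html; charset={charset}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + body


# Split a response back into its US-ASCII head and its charset-specific body.
raw = build_response("<p>caf\u00e9</p>", "iso-8859-1")
head, _, payload = raw.partition(b"\r\n\r\n")
```

Passing a different charset (for example "iso-2022-jp") changes only the
body bytes and the charset label; the header encoding is unaffected.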

> At the very *least* you could offer a nice clean solution.

I just did.

>>The reason ISO-8859-1 is required is because at least one character set
>>must be required, and ISO-8859-1 was the most appropriate 8-bit,
>>ASCII-inclusive set when the web was invented.
>
> Oh. ASCII is god's gift to mankind I assume? You should read the
> scripture regarding Babel...

US-ASCII is the lingua franca of the Internet. It is the only character
set guaranteed to be understood by all Internet hosts, regardless of
national origin or local language. ISO-8859-x are special for Internet
mail because they represent supersets of US-ASCII.
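
The superset claim can be checked mechanically. The sketch below is in
Python and relies on its bundled codecs; the helper name
is_ascii_superset is hypothetical, and the choice of parts 1 through 9
of ISO-8859 is just an assumption about which parts to test. A character
set counts as a superset here if every byte from 0 to 127 decodes to the
same character it has in US-ASCII.

```python
# All 128 US-ASCII byte values and the text they represent.
ascii_bytes = bytes(range(128))
ascii_text = ascii_bytes.decode("ascii")


def is_ascii_superset(codec: str) -> bool:
    """True if bytes 0-127 decode under `codec` exactly as in US-ASCII."""
    try:
        return ascii_bytes.decode(codec) == ascii_text
    except UnicodeDecodeError:
        return False


# Check ISO-8859-1 through ISO-8859-9 using Python's standard codecs.
results = {
    f"iso-8859-{n}": is_ascii_superset(f"iso-8859-{n}") for n in range(1, 10)
}
```

An EBCDIC code page such as Python's "cp037" codec fails the same check,
which is the kind of character set the superset property rules out.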

And you should read the design issues of the WWW project. See
<http://www.w3.org/hypertext/WWW/DesignIssues/Overview.html>.
Note that they were mostly written 3 years ago.

>>If you want to talk about lingua francas and
>>what-the-parser-should-do and the future of the web, etc., it should
>>be done on www-talk. Setting standards for internal browser and
>>server implementations is not a job for the IETF.
>
> Oh? And exactly what do you think you are doing? Saying "well folks,
> here's a nifty idea, but hey, you don't really have to do this. I
> mean, this is all just an idea after all."

Read: "internal" vs "external". The IETF sets standards (usually interface
or formatting standards) for the Internet as a whole. It specifically
does not specify implementations, only the output of implementations.

>>If people are looking for something to fight for regarding Unicode,
>>let me suggest that they first get the three (4?) variations of Unicode
>>registered with IANA such that I can include their official names in
>>the HTTP/1.0 specification. It's damn difficult to provide for
>>character set negotiation when there is no single standard for the
>>character set name.
>
> In my list there are:
> ISO-10646-UCS-2
> ISO-10646-UCS-4
> ISO-10646-UTF-1
> UNICODE-1-1
> UNICODE-1-1-UTF-7
>
> Which are really 2 different character sets, and 5 different
> encodings.

None of which are registered with IANA as valid for use with an Internet
protocol (see MIME). So, register them. Then, I can include the official
names in the HTTP/1.0 specification and we will be one step closer to what
you desire.

> May I suggest that you go back to your isolationist world, study a
> little about character sets, encodings, multilingual issues, SGML, and
> then decide whether to come back and play in the global sandbox, or
> whether you should just bury your head deeper in the sand alluded to
> earlier.
>
> Your attitude is inexcusable, irresponsible, and verges very close to
> incorrigible.
>
> ---
> Gavin "Easily angered by bigots at 2am" Nicol
> NOT speaking for EBT!

And you are a fool. Given that I am the person who made the decision
that HTTP was not a MIME-conforming protocol for the EXPLICIT REASON that
such a decision was necessary to allow the use of Unicode and the
Content-Encoding header, I believe you owe me a public apology.
Furthermore, since I have spent most of the last three months defending
those decisions on three separate WG mailing lists to which I know you
subscribe, I think you could show a little more courtesy and at least
attempt to understand what I write before you respond to it, especially
when that response is written at 2am.

.....Roy Fielding ICS Grad Student, University of California, Irvine USA
<fielding@ics.uci.edu>
<URL:http://www.ics.uci.edu/dir/grad/Software/fielding>