Re: Revised language on: ISO/IEC 10646 as Document Character Set

Dan Connolly (connolly@w3.org)
Tue, 9 May 95 15:23:59 EDT

Erik van der Poel writes:
> > > > Does the HTTP charset have to be a subset of 10646?
> > > >
> > > >No. It can be anything. Making 10646 the doc charset doesn't place
> > > >any requirements on the HTTP charset.
> > >
> > > Except that all the characters in the document have to be within ISO
> > > 10646.
> >
> >I think I missed this too: We were talking about HTML, but the
> >above question is about HTTP. Nobody means to restrict the charset
> >parameter values for all of HTTP, right?
>
> Implementors need to know the relationship between the HTTP/MIME
> "charset" parameter and the HTML "document character set".

OK... now you're talking about HTML. What I meant is that we don't
mean to keep anybody from sending this message entity via HTTP:

Content-Type: text/x-my-own-media-type; charset=x-my-own-encoding

lajslir4liwj34liwj34l

But we _do_ mean to restrict the way HTML works over HTTP.

That relationship is specified as follows:

"HTML Document Representation"
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_3.html#SEC17

|A message entity with a content type of `text/html' represents an HTML
|document, consisting of a single text entity. The `charset' parameter
|(whether implicit or explicit) identifies a character encoding
|scheme. The text entity consists of the characters determined by this
|character encoding and the octets of the body of the message entity.
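
As a rough illustration of that paragraph (a sketch in Python, not text from the spec), a receiver might turn the octets of a text/html message body into characters like this. The function name decode_html_entity is hypothetical, and falling back to ISO-8859-1 when no charset parameter is given is an assumption made for the example.

```python
from email.message import Message


def decode_html_entity(content_type: str, body: bytes,
                       default_charset: str = "iso-8859-1") -> str:
    """Return the text entity for a message body (hypothetical helper)."""
    # Let the stdlib parse the Content-Type header value for us.
    msg = Message()
    msg["Content-Type"] = content_type
    # Explicit charset parameter if present, otherwise the assumed default.
    charset = msg.get_param("charset", default_charset)
    # The charset names a character encoding scheme: octets in, characters out.
    return body.decode(charset)


# Explicit charset: the octets are interpreted as ISO-2022-JP.
japanese = decode_html_entity("text/html; charset=ISO-2022-JP",
                              "\u3042".encode("iso-2022-jp"))

# Implicit charset: the assumed default applies.
latin = decode_html_entity("text/html", b"caf\xe9")
```

Like the quoted paragraph, the sketch yields a sequence of characters and nothing more.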

Hmmm... that doesn't really tell you what the document character
set is. This bit might help there: Under Conformance...

"Documents"
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_1.html#SEC4

|Its document character set includes ISO-8859-1 and agrees with
|ISO10646; that is, each code position listed in section The ISO-8859-1
|Coded Character Set is included, and each code position in the
|document character set is mapped to the same character as ISO10646
|designates for that code position.

This still doesn't exactly tell you what the document character set
is. At some point, the receiving system has to pick an SGML
declaration to stick in front of the received document instance for
parsing. The HTML 2.0 spec supplies one with ISO-8859-1 as the
document character set, and user agents "should" use it.

Hmm... seems like a clarification might help. Any suggestions?
Or is what's in the spec good enough?

> I don't care whether you put that in the HTML spec or somewhere else.
> But it has to be put somewhere.

OK. So I'm saying it has been put somewhere. Does this address
your concern, or would you like to suggest something more?

> Glenn seems to agree that the charset does not have to be a subset
> of 10646.

Er... well... no. Glenn says you can have characters in your document
that are not in the document character set. I don't really agree, and
it seems that neither does James Clark nor Charles Goldfarb. Anyway, I
believe that he meant to make a point about the SGML specification,
and not about the way HTML will be used.

The consensus I've heard is that ISO10646 is "good enough" and that
nobody is interested in HTML-based communications using characters
that are not in ISO10646. Could you give a motivating example of the
sorts of things that you want to do, and tell me whether you think it
conflicts with the current wording of the HTML 2.0 document?

> Can we remove the word "subset" from that part of your
> spec please. Or are you referring to something other than the charset
> in the following:
>
> The document character set is somewhat independent of the character
> encoding scheme used to represent a document. For example, the
> ISO-2022-JP character encoding scheme can be used for HTML documents,
> since its repertoire is a subset of the ISO10646 repertoire. The
> critical distinction is that numeric character references agree
> with ISO10646 regardless of how the document is encoded.

I was referring to the charset, in a way: the word "subset" is used
above to describe the relationship between two character repertoires,
i.e. two sets of characters: the set of characters that can be encoded
using ISO-2022-JP, and the set of characters that have code positions
in ISO10646. The conformance language above is meant to restrict the
values of the "charset" parameter to those character encoding schemes
whose repertoire is a subset of the ISO10646 repertoire.
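
A small sketch of that distinction, in Python, under stated assumptions: the helper names resolve_numeric_refs and parse are hypothetical, only decimal references of the form &#N; are handled, and Python's chr() is used as a stand-in for "the character ISO10646 designates for that code position" (Python code points agree with ISO10646). The same markup is sent under two different character encoding schemes, and the numeric character reference resolves to the same character both times.

```python
import re


def resolve_numeric_refs(text: str) -> str:
    """Replace decimal references like &#12354; with the designated character."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)


def parse(body: bytes, charset: str) -> str:
    """Decode the octets first, then resolve references (hypothetical helper)."""
    return resolve_numeric_refs(body.decode(charset))


# HIRAGANA LETTER A, once as a numeric reference and once as a literal.
markup = "<p>&#12354; = \u3042</p>"

via_jis = parse(markup.encode("iso-2022-jp"), "iso-2022-jp")
via_utf8 = parse(markup.encode("utf-8"), "utf-8")

# The octets on the wire differ, but the resulting characters do not.
same_text = via_jis == via_utf8
```

The reference is interpreted only after decoding, so the choice of encoding scheme never changes what &#12354; means.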