Re: Revised language on: ISO/IEC 10646 as Document Character Set

Albert Lunde (Albert-Lunde@nwu.edu)
Tue, 9 May 95 13:23:25 EDT

>% >Does the HTTP charset have to be a subset of 10646?
>% Yes.
>% >Why not just remove the restriction that the HTTP charset has to be a
>% >subset of 10646? I.e. remove the word "subset" somehow.
>% Because we cannot have ISO 10646 as the document character set then,
>% and we move back to square one without passing GO.
>I am lost. I believed that HTTP charset should be an 8 bit one.
>Couldn't we say that the HTTP charset may be either ISO 8859-1
>(or maybe each of the ISO 8859-x) or a suitable *encoding* of 10646 (there
>should be such a beast)? And that HTML browsers are requested to recognise
>all of these and to be able to display just a limited set of charsets?

I think you are confused.

There is nothing in the requirement that ISO 10646 be the document
character set that precludes using an eight-bit encoding like ISO Latin-1.

Let me try a "layman's" explanation, as I understand it.

The SGML document character set specifies the characters that will be
recognized by the SGML parsers. You can think of this as the internal
character representation used by SGML (though an implementation may do this
differently). This has nothing to do with the way characters go over the
wire.

Except for the question of resolving numeric references, what is
significant to SGML is mainly which characters are allowed in this set and
whether they are markup or data.

The MIME/HTTP charset parameter specifies the name of a character encoding,
that is, an actual mapping from octets (or groups of octets) going over
the wire to character names or glyphs.
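As a rough illustration of that mapping, here is a minimal Python sketch.
It assumes Python's built-in codecs as stand-ins for the named charsets
and Python's Unicode strings as a stand-in for ISO 10646 characters;
neither is part of the proposal itself.

```python
import unicodedata

# One octet, two charsets: the charset label decides which character it is.
octet = b"\xe9"
as_latin1 = octet.decode("iso-8859-1")
as_greek = octet.decode("iso-8859-7")

print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name(as_greek))   # GREEK SMALL LETTER IOTA

# One character, two wire forms: the character itself does not change.
e_acute = "\u00e9"
print(e_acute.encode("iso-8859-1"))  # b'\xe9'
print(e_acute.encode("utf-8"))       # b'\xc3\xa9'
```

The octets on the wire differ, but in each case they name a character
that has a position in ISO 10646.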

The first requirement we are making is that all the characters in the range
of this function must correspond to characters in ISO 10646; this is a very
liberal requirement that seems to be true of nearly all real character
encodings in use. (Even if it's not true we can't do much better here!)
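One way to picture this requirement is to list the set of characters an
eight-bit charset can produce and confirm each has an ISO 10646 position.
The sketch below is only an approximation: the helper name repertoire is
made up for this example, and Python's codec tables and Unicode code
points are assumed to stand in for the charsets and for ISO 10646.

```python
def repertoire(charset):
    """Return the set of code points reachable from single octets.

    Hypothetical helper for illustration; octets the codec leaves
    undefined are simply skipped.
    """
    code_points = set()
    for value in range(256):
        try:
            code_points.add(ord(bytes([value]).decode(charset)))
        except UnicodeDecodeError:
            pass  # this octet is unassigned in the charset
    return code_points

latin1 = repertoire("iso-8859-1")
greek = repertoire("iso-8859-7")

# Latin-1 covers exactly the first 256 positions.
print(sorted(latin1) == list(range(256)))

# The Greek charset reaches positions outside that range, but every
# character it produces still has a position of its own.
print(any(cp > 255 for cp in greek))
```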

We are also requiring that numeric references in HTML be interpreted
according to the corresponding positions in ISO 10646, NOT the position in
the current HTTP character encoding or some other miscellaneous character
set.

This makes the SGML nicer and ensures that numeric references will be
consistent across all encodings. This is the real significance of talking
about using an SGML document character set that is a subset of ISO 10646;
it has little to do with the HTTP encoding.
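To make the point about numeric references concrete, here is a small
sketch. The names NUMERIC_REF and render are invented for this example,
only decimal references are handled, and Python's code points are assumed
to match ISO 10646 positions. It is not how any particular browser works.

```python
import re

NUMERIC_REF = re.compile(r"&#([0-9]+);")

def render(wire_octets, charset):
    """Decode octets per the charset label, then resolve &#NNN; against
    ISO 10646 positions, never against the wire encoding."""
    text = wire_octets.decode(charset)
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)

# A document holding a reference to position 233 plus a literal Greek iota.
source = "&#233; and \u03b9"

greek_wire = source.encode("iso-8859-7")  # iota travels as octet 0xE9
utf8_wire = source.encode("utf-8")        # iota travels as two octets

# Both transfers yield the same characters.
print(render(greek_wire, "iso-8859-7") == render(utf8_wire, "utf-8"))
```

In the ISO 8859-7 transfer, octet 233 (0xE9) means iota, yet &#233; still
resolves to e-acute, because the reference is read against ISO 10646 and
not against the encoding in use.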

We are not addressing all the questions of what it means to support a
subset of ISO 10646 here, just saying "if you want to go beyond Latin-1,
play by these rules...".

---
    Albert Lunde                      Albert-Lunde@nwu.edu