Re: ISO/IEC 10646 as Document Character Set

Dan Connolly (connolly@w3.org)
Sat, 29 Apr 95 00:41:33 EDT

Erik van der Poel writes:
> I.e. is the charset allowed to be iso-2022-jp (or any other non-Latin-1
> and non-10646/Unicode charset), and are you still allowed to use 10646
> numeric entities within such documents?

Short answer: yes.

Long answer:

> Just to clarify things in my mind, would the following be allowed
> in your world?

OK. Good. I like specific examples. They tend to elucidate a lot of
subtleties.

> HTTP headers followed by HTML document:
>
> HTTP/1.0 200 OK
> Date: Saturday, 29-Apr-95 03:53:33 GMT
> Server: ...
> MIME-version: 1.0
> Content-Type: text/html; charset=iso-2022-jp
> Last-modified: Tuesday, 18-Apr-95 16:10:13 GMT
> Content-length: 15132
>
> <TITLE>...</TITLE>
> <BODY>
> Here is some normal text.
> Here is a 10646 numerical entity: &#23598732;.
> Here is some ISO-2022-JP text: ...
> </BODY>
>

OK... so what we have above is an HTTP response, which is a response
line followed by what's called (in MIME and HTTP) a message entity.
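
A rough sketch of that split in Python, for anyone who wants to poke at
it. The function name split_response, the CRLF line endings, and the
trimmed-down sample response are my own assumptions for illustration;
a real HTTP parser handles many more cases (folded headers, repeated
headers, bare LF).

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (response line, headers, entity body)."""
    # Assumption: the header block ends at the first empty line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    response_line, _, header_block = head.partition(b"\r\n")
    headers = {}
    for line in header_block.split(b"\r\n"):
        if not line:
            continue
        name, _, value = line.partition(b":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower().decode("ascii")] = value.strip().decode("ascii")
    return response_line.decode("ascii"), headers, body


# Hypothetical, shortened version of the response quoted above.
raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"MIME-version: 1.0\r\n"
    b"Content-Type: text/html; charset=iso-2022-jp\r\n"
    b"\r\n"
    b"<TITLE>...</TITLE>\r\n"
)
response_line, headers, body = split_response(raw)
```

The body stays as raw octets here; nothing is decoded until the
Content-Type has been consulted.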

To interpret the message entity, you look at the Content-Type. It
says "text/html". So you look at the HTML spec. My working draft (to
be released ASAP!) says:

|3.2 HTML Document Representation
|
| A message entity with a content type of "text/html" represents an HTML
| document, consisting of a single text entity. The charset parameter
| (whether implicit or explicit) identifies a character encoding. The
| text entity consists of the characters determined by this character
| encoding and the octets of the body of the message entity.

So we take the charset parameter, iso-2022-jp, and we use that
to map the octets of the body of the message entity to a sequence
of characters.
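
A minimal sketch of this step in Python, assuming the standard
library's iso-2022-jp codec is a fair stand-in for what a browser would
do. The helper name decode_body, the sample Japanese text standing in
for the '...', and the ISO-8859-1 fallback for a missing charset are
illustrative assumptions, not something the draft spells out.

```python
from email.message import Message


def decode_body(content_type: str, body: bytes) -> str:
    """Map the octets of a message body to characters via the charset parameter."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Assumption: treat a missing charset parameter as ISO-8859-1.
    charset = msg.get_content_charset("iso-8859-1")
    return body.decode(charset)


# Hypothetical body: sample Japanese text replaces the "..." in the example.
sample = (
    "Here is a 10646 numerical entity: &#23598732;.\n"
    "Here is some ISO-2022-JP text: \u65e5\u672c\u8a9e\n"
)
octets = sample.encode("iso-2022-jp")
text = decode_body("text/html; charset=iso-2022-jp", octets)
```

After this call, text holds characters rather than octets, and the
numeric reference is still sitting there undecoded as plain ASCII.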

During this step, the octets represented by '...' in:

> Here is some ISO-2022-JP text: ...

turn into characters. Nothing surprising happens to this line yet; the
reference is still just the literal characters "&#23598732;":

> Here is a 10646 numerical entity: &#23598732;.

OK. Now we have a text entity: a sequence of characters. To parse it
as per ISO 8879 (SGML), we need to know the document character set. In
the internationalization document (which I don't have handy... sorry)
we're specifying that the document character set for HTML is ISO 10646.

So to interpret:

> Here is a 10646 numerical entity: &#23598732;.

We look up 23598732 in the ISO10646 specification, and see what
character it maps to.
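
Here is a small sketch of that lookup in Python, run on text that has
already been decoded from its transport charset. The function name
resolve_numeric_references and the sample number 26085 are my own
assumptions. Note that Python's chr() only accepts values up to
0x10FFFF, so the number 23598732 from the example is out of its range;
this sketch leaves such references untouched rather than guessing.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # upper limit accepted by Python's chr()


def resolve_numeric_references(text: str) -> str:
    """Replace &#N; with the character numbered N in the document character set."""

    def replace(match):
        code = int(match.group(1))
        if code > MAX_CODE_POINT:
            # Out of range for chr(): keep the reference as written.
            return match.group(0)
        return chr(code)

    return re.sub(r"&#([0-9]+);", replace, text)


# Hypothetical input: 26085 is used because it is within chr()'s range.
resolved = resolve_numeric_references("Here is a 10646 numerical entity: &#26085;.")
```

The point is that the number is interpreted against the document
character set no matter which charset carried the octets over the wire.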

Simple, no?

> If this is allowed, I agree that this would be a good way to migrate
> to the Brave New World of 10646.

One by one, we're all coming to this very conclusion.

Dan