Re: Revised language on: ISO/IEC 10646 -- another proposal

Martin J Duerst (mduerst@ifi.unizh.ch)
Fri, 12 May 95 07:08:27 EDT

>
>Albert Lunde <Albert-Lunde@nwu.edu> writes:
>
> |My proposal:

In which a pretty disoriented Bert Bos saw all kinds of
inconsistencies and problems.

I don't know what makes the issues here so difficult for
some, but there are no major problems at all.

> |A document is a conforming HTML document only if:
> |[...]
> |Its document character set includes ISO-8859-1 and agrees with ISO10646 for
> |all characters and code positions that they have in common. That is:
> |
> |1) each code position listed in section The ISO-8859-1 Coded Character Set
> |is included.
> |
> |2) All code positions that are used in the document character set and are
> |also used in ISO10646 must map to the same characters as they map to in
> |ISO10646.
> |
> |3) All characters that are in the intersection of the character repertoires
> |of the document character set and ISO10646 must be mapped to by at least
> |one code position used in ISO10646.

>
>I don't quite understand. The points (2) and (3) above seem to
>conflict. If I try to reformulate the explanation in my own words:
>
> 1. All characters used in the document that are also in the ISO
> 8859-1 repertoire must have the same code numbers as in ISO
> 8859-1.
If they are represented as NCRs (numeric character references),
yes. Otherwise, they can have any code number and octet
representation whatsoever in the subset of the MIME "charset"
in which they are transmitted.

> 2. All characters used in the document that happen to be referred to
> by a numeric character entity must have the same code numbers as
> in ISO 10646.
If a transmitted character is represented by an NCR and is
contained in ISO 10646, the NCR has to use the character's
ISO 10646 code number.
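
As a rough illustration of this point (a Python sketch, not part of
the proposal text): the octet used for a character depends on the
MIME "charset", while the number in an NCR does not. The helper name
to_ncr and the choice of a Cyrillic letter with ISO-8859-5 and KOI8-R
are assumptions made only for this example.

```python
def to_ncr(ch: str) -> str:
    """Return the decimal numeric character reference for one character."""
    # ord() yields the Unicode / ISO 10646 code position of the character,
    # independent of any encoding used for transfer.
    return f"&#{ord(ch)};"


zhe = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE

# The octet on the wire differs between the two charsets ...
print(zhe.encode("iso8859_5"))
print(zhe.encode("koi8_r"))

# ... but the NCR carries the same ISO 10646 number in both cases.
print(to_ncr(zhe))
```

The reference printed on the last line is &#1046; whichever of the
two charsets the document is sent in.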

> 3. Any remaining characters (i.e., those not in the ISO 8859-1
> repertoire and never occurring in the form of an NCR), may have
> arbitrary codes >255.
Any remaining characters, i.e. those outside ISO 10646, may
occur as NCRs, but have to use numbers that don't collide
with ISO 10646, e.g. numbers from the private use area of the
BMP (Basic Multilingual Plane, the first 2^16 code positions).
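
A minimal sketch of how such numbers could be picked, assuming the
private use range is U+E000 to U+F8FF (the BMP private use area as
defined in Unicode; the proposal itself names no range). The names
is_bmp_private_use and private_ncr are hypothetical, and what each
private number means remains a private agreement.

```python
# Assumption: BMP private use area, U+E000..U+F8FF.
PRIVATE_USE_FIRST = 0xE000
PRIVATE_USE_LAST = 0xF8FF


def is_bmp_private_use(code: int) -> bool:
    """True if the code number lies in the assumed private use range."""
    return PRIVATE_USE_FIRST <= code <= PRIVATE_USE_LAST


def private_ncr(index: int) -> str:
    """Hypothetical helper: NCR for the index-th privately assigned character."""
    code = PRIVATE_USE_FIRST + index
    if not is_bmp_private_use(code):
        raise ValueError("index outside the BMP private use range")
    return f"&#{code};"


print(private_ncr(0))              # first private code number
print(is_bmp_private_use(0x0416))  # a standardized Cyrillic letter
```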

>Presumably the charset parameter of HTTP will be used to identify the
>mapping of (3).
As far as we are concerned, nothing is defined for (3).
It's a completely private business so far.

>If this is correct, then this is awful! Where are clients going to get
>the mapping tables needed for (3)?
Ordinary clients won't need them. ISO 10646 covers a very
large character set. And even if you don't restrict yourself
to the characters in ISO 10646, you won't need the tables as
long as these characters are encoded directly, without any NCR.
In these cases (assuming the server and browser can agree
on a common MIME "charset"), you can go directly from
the "charset" to the system encoding, and all the above is
just theoretical background.
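
A small sketch of that direct path, in Python and under the
assumption that both encodings are known to the codec library. The
function name transcode and the sample encodings are illustrative
only. Python happens to pass through a Unicode string internally,
which is a detail of this sketch and not a requirement on clients.

```python
def transcode(data: bytes, wire_charset: str, system_encoding: str) -> bytes:
    """Convert directly encoded text from the MIME charset to the local encoding."""
    # No NCR handling and no private mapping tables are involved:
    # the octets are interpreted by the agreed charset and re-encoded.
    text = data.decode(wire_charset)
    return text.encode(system_encoding)


# One ISO-8859-5 octet (CYRILLIC CAPITAL LETTER ZHE) re-encoded for the local system.
print(transcode(b"\xb6", "iso8859_5", "utf-8"))
```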

Regards, Martin.