Re: Revised language on: ISO/IEC 10646 -- another proposal

Bert Bos (
Fri, 12 May 95 06:16:26 EDT

Albert Lunde <> writes:

|My proposal:
|= = =
|A document is a conforming HTML document only if:
|Its document character set includes ISO-8859-1 and agrees with ISO10646 for
|all characters and code positions that they have in common. That is:
|1) each code position listed in section The ISO-8859-1 Coded Character Set
|is included.
|2) All code positions that are used in the document character set and are
|also used in ISO10646 must map to the same characters as they map to in
|ISO10646.
|3) All characters that are in the intersection of the character repertoires
|of the document character set and ISO10646 must be mapped to by at least
|one code position used in ISO10646.
|= = =
|Optionally, as an explanatory note:
|= = =
|ISO10646 is used in this way to provide a consistent SGML interpretation of
|numeric character references over a large range of characters and encoding
|schemes. These conditions place very little constraint on the character
|encoding (specified by the MIME charset parameter in HTTP, or by other
|external means in other contexts).
|This standard does not exclude the use of a document character set
|containing characters not in ISO10646, but it does not completely specify
|how to choose code positions for such characters. Use of numeric references
|to such characters may therefore raise problems of interoperability outside
|the scope of this document.
|= = =
|I haven't tried to rewrite Dan's other notes to be consistent with this.
|I would like comments on whether this addresses the various objections.
|It seems to me this preserves a couple of properties of Dan's proposal:
|- It allows the use of ISO-8859-1 as a document character set.
|- Numeric references that refer to code positions in ISO10646 must map to
|the same characters as ISO10646.
|- Any character in ISO10646 that's in the document character set can be
|translated to some ISO10646 numeric reference.
|On the other hand, it allows the construction of document character sets
|that are supersets and extensions of ISO10646 by adding code positions
|beyond its range or using unused positions.
|We may want to add some further condition that the document character set
|only uses additional code positions which are "safe" in some sense of what
|ISO10646 has designated for private use vs. future expansion. I don't know
|enough about ISO10646 to word this correctly.

I don't quite understand. The points (2) and (3) above seem to
conflict. If I try to reformulate the explanation in my own words:

1. All characters used in the document that are also in the ISO
8859-1 repertoire must have the same code numbers as in ISO
8859-1.

2. All characters used in the document that happen to be referred to
by a numeric character reference (NCR) must have the same code
numbers as in ISO 10646.

3. Any remaining characters (i.e., those not in the ISO 8859-1
repertoire and never occurring in the form of an NCR) may have
arbitrary codes >255.

Presumably the charset parameter of HTTP will be used to identify the
mapping of (3).
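
To make point (2) concrete, here is a rough Python sketch of what a
checker would have to do. It is only an illustration under stated
assumptions: the document character set is modelled as a dict from code
number to character name, Python's unicodedata module stands in for the
ISO 10646 tables, and the function names are made up for this example.

```python
import re
import unicodedata

# Decimal numeric character references such as "&#945;".
NCR = re.compile(r"&#([0-9]+);")

def iso10646_name(code):
    """Character name at this code position, or None if the stand-in
    table (unicodedata) has no named character there."""
    if code > 0x10FFFF:
        return None
    return unicodedata.name(chr(code), None)

def ncr_conflicts(text, charset):
    """Return the NCR code numbers in `text` for which the document
    character set (hypothetical dict: code -> character name) does not
    agree with ISO 10646."""
    conflicts = []
    for code in sorted({int(n) for n in NCR.findall(text)}):
        expected = iso10646_name(code)
        if expected is not None and charset.get(code) != expected:
            conflicts.append(code)
    return conflicts
```

With alpha at its ISO 10646 position, ncr_conflicts("&#945;", {945:
"GREEK SMALL LETTER ALPHA"}) reports nothing. Put the same character at
code 300 and reference it as &#300;, and code 300 is reported.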

If this is correct, then this is awful! Where are clients going to get
the mapping tables needed for (3)? Moreover, I don't see any practical
way for the author to satisfy (2), except in the two trivial ways:
either use ISO 10646, or don't use NCRs.
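
For comparison, the three conditions of the quoted proposal can be
sketched as a check too. Again this is a sketch, not a definitive
reading: the character set is a dict from code position to character
name, unicodedata stands in for ISO 10646 (so private-use and unassigned
positions count as "not used"), the ISO-8859-1 positions are assumed to
be the graphic ranges 32-126 and 160-255, and all names are invented for
this example.

```python
import unicodedata

# Assumed ISO-8859-1 graphic code positions (32-126 and 160-255).
LATIN1_GRAPHIC = [*range(0x20, 0x7F), *range(0xA0, 0x100)]

def iso10646_name(code):
    """Character name at this code position, or None if unused."""
    if not 0 <= code <= 0x10FFFF:
        return None
    return unicodedata.name(chr(code), None)

def in_iso10646(name):
    """True if the stand-in table knows a character by this name."""
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False

def check_proposal(charset):
    """Return a list of (condition, offender) pairs for a hypothetical
    document character set given as dict: code position -> name."""
    problems = []
    # (1) every ISO-8859-1 code position is included
    for code in LATIN1_GRAPHIC:
        if code not in charset:
            problems.append((1, code))
    # (2) positions also used in ISO 10646 map to the same character
    for code, name in sorted(charset.items()):
        expected = iso10646_name(code)
        if expected is not None and name != expected:
            problems.append((2, code))
    # (3) shared characters are reachable from a position used in ISO 10646
    for name in sorted(set(charset.values())):
        if in_iso10646(name) and not any(
            n == name and iso10646_name(c) is not None
            for c, n in charset.items()
        ):
            problems.append((3, name))
    return problems
```

In this model a character outside ISO 10646 can sit at any unused
position, but a character that is in ISO 10646 passes (2) and (3)
together only when it sits at its own ISO 10646 position.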


                          Bert Bos                      Alfa-informatica
                 <>           Rijksuniversiteit Groningen
    <>     Postbus 716, NL-9700 AS GRONINGEN