Re: ISO/IEC 10646 as Document Character Set

James Clark (jjc@jclark.com)
Sat, 20 May 95 09:38:22 EDT

> Date: Wed, 17 May 95 01:05:33 EDT
> From: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU>

> However, Glenn's quote from the SGML Handbook, p. 487, section
> 15.6 System Declaration:
>
> "A system declaration is the complement of an SGML declaration.
> While an SGML declaration identifies the features that a parser
> requires in order to deal with a particular document, the system
> declaration identifies the set of SGML declarations that a system
> can deal with."
>
> would indicate (to me) that an ordinary BASESET of ISO/IEC 10646
> requires that conforming parsers be capable of dealing with characters
> greater than 255.

This is not correct. The system declaration would only be relevant if
HTML required that conforming HTML user agents be conforming SGML
systems.

Using ISO/IEC 10646 as the document character set in the SGML
declaration means that a conforming HTML document can be allowed to
contain bit combinations > 255. The HTML spec is free to say that a
conforming HTML implementation need not be able to process conforming
HTML documents containing bit combinations > 255.

> Therefore, I would not mind changing the 2.0 declaration to specify
> ISO/IEC 10646 as the BASESET, but only if characters > 255 are
> marked as UNUSED.

That would be totally bogus. In SGML,

BASESET "ISO Registration Number 176//CHARSET
ISO/IEC 10646-1:1993 UCS-2 with implementation level 3//ESC 2/5 2/15 4/5"

DESCSET 128 32 UNUSED
160 96 160
256 65280 UNUSED

means exactly the same as:

BASESET "ISO Registration Number 100//CHARSET
ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"

DESCSET 128 32 UNUSED
160 96 32
256 65280 UNUSED

It would also mean that the SGML declaration would be assigning no
meaning to numeric character references > 255; in particular, it would
*not* be saying that they are to be interpreted in ISO/IEC 10646.
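
To make the equivalence concrete, here is a rough Python sketch of how a
DESCSET resolves against its BASESET. It is only a toy model, not how any
SGML parser works: the names (resolve, iso10646, ecma94_right) are made up
for illustration. It assumes the 10646-based description maps 160-255 onto
base numbers 160-255, and that position 32 + k of the 96-character set
with registration number 100 is the character at 160 + k in ISO 8859-1.

```python
# Toy model of DESCSET resolution; all names here are hypothetical.
UNUSED = None


def iso10646(n):
    # In ISO/IEC 10646, base set number n is the character with that code.
    return chr(n)


def ecma94_right(n):
    # Registration 100 is a 96-character set: position 32 + k corresponds
    # to the ISO 8859-1 character at 160 + k (an assumption of this sketch).
    return chr(n + 128)


def resolve(base, descset):
    """Expand (start, count, base_start or UNUSED) triples into a table."""
    table = {}
    for start, count, base_start in descset:
        for i in range(count):
            if base_start is UNUSED:
                table[start + i] = UNUSED
            else:
                table[start + i] = base(base_start + i)
    return table


with_10646 = resolve(
    iso10646,
    [(128, 32, UNUSED), (160, 96, 160), (256, 65280, UNUSED)],
)
with_latin1 = resolve(
    ecma94_right,
    [(128, 32, UNUSED), (160, 96, 32), (256, 65280, UNUSED)],
)

# Both declarations describe the same document character set.
print(with_10646 == with_latin1)

# 233 has a meaning; 8364 is UNUSED, so a numeric character reference
# to it is assigned no meaning by either declaration.
print(repr(with_10646[233]), with_10646[8364])
```

In this model the name of the base set drops out entirely once everything
above 255 is marked UNUSED, which is the point being made above.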

Also, only those bit combinations that have been declared in the
document character set with a meaning other than UNUSED can occur
directly in SGML entities. This means that, with your suggested SGML
declaration, no conforming HTML document could directly represent
(that is, represent without some sort of character or entity reference
or code extension technique) any character outside of ISO 8859-1.
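
As a rough illustration of that last restriction, the following Python
sketch checks whether every character in a piece of text has a number that
the declaration gives a meaning to. The function name and the DECLARED set
are hypothetical. It assumes numbers 0-127 are all declared by a separate
base set not shown in this message (a real declaration would also mark
some control characters UNUSED).

```python
# Character numbers given a meaning under the suggested declaration:
# 160-255 from the DESCSET above, plus 0-127 assumed declared elsewhere.
DECLARED = set(range(0, 128)) | set(range(160, 256))


def first_undeclared(text, declared=DECLARED):
    """Return the first character that could not occur directly, or None."""
    for ch in text:
        if ord(ch) not in declared:
            return ch
    return None


# Characters within ISO 8859-1 can occur directly.
print(first_undeclared("caf\u00e9"))

# U+20AC lies above 255, so it could not occur directly.
print(repr(first_undeclared("price: \u20ac5")))
```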

James