Re: HTML Character Representation/Transmission Model

Glenn Adams (glenn@stonehand.com)
Tue, 11 Apr 95 09:00:37 EDT

Lou,

The key to this proposal is that specifying 10646 as a universal
HTML document character set is, in general, simply an editorial
change w.r.t. current practice (as Chris Lilley points out).

First, this change does not require that 10646 be used in any storage
object; that is, the current practice of using 8859-1 could continue
without change.

Second, all numeric character references in the range 0 - 255 would
refer to both 10646 characters in this range and also 8859-1 characters
in this range since the first 256 code positions of 10646 *are* precisely
the same as 8859-1 in terms of their character assignments (and formal
identities).
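
As a rough illustration of the point above (not part of the proposal
itself), the sketch below checks that every 8859-1 byte value decodes to
the character at the same code position. It assumes Python's built-in
"latin-1" codec as a stand-in for 8859-1 and Python's code points as a
stand-in for 10646 code positions; the function name is hypothetical.

```python
def latin1_matches_10646() -> bool:
    """Return True if each 8859-1 byte decodes to the same code position."""
    for code in range(256):
        # Decode a single byte as 8859-1 (Python calls the codec "latin-1").
        decoded = bytes([code]).decode("latin-1")
        # ord() gives the code position of the resulting character,
        # used here as a stand-in for the 10646 character number.
        if ord(decoded) != code:
            return False
    return True


if __name__ == "__main__":
    print(latin1_matches_10646())
```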

Finally, if a numeric character reference specifies a character number
greater than 255, then such a character could be interpreted in some
default fashion or not interpreted as a character at all on existing
8859-1 systems.
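
One way an existing 8859-1 system might apply such a default is sketched
below. This is only an assumption about how an implementation could
behave: the function name, the "?" fallback, and the restriction to
decimal references of the form &#N; are all illustrative choices, not
anything specified here.

```python
import re

# Matches decimal numeric character references such as &#233;
_NUMERIC_REF = re.compile(r"&#([0-9]+);")


def resolve_refs_latin1(text: str, fallback: str = "?") -> str:
    """Resolve numeric character references on an 8859-1-only system.

    References in the range 0-255 become the corresponding character.
    Anything above 255 is replaced by `fallback` (an assumed default;
    pass "" to drop such references entirely).
    """
    def replace(match: re.Match) -> str:
        number = int(match.group(1))
        if number <= 255:
            # Same character number in both 8859-1 and 10646.
            return chr(number)
        # Not representable on an 8859-1 system: apply the default.
        return fallback

    return _NUMERIC_REF.sub(replace, text)
```

For example, resolve_refs_latin1("caf&#233;") yields the word with a
Latin small letter e with acute, while a reference such as &#945; is
replaced by the fallback.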

The key to successfully making this change would be to limit the
"significant SGML characters" in the default SGML declaration to
only those characters of 10646 which are also found in 8859-1. The
remaining characters of 10646 would thus be treated as "dedicated
data characters" (of class DATACHAR).

Gavin's attempts to use an SGML declaration which admits non-8859-1
characters for use in markup would have to be deferred to a later date.

The real effects of this change would be to:

(1) rationalize the use of numeric character references in a universal
fashion (at least normatively speaking)

(2) provide a significant growth path to HTML applications that wish
to begin exploiting non-Western European (8859-1) language capabilities,
and do so in a standard fashion

(3) facilitate the use of DSSSL Lite and DSSSL (ISO/IEC 10179), which
require that all characters be expressible in terms of ISO/IEC 10646

(4) provide more consistency with newly developed national standards;
e.g.,

JIS X 0221 = Japanese National Standard based on ISO/IEC 10646
GB 13000 = Chinese National Standard based on ISO/IEC 10646
ECMA ? = upcoming ECMA standard based on ISO/IEC 10646
etc.

(5) finally, this change would *not* necessarily change current
behavior or practice

Regards,
Glenn