Re: Revised language on: ISO/IEC 10646 as Document Character Set

Albert Lunde (
Wed, 10 May 95 23:42:23 EDT

> In any case, I agree with you that the HTML spec need not require the
> set of processable data characters to be limited to those in the document
> character set.

All right, let's step back and see how we got here...

An earlier draft had some language about other MIME charsets and
numeric references which raised some questions in review.

Parallel to this there was some discussion of ERCS and Unicode.

As a solution to the numeric references problem for other
MIME charsets (among other things), it was suggested that we use
a fixed document character set of ISO 10646.

We discussed putting this in the HTML 2.0 spec vs. writing a
statement of future direction, and Dan Connolly proposed
the present form as a way of keeping ISO-8859-1 in the
SGML declaration while constraining future development
to be consistent with the use of ISO 10646 as a document
character set for all character encodings.

It sounds like, while we may have solved the numeric references
problem for "small" character encodings, we are colliding with
other issues for "big" character encodings.

The language about "subsets" of ISO 10646 is there at least in
part to make it legal to use document character sets like
those based on the mapping into ISO 10646 of ISO-8859-X
and of course ISO-8859-1.
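
To make the "subset" idea concrete, here is a rough sketch that uses
Python's built-in codec tables as a stand-in for the ISO-8859-X to
ISO 10646 mappings (the choice of Python and the variable names are
mine, not anything from the draft). It checks that every ISO-8859-1
code position is numerically identical to its ISO 10646 code point,
and shows that this is not true of ISO-8859-2, where a document
character set "based on the mapping" would have to use the ISO 10646
number rather than the ISO-8859-2 code position.

```python
# ISO-8859-1: each code position equals its ISO 10646 code point,
# so it is a true numeric subset.
latin1_is_subset = all(
    ord(bytes([b]).decode("iso-8859-1")) == b for b in range(256)
)

# ISO-8859-2: position 0xA1 holds A-ogonek, which ISO 10646
# places at 0x104, so the numbers differ above 0xA0.
latin2_char = bytes([0xA1]).decode("iso-8859-2")

print(latin1_is_subset, hex(ord(latin2_char)))  # prints: True 0x104
```

So under this reading a numeric reference to A-ogonek would be &#260;
whatever byte the character encoding happens to use for it.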

If it is true, as a couple of people have asserted, that we
can have data characters in the character encoding scheme
that are not in the document character set, we may have
two other alternatives:

1) Say that the document character set for HTML 2.0 is ISO 10646
but that folks are only required to support the subset
ISO-8859-1 (and bring back the discussion of what support
for that subset means). Don't try to place binding constraints
on all future versions, but say that use of ISO 10646 as the
document character set is the direction we are going.

2) Say that developers may use any document character set
whose numeric references are consistent with ISO 10646 for all the characters
that the two sets have in common. This preserves the "nice" properties
we want for "small" character sets, while giving an escape hatch
for those who want to introduce characters not in Unicode.
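
The constraint in (2) could be checked mechanically. What follows is
only a sketch under my own assumptions: a document character set is
modelled as a table from numeric-reference value to character name,
Python's unicodedata module stands in for ISO 10646, and the function
name and both sample tables are hypothetical. Names that Unicode does
not know are treated as characters outside ISO 10646 and left
unconstrained, which is the "escape hatch".

```python
import unicodedata

def inconsistent_references(doc_charset):
    """Return (number, name, ucs_number) for each shared character
    whose number disagrees with ISO 10646."""
    problems = []
    for number, name in doc_charset.items():
        try:
            ucs = ord(unicodedata.lookup(name))
        except KeyError:
            continue  # not in Unicode: no constraint applies
        if ucs != number:
            problems.append((number, name, ucs))
    return problems

# Hypothetical character sets, for illustration only.
consistent = {
    0xE9: "LATIN SMALL LETTER E WITH ACUTE",
    0x104: "LATIN CAPITAL LETTER A WITH OGONEK",
    0x110000: "HYPOTHETICAL LOCAL GLYPH",  # made-up name and number
}
inconsistent = {
    # Reuses the ISO-8859-2 code position instead of the UCS number.
    0xA1: "LATIN CAPITAL LETTER A WITH OGONEK",
}

print(inconsistent_references(consistent))
print(inconsistent_references(inconsistent))
```

The first table passes because every character it shares with
ISO 10646 carries the ISO 10646 number; the second fails because it
numbers a shared character by its encoding position.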

I'm not sure I know SGML and other jargon well enough to suggest a new
wording that is precise and also makes it clear that we are
not talking about making character encodings agree with Unicode,
only about making numeric references agree
(and hardly talking about MIME charset parameters at all).

But this second approach seems consistent with Dan's general direction
of trying to leave ISO 10646 out of the HTML 2.0 SGML declaration
but express explicit constraints on the next step.

Would either of these directions (together with tags for language
markup waiting in the wings) address the objections raised so far?

Are there other issues with them?

    Albert Lunde