Revised language on: ISO/IEC 10646 as Document Character Set

Dan Connolly (
Fri, 5 May 95 21:45:02 EDT

Martin J. Duerst writes:
> >>
> >> |HTML Lexical Syntax
> >> |
> >> | ... A minimally conforming HTML user agent must support the SGML
> >> | declaration in section SGML Declaration for HTML, which specifies ISO
> >> | Latin 1 (@@full name) as the document character set; it may support
> >> | other SGML declarations, in particular, SGML declarations with other
> >> | document character sets.
> Why not write it like this (another compromise):
> "in particular, SGML declarations with ISO10646 as the document
> character set."

Right. Try the latest version on for size:

Blech. Lemme try again.... OK. That's better.

I moved this discussion into the conformance section (it took me a
while to find it where it used to be: under "Lexical syntax"). That
way, the "Character Content" and "Document representation" parts don't
have to change if/when we revise the whole thing or excerpt parts for
other documents.

I actually make ISO10646 a binding constraint without putting it
in the public text (the SGML declaration). See what you think:
|A document is a conforming HTML document only if:
|Its document character set includes ISO-8859-1 and agrees with
|ISO10646; that is, each code position listed in section The ISO-8859-1
|Coded Character Set is included, and each code position in the
|document character set is mapped to the same character as ISO10646
|designates for that code position. (1)
|The document character set is somewhat independent of the character
|encoding scheme used to represent a document. For example, the
|ISO-2022-JP character encoding scheme can be used for HTML documents,
|since its repertoire is a subset of the ISO10646 repertoire. The
|critical distinction is that numeric character references agree with
|ISO10646 regardless of how the document is encoded.
|User Agents
|An HTML user agent conforms to this specification if:
|It supports the ISO-8859-1 character encoding scheme, and processes
|each character in the ISO Latin Alphabet Nr. 1 as specified in section
|The ISO Latin 1 Character Repertoire. (3)
|To support non-Western writing systems, HTML user agents should
|support the Unicode-1-1-UTF-8 and Unicode-1-1-UCS-2 encodings and as
|much of the character repertoire of ISO10646 as possible.
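
To make the two constraints in the draft text concrete, here is a rough
Python sketch; it is an illustration, not part of the draft. It assumes
Python's chr() is an acceptable stand-in for "the character ISO10646
designates for that code position", and that the Latin-1 positions of
interest are the graphic ones (32-126 and 160-255). The names
agrees_with_iso10646 and resolve_numeric_refs are made up for this
example.

```python
import re

# Assumption: the graphic code positions of ISO-8859-1.
LATIN1_POSITIONS = list(range(32, 127)) + list(range(160, 256))

def agrees_with_iso10646(charset):
    """charset maps code position (int) -> character (str).

    True only if every Latin-1 position is present and every position
    maps to the character ISO10646 assigns to that same position
    (approximated here by chr()).
    """
    includes_latin1 = all(p in charset for p in LATIN1_POSITIONS)
    same_mapping = all(ch == chr(p) for p, ch in charset.items())
    return includes_latin1 and same_mapping

# Decimal numeric character references such as &#233;
NUMERIC_REF = re.compile(r"&#([0-9]+);")

def resolve_numeric_refs(raw, encoding):
    """Decode raw bytes with the given encoding scheme, then expand
    numeric character references against ISO10646 code positions.
    The number in the reference never depends on the encoding."""
    text = raw.decode(encoding)
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)
```

With this sketch, the same reference comes out as the same character
whether the bytes arrived as ISO-8859-1 or ISO-2022-JP, e.g.
resolve_numeric_refs("caf&#233;".encode("iso2022_jp"), "iso2022_jp")
and the ISO-8859-1 equivalent both expand &#233; to LATIN SMALL LETTER E
WITH ACUTE. A character set that reassigns a Latin-1 position (say, the
euro sign at 164) fails agrees_with_iso10646.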

How's that for a compromise?

(note that the text and postscript versions are a bit out of date
right now...)