Re: ISO/IEC 10646 as Document Character Set

Dan Connolly (connolly@w3.org)
Fri, 5 May 95 17:02:40 EDT

Glenn Adams writes:
>
> What are you talking about? I can put &#2789 in a document today
> without violating SGML conformance. The appearance of non-SGML
> characters in a document is not a reportable markup error (according
> to ISO 8879 4.267), and, therefore, does not produce non-conformance.

This is news to me. sgmls says:

sgmls: SGML error at -, line 8 at ";":
Numeric character reference exceeds 255; reference ignored
sgmls: SGML error at -, line 9 at ";":
Numeric character reference exceeds 255; reference ignored
sgmls: SGML error at -, line 10 at ";":
Numeric character reference exceeds 255; reference ignored
sgmls: SGML error at -, line 11 at ";":
Numeric character reference exceeds 255; reference ignored

I don't know if sgmls is reporting an error in the document or an
internal limitation. The error message suggests that the error
is in the document.

Hang on: none of '&', '#', '2', '7', '8', nor '9' is a non-SGML
character. This is markup composed of SGML characters. The question
is: is it legal markup?

Section 9.5, "Character Reference" says that a numeric character
reference should be treated just like the character it references. But
if the number isn't in the domain of the document character set, what
character does the reference refer to? I'd say this is a reportable
markup error.
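
To make the question concrete, here is a minimal Python sketch of the
check I have in mind. It is not how sgmls is implemented; the name
check_references and the 0..255 bound (ISO-8859-1 as the document
character set) are my own assumptions for illustration. It scans text
for numeric character references and flags any whose number falls
outside the document character set.

```python
import re

# Assumption: the document character set is ISO-8859-1, so the valid
# code positions are 0..255. The bound is illustrative only.
DOC_CHARSET_MAX = 255

# A numeric character reference: "&#", decimal digits, optional ";".
NUMERIC_REF = re.compile(r"&#([0-9]+);?")


def check_references(text, max_code=DOC_CHARSET_MAX):
    """Return (line_number, number) for each reference above max_code."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in NUMERIC_REF.finditer(line):
            number = int(match.group(1))
            if number > max_code:
                problems.append((line_no, number))
    return problems


sample = "caf&#233;\n&#2789; is out of range\n"
for line_no, number in check_references(sample):
    print(f"line {line_no}: numeric character reference "
          f"{number} exceeds {DOC_CHARSET_MAX}")
```

Whether a flagged reference counts as a reportable markup error or
merely as a limit of the parser is exactly the open question; the
sketch only shows where the domain of the document character set
enters the picture.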

> As for "doesn't work",
> I'm not sure what you mean by "work". What does "work" mean?

Exactly. We all know what "work" means for each of the characters
in ISO-8859-1. It's listed explicitly in:

"The ISO-8859-1 Coded Character Set"
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_14.html#SEC87

and discussed in:

"The ISO Latin 1 Character Repertoire"
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_5.html#SEC32

I'm worried about what it will take to specify "work" for all of
ISO10646. I believe that specification is worthy of its own document.

> The issue is whether it breaks anything or changes the language.

The issue is also shared understanding. There is a large shared
understanding of how ISO-8859-1 and HTML relate. A comparable shared
understanding of how ISO10646 and HTML relate needs to be built over a
period of time.

Gavin's document makes a nice start at an internationalization
document. Folks should pick it up and implement it. Get a feel for the
issues. Hash it over. But separate this debate from the HTML 2.0
document, which was hashed out from some time in 1990 to May 1994.

Since then, we've fought hard to get the standardization process in
place without worrying about new features. New features need to get
specified, but not in the HTML 2.0 document. Non-western writing
systems are a new feature -- perhaps a whole new set of features.

I remain unconvinced that now is the time to put ISO10646 in HTML 2.0,
other than as an appendix or note.

Dan