Re: ISO/IEC 10646 as Document Character Set

James Clark (jjc@jclark.com)
Mon, 8 May 95 08:09:56 EDT

> Date: Mon, 8 May 95 02:39:42 EDT
> From: Gavin Nicol <gtn@ebt.com>

There's really no need to worry about the system character set.

> NOTE--It is recognized that the recipient of a document must be able
> to translate it to his system character set before the document can
> be processed by machine. There are two basic approaches to
> communicating this information.
> . . .
> As the last note implies, the document character set parameter is
> ignored by the SGML parser because the document is already in the
> document character set. The parameter is intended for a human to
> read in printed form, in order to determine how to translate an
> incoming document to the local system character set.
>
> The actual translation process from a document character set to the
> system character set is not defined, so we have 2 ways to interpret
> these notes:
>
> 1) That all characters in the document must also be available to the
> system, and a simple one-to-one translation is performed.
> 2) That the translation process can perform arbitrary translations.
>
> Also, the system representation is undefined, providing another grey
> area.

When the SGML standard was written, I believe it was anticipated that
a particular SGML parser would be able to handle documents in only one
character set, and that a document in some other character set could
only be parsed once it had been converted into this character set.
There is no requirement that an SGML system be able to do this
conversion: the conversion process might consist of a human with a hex
editor reading a hardcopy of the SGML declaration. A conforming SGML
system is only required to be able to parse documents whose document
character set is compatible with its system character set: see 15.3.2.

Several modern SGML parsers (SP and, I believe, YASP and Exoterica's
parser) can do rather better than this: they can adapt themselves to
the document character set described in the SGML declaration. In
effect they have a variable system character set rather than a single
fixed system character set. This means their capabilities cannot be
fully described with a system declaration.
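
As a rough illustration of what "adapting to the document character
set" can mean, here is a minimal Python sketch. It is not taken from
SP or any other parser. It assumes a simplified declaration whose
described character set portion is a list of (start, count, base)
triples with a single base set; the names build_charset_map and
descset and the sample values are hypothetical.

```python
# Marker for character numbers that the declaration describes as UNUSED.
UNUSED = None

def build_charset_map(descset):
    """Map document character numbers to base set character numbers.

    descset is a list of (start, count, base) triples, where base is
    either an int (first base set character number) or UNUSED.
    Assumption: one base set, no literal descriptions.
    """
    mapping = {}
    for start, count, base in descset:
        for offset in range(count):
            number = start + offset
            mapping[number] = UNUSED if base is UNUSED else base + offset
    return mapping

# Hypothetical description covering character numbers 0 through 127.
descset = [
    (0, 9, UNUSED),
    (9, 2, 9),
    (11, 2, UNUSED),
    (13, 1, 13),
    (14, 18, UNUSED),
    (32, 95, 32),
    (127, 1, UNUSED),
]
charset_map = build_charset_map(descset)
```

A parser that builds such a table from each declaration it reads has,
in effect, a different system character set per document, which is why
a single fixed system declaration cannot describe it fully.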

To go back to Glenn's question:

> >2. What kind of error should be reported upon an occurrence of a numeric
> >character reference which contains a character number which *is*
> >described by the document character set (by reference to a base set
> >character number) but which *is not* described by the system character
> >set? Or which is described by the (formally specified) system character
> >set but which has no bit combination in the (actually implemented)
> >system character set?

The system doesn't have to report any error at all. A validating SGML system
only has to report an error if the document is not conforming (and not
always then because of 4.267). The system declaration determines
which conforming SGML documents the system is required to be able to
parse; it doesn't affect which documents are conforming.

If the problem is that the system cannot render such a character, it
can just say so (if it wants to). There is no SGML-conformance issue
here.

If the problem is that the system cannot correctly parse the document
because of such a numeric character reference, then it can just say:
"sorry, the document character set of this document is not compatible
with my system character set; I tried to parse the document anyway,
but now I've found I can't". In this case the vendor should supply a
system declaration that declares as UNUSED those character numbers in
ISO 10646 that it cannot handle.
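
The last case can be sketched in Python as follows. This is only an
illustration under stated assumptions: the document character set is
ISO 10646, the character number is already known to be described by
it, and system_handled stands for the character numbers the system
declaration does not mark UNUSED. The names resolve_char_ref,
system_handled and IncompatibleCharacterSet are hypothetical.

```python
class IncompatibleCharacterSet(Exception):
    """The system cannot handle this document's character set.

    Raising this says nothing about whether the document conforms.
    """

def resolve_char_ref(number, system_handled):
    """Resolve a numeric character reference such as &#65;.

    Assumption: number is described by the document character set,
    so failure here is a system limitation, not a document error.
    """
    if number not in system_handled:
        raise IncompatibleCharacterSet(
            "sorry, the document character set of this document is not "
            "compatible with my system character set (character number "
            f"{number})"
        )
    return chr(number)

# Hypothetical system that handles only character numbers 0 to 255.
system_handled = set(range(256))
letter = resolve_char_ref(65, system_handled)
```

The point of the separate exception type is that the failure is
reported as a limitation of the system, not as an error in the
document.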

James