Re: HTML 2.0 comments (First of two)

James D Mason (MASONJD@oax.a1.ornl.gov)
Wed, 23 Nov 94 16:30:40 EST

I am in general sympathy with the suggestion to support an internationalized
character set in HTML, but there are even more issues to be considered than
those in the posting.

There is nothing in SGML itself that prevents the use of alternative character
sets. James Clark's new parser supports both UTF-8 and UCS-2. Annex D of ISO
8879 describes mechanisms for supporting multibyte character sets (this is
not, however, a normative part of the standard). Numeric character references
of the form "&#nnn;" aren't a feature of HTML; they are part of SGML itself.
While there is no requirement that an HTML user agent be a conforming
application of ISO 8879, all the common conforming applications (e.g.,
parsers, editors) do support character references as a matter of course.
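
To illustrate how a reference of the form "&#nnn;" resolves to a character,
here is a minimal Python sketch. It is not taken from any SGML parser: the
function name expand_numeric_refs is hypothetical, only decimal references are
handled, and the numbers are assumed to be ISO/IEC 10646 code positions. In
real SGML the interpretation depends on the document character set given in
the SGML declaration.

```python
import re

# Matches decimal numeric character references such as "&#233;".
_NUMERIC_REF = re.compile(r"&#([0-9]+);")


def expand_numeric_refs(text):
    """Replace each decimal reference with the character at that code position.

    Assumption: the number is an ISO/IEC 10646 code position. A real SGML
    system would consult the document character set in the SGML declaration.
    """
    return _NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


sample = "caf&#233; &#38; cr&#232;me"
expanded = expand_numeric_refs(sample)
```

The same character can then be serialized in more than one way for
transmission, e.g. expanded.encode("utf-8") or, for characters in the basic
plane, expanded.encode("utf-16-be"), which matches UCS-2 there.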

There are actually two issues here: the character codes used in data
communications and the graphical objects associated with those codes by
rendering engines (e.g., HTML user agents). Character codes are defined by
standards such as ISO 646 and ISO/IEC 10646, produced by ISO/IEC JTC1/SC2. The
graphical objects associated with these codes are not necessarily fixed
things. Abstract graphical objects, called "glyphs", are defined in ISO/IEC
9541, and procedures for registering glyphs are defined by ISO/IEC 10036,
produced by ISO/IEC JTC1/SC18/WG8, the same folks who brought you SGML. The
Association for Font Information Interchange (AFII) maintains a registry of
glyph identifiers at Rochester Institute of Technology. AFII actually provided
the glyph images that were used to publish ISO/IEC 10646.

A fully developed system for interchanging and presenting textual documents
thus includes mechanisms for generating/accepting character codes, a set of
glyph images, and a means for mapping images onto character codes.
Traditionally the mapping has been a fixed one, built into hardware in the
case of ASCII terminals or built into operating systems or GUIs. There hasn't
generally been much of a mechanism for users to modify the mappings, other
than to change the system font or, for the truly adventurous, to rebuild a
PostScript encoding vector. However, ISO/IEC DIS 10179, DSSSL, provides
standardized ways of performing such mappings. Rendering the results of such
mappings is described in ISO/IEC 10180, SPDL, the ISO version of PostScript.
(Those standards are new, and thus not fully implemented, though implementations
are on the way.)
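
The separation between character codes and glyphs can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
glyph identifiers are made-up strings (not AFII registry values), the table
names and the function glyphs_for are hypothetical, and nothing here reflects
how DSSSL or SPDL actually express such mappings.

```python
# Hypothetical glyph identifiers; real AFII registry numbers are not used here.
ROMAN_MAPPING = {0x41: "glyph-A-roman", 0x61: "glyph-a-roman"}
ITALIC_MAPPING = {0x41: "glyph-A-italic", 0x61: "glyph-a-italic"}

# Identifier returned when a mapping has no entry for a character code.
FALLBACK_GLYPH = "glyph-missing"


def glyphs_for(text, mapping):
    """Map each character code in text to a glyph identifier.

    The character codes stay the same; only the chosen mapping decides
    which glyph is associated with each code.
    """
    return [mapping.get(ord(ch), FALLBACK_GLYPH) for ch in text]


roman = glyphs_for("Aa", ROMAN_MAPPING)
italic = glyphs_for("Aa", ITALIC_MAPPING)
```

The point of the sketch is that the interchanged data ("Aa") is identical in
both calls; swapping the mapping table changes the presentation without
touching the codes.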

The issues of rendering character codes are mostly of concern to the
implementors of user agents, who must either accept the facilities provided by
the operating systems or roll their own. HTML itself is concerned with these
matters directly only when it comes to defining a concrete syntax in its SGML
declaration and whether it wishes to provide support for character entity sets
outside ISOLatin1. Beyond that, it's mostly at the mercy of what content types
MIME supports and what the creators of user agents can be convinced to
support.

The question of internationalization of network facilities such as the WWW is
worthy of study. I've brought up this additional alphabet soup of standards
mostly to suggest that it is not a simple issue but rather one that deserves
careful consideration.

If anyone really wants to get into the complexities of character sets, glyph
sets, glyph identifiers, and mechanisms, I will be glad to provide contacts.

Dr. James D. Mason
(WG8 Convenor)
Oak Ridge National Laboratory
Information Management Services
Bldg. 2506, M.S. 6302, P.O. Box 2008
Oak Ridge, TN 37831-6302 U.S.A.
Telephone: +1 615 574-6973
Facsimile: +1 615 574-6983
Network: masonjd @ ornl.gov