HTML Character Representation/Transmission Model

Glenn Adams (glenn@stonehand.com)
Tue, 11 Apr 95 00:38:16 EDT

Larry,

I've thought about the subject a bit more and have developed
the following model for the treatment of character data in
the HTML world. I hope this helps shed light on things a bit.
I've altered my recent thinking somewhat about Gavin's proposal,
and now find myself agreeing with him, at least to the extent
that he proposed a universal document character set.

In the following discussion, "character set" is used to mean
"coded character set" as distinct from "character repertoire".

The general model of HTML document storage and transmission between
a server and client is as follows:

    Data Form            Encoding

    SERVER'S ENTITY      (DCS)  = document character set
          |
          |
          V
    SERVER'S             (SCS)  = storage character set
    STORAGE OBJECT
          |
          |
          V
    INTERCHANGED         (TCS)  = transmission character set
    STORAGE OBJECT
          |
          |
          V
    CLIENT'S             (SCS')
    STORAGE OBJECT
          |
          |
          V
    CLIENT'S ENTITY      (DCS')

Two of these data forms are more abstract than the others; namely,
the data forms: SERVER'S ENTITY and CLIENT'S ENTITY. In the context
of HTML, these entities are coterminal with documents, since, in
current practice, an HTML document is not typically fragmented into
multiple entities (although SGML admits to such fragmentation).

The SGML notion of a "document character set" as specified by an
SGML declaration applies only to these two entities, thus, DCS and
DCS' are expressed in terms of the SGML notion of a document
character set.

Most current HTML applications assume a document character set rather
than explicitly specify it by means of an SGML declaration. The assumed
document character set is currently ISO 8859-1.

According to ISO 8879:1986, numeric character references are interpreted
in the context of an entity. Therefore, the character number which
these references resolve to must be interpreted either according to
DCS or DCS'.

SGML allows for DCS and DCS' to be different. In this case, a numeric
character reference (=N) in the SERVER'S ENTITY must be translated
to a numeric character reference (=N') in the CLIENT'S ENTITY such that, if
N is the character number (i.e., code position) contained in a numeric
character reference and CCS(N) is the character C which maps to code
position N in the coded character set CCS, then:

C' = C <==> DCS'(N') = DCS(N)

For example,

C' = LATIN SMALL LETTER DOTLESS I
C = SMALL LETTER i WITHOUT DOT ABOVE

requires that, given,

DCS = ISO 8859-3
DCS' = ISO/IEC 10646-1:1993
N = 185 (= 0xB9)

then

N' = 305 (= 0x0131)

Note that we begin with the identification of two differently named
characters which, in our eyes (at least for this example), are equivalent:

LATIN SMALL LETTER DOTLESS I
SMALL LETTER i WITHOUT DOT ABOVE

and work from this identification back to the character numbers which
ensure such identification.

[I explicitly chose a character which had two distinct names in two
distinct standards. A character can only be identified as unique within
the scope of a single character repertoire since that is what defines
the particular domain of names. The fact that we have chosen to identify
these distinctly named characters is based on customary usage and
practice; that is, the character whose name is LATIN SMALL LETTER DOTLESS I
in ISO/IEC 10646 is conventionally used in the same fashion as the
character whose name is SMALL LETTER i WITHOUT DOT ABOVE in ISO 8859-3.]
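
The translation from N to N' in the example above can be sketched in a
few lines of Python. This is only an illustration, not part of any
specification: the function name is hypothetical, and it assumes that
Python's "iso8859_3" codec reflects the customary identification of
characters described above and that Python's code points coincide with
ISO/IEC 10646 code positions.

```python
def translate_char_number(n: int, source_charset: str) -> int:
    """Map code position n of an 8-bit charset to its ISO/IEC 10646 position.

    Hypothetical helper: it relies on Python's codec tables to supply the
    identification between the two character repertoires.
    """
    # Decode the single byte under the source coded character set, then
    # read back the code point of the resulting character.
    character = bytes([n]).decode(source_charset)
    return ord(character)


n_prime = translate_char_number(185, "iso8859_3")
print(n_prime, hex(n_prime))  # prints: 305 0x131
```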

Now, even though SGML allows DCS and DCS' to be distinct, the current
practice among HTML applications is that DCS and DCS' are identical
and equal to ISO 8859-1. In this regard, a numeric character reference
in either a SERVER'S ENTITY or a CLIENT'S ENTITY must refer to the
same character which must be in ISO 8859-1.

If a single document character set (DCS) is required by the HTML
standard, then a more expressive character set may be desirable, such
as ISO/IEC 10646. Note that such a decision does not affect the actual
character set used by the storage objects used to store or transmit
HTML documents (as SGML entities). Furthermore, it is a natural
extension of ISO 8859-1, since ISO 8859-1 is a proper subset of
ISO/IEC 10646 in its repertoire, its code set, and its assigned
mappings.
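
The subset relationship can be checked mechanically. The sketch below
assumes Python's "iso8859_1" codec as a stand-in for ISO 8859-1 (the
codec also maps the control-code positions, which the standard itself
leaves to other standards) and treats Python code points as ISO/IEC
10646 code positions. The function name is hypothetical.

```python
def latin1_matches_10646() -> bool:
    """Check that every 8-bit position n decodes to code position n."""
    return all(
        ord(bytes([n]).decode("iso8859_1")) == n
        for n in range(256)
    )


print(latin1_matches_10646())  # prints: True
```

In other words, a numeric character reference that is valid today under
ISO 8859-1 would keep the same number under ISO/IEC 10646.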

If on the other hand, multiple document character sets are permitted
by the HTML standard, then not only must character set translation
occur (as may already be the case when translating between storage
objects), but, in addition, translation of numeric character references
must occur. This means that, using the example shown above, the numeric
character reference &#185; in a server entity which uses the document
character set ISO 8859-3 must be translated to the numeric character
reference &#305; in a client entity which uses the document character
set ISO/IEC 10646-1. According to the current definition of SGML in
ISO 8879:1986, this translation would have to occur prior to the client
entity being interpreted by the client application. One could, however,
with some loss of generality, defer this translation to the point where
the numeric character reference is resolved by the client. This would
require that the server communicate the document character set of the
server entity to the client, so that when the client resolves a numeric
character reference it first translates it as described above.
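
As a rough sketch of what such a translation step might look like, the
following Python rewrites decimal numeric character references from an
8-bit server document character set into ISO/IEC 10646 numbers. The
names NUMERIC_REF and translate_refs are hypothetical; the sketch handles
only the semicolon-terminated decimal form and leaves untouched any
reference it cannot map.

```python
import re

# Decimal numeric character references of the form &#185; (assumption:
# only the semicolon-terminated form is handled here).
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def translate_refs(text: str, server_dcs: str) -> str:
    """Rewrite references numbered in server_dcs as ISO/IEC 10646 numbers."""

    def replace(match: re.Match) -> str:
        n = int(match.group(1))
        if n > 255:
            # Outside an 8-bit coded character set; leave as found.
            return match.group(0)
        try:
            character = bytes([n]).decode(server_dcs)
        except UnicodeDecodeError:
            # Position unassigned in the server's character set.
            return match.group(0)
        return f"&#{ord(character)};"

    return NUMERIC_REF.sub(replace, text)


print(translate_refs("k&#185;z", "iso8859_3"))  # prints: k&#305;z
```

A client that defers translation would apply the same mapping at the
moment it resolves each reference, which is why it would need to be told
the server's document character set.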

----------------------

Given the above, a few comments are required:

1. current HTML documents use numeric character references in terms
of the storage object character set of the original entity and
not necessarily any particular document character set, i.e., not
in relationship to an entity but rather to a storage object.

2. the term "entity-body" as used by the HTTP spec to refer to the
body of a response message (as opposed to headers) is not to be
confused with an "entity" in the sense used by SGML; rather, it
should be treated as a storage object, and, in particular, an
interchange storage object.

3. the "charset" parameter to a "Content-Type" header which applies
to an "entity-body" does not specify the document character set of
the entity being transmitted, but rather, it specifies the character
encoding scheme of the interchange storage object being transmitted.
This may or may not be the same as the document character set of
the entity being transmitted through the storage object.
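
The distinction drawn in comment 3 can be illustrated with a small
Python sketch. It assumes, purely for illustration, that the document
character set is ISO/IEC 10646 while the interchange storage object is
encoded in ISO 8859-3; the function name receive is hypothetical, and
html.unescape (which also resolves named entities) stands in for a
client's reference resolution.

```python
import html


def receive(entity_body: bytes, charset: str) -> str:
    """Decode an interchanged storage object, then resolve references."""
    # Step 1: "charset" describes only the bytes of the storage object.
    text = entity_body.decode(charset)
    # Step 2: numeric character references are resolved against the
    # document character set (assumed here to be ISO/IEC 10646),
    # independently of the charset used in step 1.
    return html.unescape(text)


# The same character sent once as a raw byte (0xB9 in ISO 8859-3) and
# once as a numeric character reference in ISO/IEC 10646 terms.
body = "\u0131 = &#305;".encode("iso8859_3")
print(body)                               # prints: b'\xb9 = &#305;'
print(receive(body, "iso8859_3") == "\u0131 = \u0131")  # prints: True
```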

----------------------

After considering the above, I think that overall, it is probably
best to specify a single document character set and to make that
character set ISO/IEC 10646-1:1993. If it were not for the problem
of translating numeric character references, one could justify allowing
multiple document character sets. Further, one cannot specify that
all numeric character references refer to one document character set
(e.g., ISO/IEC 10646), and then specify another document character set
for the entity in which such references occur. At least one can't do
so and remain consistent with SGML.

In my previous mail on this subject, I was operating under the assumption
that one could have a numeric character reference refer to the character
set as used by an entity's storage object. Although this is actually quite
common in practice, it is non-conformant according to SGML. Thus I would
now agree with the proposal of G. Nicol et al. to use ISO/IEC 10646 as
a universal document character set for HTML. Note:

(1) this *does not* require that ISO/IEC 10646 actually be used by any
storage object, i.e., either by the source HTML file, the transmitted HTTP
"entity-body", or the HTML client.

(2) this *does* require that generating and interpreting HTML agents adhere
to the use of ISO/IEC 10646 code positions for all numeric character
references. Interpreting HTML agents which are unable to resolve such a
reference in terms of the system character set employed by the agent
should display either a missing glyph image, or make some other reasonable
presentation to the user; e.g., depict the reference as the characters which
compose the reference, etc.
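
One possible fallback along the lines of note (2) is sketched below in
Python. The function name present is hypothetical, and the choice of
depicting the reference as its component characters is just one of the
options mentioned above; a missing glyph image would be equally valid.

```python
def present(character: str, system_charset: str) -> str:
    """Return the character if the system charset has it, else its reference."""
    try:
        # Probe whether the agent's system character set can represent it.
        character.encode(system_charset)
        return character
    except UnicodeEncodeError:
        # Fall back to the characters which compose the reference.
        return f"&#{ord(character)};"


print(present("\u0131", "iso8859_1"))  # prints: &#305;
```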

Regards,
Glenn