Re: Character Set Terminology, SC2 vs. SC18 vs. Internet Standards

Glenn Adams (glenn@stonehand.com)
Sun, 9 Apr 95 14:33:36 EDT

Date: Sun, 9 Apr 1995 09:35:02 +0500
From: connolly@w3.org (Dan Connolly)

While the background is invaluable, I don't see the proposal as a
significant improvement in clarity over the existing published
verbiage.

I do not believe your current proposal is as clear as it should be (that is,
if you wish to shed light on a shadowy subject). See details below.

Comments on Gary's Proposal:

"Glenn" not "Gary".

> (b) character repertoire : a collection of distinct characters
>
> NOTE - two characters are distinct if and only if they have distinct
> names in the context of an identified character repertoire.

The word is _set_. "collection of distinct..." is just the kind of
dancing around established mathematical terminology that got us in the
mess we're in. And this stuff about "in the context of an identified ..."
is dangerously misleading.

This is not a "dance around mathematical terminology." The term "collection"
is different from "set". A "set" is "a collection of distinct elements having
specific common properties". Since we have not identified (and do not intend
to identify) a set of common properties, it is inaccurate to refer to it as
a set. Otherwise, we would have to establish criteria for membership, which
we have not done.

The note about distinctness being bound to an identified character repertoire
is a necessary explanation since we can only define distinctness in terms of
the members of a particular character repertoire. That is, we cannot define
the notion of distinctness in a universal fashion (at least without making
recourse to other semantic properties which haven't been introduced so far).
In particular, I can't say whether or not LATIN CAPITAL LETTER A in one
character repertoire is distinct from LATIN CAPITAL LETTER A in another
repertoire. [Note that this doesn't prevent one from identifying the two
by means of a mapping; it simply prevents doing so based on any ontological
characteristics of each element.]

The notion of character is primitive. The other definitions are built
on top of it. It's all well and good to map characters to names via
some widely published 1-1 function F, and to deduce from F(c1) = F(c2)
that c1 = c2. But to define characters in terms of their names is
absurd: what are the names made of, after all?

Au contraire. How would you define characters then? How would you define
distinctness? You cannot do so based on the notion of a representative image
of a character (i.e., you cannot establish a definition of a character
strictly on the basis of its "form"); nor can you do so on the basis of its
meaning (its "function"). The following counterexamples suffice:

(1) Disregarding linear scaling, the actual (and not even abstract) images of
ISO/IEC 10646 coded characters 0x0041 LATIN CAPITAL LETTER A and 0xFF21
FULLWIDTH LATIN CAPITAL LETTER A have identical forms. In addition, the
following characters also have identical representative forms in
ISO/IEC 10646:

0060 GRAVE ACCENT
02CB MODIFIER LETTER GRAVE ACCENT (Mandarin Chinese fourth tone)
0300 COMBINING GRAVE ACCENT (Varia)
0953 DEVANAGARI GRAVE ACCENT
1FEF GREEK VARIA
FF40 FULLWIDTH GRAVE ACCENT

Many additional counterexamples can be provided.

(2) The following characters in ISO/IEC 10646 may have the same function:
to denote the integer 1:

0031 DIGIT ONE
00B9 SUPERSCRIPT ONE
0661 ARABIC-INDIC DIGIT ONE
06F1 EXTENDED ARABIC-INDIC DIGIT ONE
0967 DEVANAGARI DIGIT ONE
09E7 BENGALI DIGIT ONE
09F4 BENGALI CURRENCY NUMERATOR ONE
0A67 GURMUKHI DIGIT ONE
0AE7 GUJARATI DIGIT ONE
0B67 ORIYA DIGIT ONE
0BE7 TAMIL DIGIT ONE
0C67 TELUGU DIGIT ONE
0CE7 KANNADA DIGIT ONE
0D67 MALAYALAM DIGIT ONE
0E51 THAI DIGIT ONE
0ED1 LAO DIGIT ONE
2081 SUBSCRIPT ONE
215F FRACTION NUMERATOR ONE
2160 ROMAN NUMERAL ONE
2170 SMALL ROMAN NUMERAL ONE
2460 CIRCLED DIGIT ONE
2474 PARENTHESIZED DIGIT ONE
2488 DIGIT ONE FULL STOP
2776 DINGBAT NEGATIVE CIRCLED DIGIT ONE
2780 DINGBAT CIRCLED SANS-SERIF DIGIT ONE
278A DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE
3021 HANGZHOU NUMERAL ONE
3192 IDEOGRAPHIC ANNOTATION ONE MARK
3220 PARENTHESIZED IDEOGRAPH ONE
3280 CIRCLED IDEOGRAPH ONE
4E00 CJK UNIFIED IDEOGRAPH-4E00
58F9 CJK UNIFIED IDEOGRAPH-58F9
FF11 FULLWIDTH DIGIT ONE

Given the above, it is not possible to uniformly use either formal or
functional criteria in order to establish distinctness. The only thing
left is a character's name, which, in the current ISO character standards,
is the only normative property of a character.
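
The point can be checked mechanically. The following is a minimal Python
sketch, assuming the standard library's unicodedata module as a stand-in for
the ISO/IEC 10646 name list. The helper name distinct_by_name and the choice
of sample characters are illustrative only and are not part of either
proposal. It takes a few characters from each list above and shows that they
share a function (denoting the integer 1) or a form, yet remain distinct by
name:

```python
import unicodedata

# A few of the characters listed above that can denote the integer 1.
ONES = ["\u0031", "\u00B9", "\u0661", "\u0967", "\u2160", "\uFF11"]

# The grave-accent look-alikes listed above.
GRAVES = ["\u0060", "\u02CB", "\u0300", "\u0953", "\u1FEF", "\uFF40"]


def distinct_by_name(chars):
    """Return True if no two characters in chars share a name."""
    names = [unicodedata.name(c) for c in chars]
    return len(set(names)) == len(names)


# Every sample in ONES has the numeric value 1, so "function" cannot
# tell them apart; the names still can.
same_function = all(unicodedata.numeric(c) == 1 for c in ONES)

print(same_function, distinct_by_name(ONES), distinct_by_name(GRAVES))
for c in ONES:
    print("%04X %s" % (ord(c), unicodedata.name(c)))
```

The name is the only property the sketch uses to decide distinctness; the
numeric value is consulted only to show that it cannot serve that purpose.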

* Do we really need this "bit combinations" nonsense? Isn't
octet good enough? If not, bit combinations needs to be
treated more rigorously.

Yes, "bit combination" is needed. For example, in the case of ISO/IEC 10646
UCS-2 (and Unicode), and UCS-4, one is dealing with bit combinations that
consist of 16 and 32 bits respectively. This has an impact on byte (and
half word) ordering since these standards do not dictate byte (or half word)
order in internal or external coded representations.

While 10646 does specify that a big-endian byte order should be used when
serializing (coded representations) as octets, it does not state that one
must serialize it as octets. Unicode specifically defines a "byte-order
mark" property of the coded character FEFF ZERO WIDTH NO-BREAK SPACE in
order to provide a self-identification mechanism to resolve byte order
differences.

See the note under definition 4.24 of ISO 8879:

NOTE - a bit combination should not be confused with a "byte", which is a
name given to a particular size of bit string, typically seven or eight
bits. A single bit combination could contain several bytes.
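
To make the byte-order issue concrete, here is a minimal Python sketch,
assuming 16-bit bit combinations (as in UCS-2) that are serialized as octets.
The function names serialize and deserialize are hypothetical, and the
fallback to big-endian order when no byte-order mark is present is an
assumption of this sketch rather than a rule taken from either standard:

```python
import struct

BOM = 0xFEFF          # ZERO WIDTH NO-BREAK SPACE, used as byte-order mark
SWAPPED_BOM = 0xFFFE  # what a reader sees if it guessed the wrong order


def serialize(bit_combinations, big_endian=True):
    """Write each 16-bit bit combination as two octets."""
    prefix = ">" if big_endian else "<"
    return struct.pack("%s%dH" % (prefix, len(bit_combinations)),
                       *bit_combinations)


def deserialize(octets):
    """Recover 16-bit bit combinations, using a leading BOM if present."""
    if len(octets) % 2:
        raise ValueError("odd number of octets")
    count = len(octets) // 2
    values = struct.unpack(">%dH" % count, octets)
    if values and values[0] == BOM:
        return list(values[1:])
    if values and values[0] == SWAPPED_BOM:
        # The data is little-endian: reread it with the other byte order.
        return list(struct.unpack("<%dH" % count, octets)[1:])
    return list(values)  # assumption: big-endian when no BOM is present


data = [BOM, 0x0041]
print(serialize(data, big_endian=True))
print(serialize(data, big_endian=False))
print(deserialize(serialize(data, big_endian=False)))
```

The same bit combination 0x0041 becomes two different octet sequences, which
is why a definition stated purely in terms of octets loses information.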

----------------------------------------------------------------------
----------------------------------------------------------------------

Regarding the definitions you currently provided in your draft RFC:

----------------------------------------------------------------------

[Dan]
character : an atom of information; for example, a letter or a digit.

[Glenn]
(a) character : a unit of information used for the organisation,
control, or representation of data.

There isn't a whole lot to argue here. Dan's definition is a bit closer
to the ISO 8879 definition which uses the phrase "atom of information".
My definition is a bit closer to the standard SC2 definition, but omits
reference to a containing set.

If an example is necessary, as in "for example, a letter or digit", then
I would prefer making it into a note rather than as part of the definition.
I would also extend that note to read something like the following:

NOTE - when representing data, the nature of that data is generally
symbolic as opposed to some other kind of data (e.g., numeric, aural,
visual, etc.). Examples of such symbolic data include: letters, syllabograms,
ideographs, digits, punctuation, technical symbols, dingbats, etc.

In no case, however, should the following information be omitted:

NOTE - a character is associated with a name and, optionally, with
a representative image (rendering).

----------------------------------------------------------------------

[Dan]
coded character set : a function whose domain is a subset of the integers,
and whose range is a set of characters.

[Glenn]
(f) coded character set : a one-to-one mapping from a character
repertoire to a code set.

These two definitions are considerably different.

Dan uses the phrase "set of characters" without defining the
membership criteria for such a set. I avoid this by not using
the term set at all, and instead using the formally defined term
"character repertoire" which is based on the notion of a collection
(which has no necessary membership criteria) rather than a set.

Dan also uses "a subset of the integers" where I use the formally
defined term "code set".

Finally, I express the relationship as a "one-to-one", but not "onto"
mapping, while Dan expresses the relationship as a function. More
importantly, I define the relationship as a mapping from characters to
code set positions, while Dan expresses it as a function from integers
(i.e., code set positions) to characters.

In practice, the designers of coded character sets start with a
character repertoire, and then, based on the expected number of
elements in the repertoire, determine a code set architecture which
will accommodate the repertoire. To express a coded character set
as a function from integers (as code set positions) to characters
fails to recognize the ontological priority of characters.
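
The difference in direction can be illustrated with a minimal Python sketch.
Everything in it is hypothetical: the three-character repertoire, the 7-bit
code set, and the helper names are chosen only for illustration. It builds
the mapping in the direction of definition (f), checks that it is one-to-one
but not onto, and then derives the integer-to-character view as its inverse:

```python
# Hypothetical repertoire: characters identified by their names.
REPERTOIRE = [
    "LATIN CAPITAL LETTER A",
    "LATIN CAPITAL LETTER B",
    "DIGIT ONE",
]

# Hypothetical code set: every 7-bit combination, as an integer position.
CODE_SET = range(0x00, 0x80)

# Definition (f): a mapping from the repertoire to the code set.
CCS = {
    "LATIN CAPITAL LETTER A": 0x41,
    "LATIN CAPITAL LETTER B": 0x42,
    "DIGIT ONE": 0x31,
}


def is_one_to_one(mapping):
    """No two characters share a code set position."""
    return len(set(mapping.values())) == len(mapping)


def is_onto(mapping, code_set):
    """Every code set position is used by some character."""
    return set(mapping.values()) == set(code_set)


# The integer-to-character view is the inverse, defined only where a
# character has been assigned.
POSITION_TO_CHARACTER = {pos: name for name, pos in CCS.items()}
unassigned = [p for p in CODE_SET if p not in POSITION_TO_CHARACTER]

print(is_one_to_one(CCS), is_onto(CCS, CODE_SET), len(unassigned))
```

In this sketch the inverse is only a partial function on the code set: most
positions have no character at all, which the repertoire-first direction
states directly.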

----------------------------------------------------------------------

[Dan]
code position : an integer that maps to a character via some
coded character set.

[Glenn]
(e) code set position : the location of a bit combination in a code
set; it corresponds to the numeric value of the bit combination.

Dan refers to both character and coded character set. In contrast, my
definition refers to only "bit combination" and "code set", both of
which are defined independently of any reference to "character" or
"coded character set".

Dan's definition also is deficient in that there are code positions
which do not map to (correspond with) any character.

----------------------------------------------------------------------

[Dan]
character repertoire : a set of characters; that is, the range of a
coded character set.

[Glenn]
(b) character repertoire : a collection of distinct characters

NOTE - two characters are distinct if and only if they have distinct
names in the context of an identified character repertoire.

Dan's definition is circular since it defines itself as the range of
a coded character set whose definition is in terms of a function
whose "range is a set of characters", i.e., character repertoire.

Furthermore, Dan's definition fails to specify an extremely important
aspect of a repertoire; namely, that its elements are distinct, and
the basis for determining that distinction.

Finally, we need to maintain a referential separation between a character
repertoire and its coding as a coded character set.

----------------------------------------------------------------------

[Dan]
character encoding : a function whose range [sic] is the set of sequences
of octets, and whose range is the set of sequences of characters
over some character repertoire.

[Glenn]
(j) character encoding scheme : an algorithm which specifies a unique
mapping from a sequence of bit combinations to a sequence of
characters each of which is a member of some identified character
repertoire.

(j') character encoding scheme : an algorithm which specifies a unique
mapping from a sequence of bit combinations to a sequence of
integers, each of which is interpreted as the principal coded
representation of a character in some identified coded character set.

First, Dan's definition seems to have a typo in that it specifies a function
without a domain. I would guess he means "a function whose domain is the set
of sequences of octets ...".

Now, Dan's definition is in terms of a function rather than an algorithm.
While the use of the term "function" may apply in an abstract sense, it
is not very useful in practice, particularly since the presumed domain of
this function is expressed as the infinite set consisting of all
"sequences of octets". Even this is too broad a definition, since, in fact,
no character encoding scheme operates on all sequences of octets.

In practice, character encoding schemes are expressed algorithmically rather
than in a set-theoretic form. Furthermore, the domain of such an algorithm
is usually limited to sequences of coded (character) representations possibly
interspersed by escape sequences and/or control functions. This is a greatly
reduced set in comparison to the set consisting of all sequences of octets.

Finally, Dan's definition expresses that the sequences of characters which
constitute the range of the function are members of a single character
repertoire. This is not true in practice. For example, the ISO-2022-JP-2
character encoding scheme makes reference to 9 character repertoires in
its definition. Nowhere does it refer to the superset of these repertoires
as a single repertoire.
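
As an illustration, the following is a toy Python sketch of an encoding
scheme expressed as an algorithm. It is not ISO-2022-JP-2: it assumes only
two coded character sets, ASCII and JIS X 0201 Roman, selected by the
ISO 2022 escape sequences ESC ( B and ESC ( J, restricts both to the graphic
positions 0x20 to 0x7E, and starts in ASCII. The function name decode and the
table names are hypothetical:

```python
ESC = 0x1B

# Two coded character sets over the same 7-bit graphic positions.
ASCII = {b: chr(b) for b in range(0x20, 0x7F)}
JIS_ROMAN = dict(ASCII)
JIS_ROMAN[0x5C] = "\u00A5"  # YEN SIGN replaces REVERSE SOLIDUS
JIS_ROMAN[0x7E] = "\u203E"  # OVERLINE replaces TILDE

# The two octets that follow ESC select the active coded character set.
DESIGNATIONS = {b"(B": ASCII, b"(J": JIS_ROMAN}


def decode(octets):
    """Map a sequence of octets to characters, honouring escape sequences."""
    current = ASCII  # assumption: the scheme starts in ASCII
    out = []
    i = 0
    while i < len(octets):
        b = octets[i]
        if b == ESC:
            final = bytes(octets[i + 1:i + 3])
            if final not in DESIGNATIONS:
                raise ValueError("unknown escape sequence at offset %d" % i)
            current = DESIGNATIONS[final]
            i += 3
        elif b in current:
            out.append(current[b])
            i += 1
        else:
            raise ValueError("octet 0x%02X is outside the domain" % b)
    return "".join(out)


# The bit combination 0x5C yields a different character in each state.
print(decode(b"\\\x1b(J\\\x1b(B\\") == "\\\u00A5\\")
```

Two things follow from the sketch: the result draws on more than one
repertoire, and octet sequences such as b"\xff" are simply not in the
algorithm's domain.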

----------------------------------------------------------------------

The reason I separated my set of definitions into four categories in my
original message was that these categories facilitate a non-cyclical,
directed graph in terms of reference. There are actually two graphs
depending on which definition is chosen for "character encoding scheme".

If we use definition (j), we have:

      Category 1                        Category 2
     (characters)                        (codes)
      |    |                            |    |
      |    ------------      ------------    |
      |                \    /                |
      |                 \  /                 |
      |                 /  \                 |
      |                /    \                |
      |    ------------      ------------    |
      |    |                            |    |
      Category 3                        Category 4
  (coded characters)                (encoding schemes)

If we use definition (j'), we have:

      Category 1          Category 2
     (characters)          (codes)
          |                   |
          -----           -----
              |           |
               Category 3
           (coded characters)
                   |
                   |
               Category 4
           (encoding schemes)

In this case, category 4 also makes use of terms from categories
1 and 2, i.e., it inherits these definitions from category 3.
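
The non-cyclical property can be checked mechanically. The sketch below is a
minimal Python example, assuming Python 3.9 or later for the standard graphlib
module. The dictionaries are my own transcription of the two graphs above,
plus the repertoire / coded character set pair criticized earlier as circular.
The function name definition_order is hypothetical:

```python
from graphlib import CycleError, TopologicalSorter

# Each key refers to (depends on) the categories in its value set.
GRAPH_J = {
    "characters": set(),
    "codes": set(),
    "coded characters": {"characters", "codes"},
    "encoding schemes": {"characters", "codes"},
}

GRAPH_J_PRIME = {
    "characters": set(),
    "codes": set(),
    "coded characters": {"characters", "codes"},
    "encoding schemes": {"coded characters"},
}

# A pair of definitions that refer to each other.
CIRCULAR = {
    "character repertoire": {"coded character set"},
    "coded character set": {"character repertoire"},
}


def definition_order(graph):
    """Return the terms so that each follows the terms it refers to,
    or None if the references form a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


print(definition_order(GRAPH_J) is not None)
print(definition_order(GRAPH_J_PRIME))
print(definition_order(CIRCULAR))
```

A valid ordering exists for both (j) and (j'), whereas the mutually
referring pair has none.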

----------------------------------------------------------------------

I could go on to provide even more comments about Dan's definitions;
however, I don't currently have the time to do so. I would suggest
that the IETF WGs do not rush to accept Dan's definitions as currently
stated. I have taken considerable pains to present what I think is
not only a more logically consistent set of definitions but also one
which attempts to reflect both current practice in coded character set
design and in the existing usage of terms surrounding this area.

Regards,
Glenn Adams

[P.S. You may wish to review the paper entitled "Character/Glyph Operational
Model" at <URL:http://www.stonehand.com/unicode/standard/cgmodel.html> for
more background information.]