Character Set Terminology, SC2 vs. SC18 vs. Internet Standards [LONG]

Dan Connolly (connolly@w3.org)
Sun, 9 Apr 95 09:52:52 EDT

Glenn Adams writes:
>

Lots of stuff, included below (apologies to those who have seen this
already). He presents a lot of background and evidence, along with
a proposal.

While the background is invaluable, I don't see the proposal as a
significant improvement in clarity over the existing published
verbiage.

I think my draft now stands on its own. Please review it (I'm sending
it out as an internet draft, and it's always available at:
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html). I
believe that it presents (and justifies) a terminology that is
rigorous, internally consistent, and sufficiently expressive.

The terminology has not changed in substance since I originally mailed
it to the MIMESGML working group mailing list (is that archived?). I am
quite confident that it is internally consistent, and at least for the
purposes of HTML 2.0, sufficiently expressive. In fact, I believe it
is sufficiently expressive for general ISO standards use.

The draft probably still needs a few more examples, and some general
polishing, and I welcome comments and questions. (The answers usually
make good supporting material for the draft.)

Comments on Glenn's Proposal:

> (b) character repertoire : a collection of distinct characters
>
> NOTE - two characters are distinct if and only if they have distinct
> names in the context of an identified character repertoire.

The word is _set_. "collection of distinct..." is just the kind of
dancing around established mathematical terminology that got us in the
mess we're in. And this stuff about "in the context of an identified ..."
is dangerously misleading.

The notion of character is primitive. The other definitions are built
on top of it. It's all well and good to map characters to names via
some widely published 1-1 function F, and to deduce from F(c1) = F(c2)
that c1 = c2. But to define characters in terms of their names is
absurd: what are the names made of, after all?
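
A minimal sketch of that point, assuming Python 3 and its unicodedata
module (neither is part of this thread). The sample repertoire is
hypothetical. The characters are taken as given, the names are looked up
afterwards through a function F, and the only thing checked is that F is
one-to-one:

```python
import unicodedata

# Hypothetical sample repertoire: printable ASCII plus one fullwidth letter.
repertoire = {chr(cp) for cp in range(0x20, 0x7F)} | {"\uFF21"}

# F maps each (primitive) character to its published name.
F = {c: unicodedata.name(c) for c in repertoire}


def is_one_to_one(mapping):
    """True when no two keys share a value, so F(c1) == F(c2) implies c1 == c2."""
    return len(set(mapping.values())) == len(mapping)


print(is_one_to_one(F))
print(F["A"], "|", F["\uFF21"])
```

The names distinguish the characters, but the characters had to exist
before F could be applied to them.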

* Do we really need this "bit combinations" nonsense? Isn't
octet good enough? If not, bit combinations needs to be
treated more rigorously.

I think some examples in the "Character Encoding" section about how,
for example, 7bit and 16bit character encodings are, in practice,
defined in terms of octets will allow us to dispense with this "bit
combinations" stuff.
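
As a rough illustration only: the helper below (to_octets is a
hypothetical name, and big-endian octet order is an assumption) shows how
a code position from a 7-bit or a 16-bit code ends up as whole octets in
practice.

```python
def to_octets(code_position: int, width_bits: int) -> bytes:
    """Serialize one code position as big-endian octets.

    width_bits is the nominal size of the code (7 or 16 here); the value
    is carried in a whole number of octets, padded with leading zero bits.
    """
    if code_position >= 1 << width_bits:
        raise ValueError("code position does not fit in the stated width")
    n_octets = (width_bits + 7) // 8
    return code_position.to_bytes(n_octets, "big")


print(to_octets(0x41, 7).hex())     # 7-bit code: one octet, high bit zero
print(to_octets(0xFF21, 16).hex())  # 16-bit code: two octets
```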

Here's Glenn's message...

>
> This message started out as a response to Dan Connolly's rough draft
> of a new Internet RFC which he has tentatively entitled "'Character Set'
> considered harmful" at the suggestion of John Klensin. As my response
> developed, I decided that it was worth sending to a broader group of
> people in order to spark further discussion (and hopefully not spark
> too many flames).
>
> As John Klensin pointed out at the recent HTML-WG meeting, the ISO
> standards are generally inconsistent regarding terminology and that
> makes it difficult for IETF to adhere to ISO terms or to modify the
> terms developed within the IETF.
>
> Various IETF groups are now trying to deal (yet again) with some of
> these issues, and an impromptu lunch yesterday led to the rough
> draft from Dan Connolly.
>
> As the editor for the upcoming version of Unicode, I too am having to
> re-address some of these issues as the Unicode Technical Committee attempts
> to deal with a world in which a number of encoding forms for Unicode are
> possible and the eventuality that characters in 10646 outside
> the basic multilingual plane will be accessible and therefore defined
> within the Unicode coding framework (i.e., through UTF-16).
>
> The following is divided into three parts:
>
> 1. comments on basic SC2/SC18 differences
> 2. revised proposal for character set terminology for use in the RFC
> 3. current standard definitions
>
> For those who haven't asked for this correspondence, please let me
> know if you wish not to receive any further responses. However, if you
> have a stake in this discussion, I would encourage your participation
> as we may be able to influence the direction new internet standards
> take in this area.
>
> Regards,
> Glenn Adams
>
> ---------------------------------------------------------------------
> COMMENTS ON SOME BASIC SC2/SC18 DIFFERENCES
> ---------------------------------------------------------------------
>
> I've extracted the relevant definitions from SC2 and SC18 standards and
> provided them below. A number of comments are required:
>
> 1. All ISO character sets published by SC2 use consistent terminology
> for the definition of "character"; none of these definitions refer
> to "meaning".
>
> 2. The definition of "character" specified in ISO 8879 (as developed
> by SC18) is significantly different from the SC2 definition in the
> following respect: the ISO 8879 definition attributes an "individual
> meaning" to each character as "defined by a character repertoire".
> The term "meaning" is problematic.
>
> This language is not used by SC2 in its definition of characters
> or character sets. In contrast, SC2 defines "characters" by specifying
> for each character (1) a (normative) name which is unique within a
> particular character set; and (2) an (informative) rendering of the
> character in the case of graphic characters.
>
> Even though SC2 does assign a name to each character, it neither defines
> any additional explicit semantics for a character, nor does it imply that
> additional semantics may be inferred from the name. [In practice, however,
> a character's semantics are often inferred from its name; nevertheless, from
> a definitional perspective, no SC2 standard admits to such an inference.]
>
> Given the above, the notion of "uniqueness" and "sameness" devolves to
> the examination of character names regardless of any other possible
> criteria for equivalence. E.g., the following (octet serialized) coded
> representations are equivalent because each maps to the character of
> ISO/IEC 10646 whose name is "FULLWIDTH LATIN CAPITAL LETTER A"
>
> UCS-2  FF 21        FULLWIDTH LATIN CAPITAL LETTER A
> UCS-4  00 00 FF 21  FULLWIDTH LATIN CAPITAL LETTER A
> UTF-8  EF BC A1     FULLWIDTH LATIN CAPITAL LETTER A
>
> In contrast,
>
> UCS-2 00 41 LATIN CAPITAL LETTER A
>
> is not the "same" character as the former in terms of the identification
> of character names, even though an application (or a user) may wish to
> treat these as having the same "meaning" in the sense of the user's
> interpretation of linguistic meaning.
>
> 3. The definition of "character repertoire" as used by ISO 8879
> similarly refers to each character having a "meaning" in contrast to
> the SC2 definition (see ISO/IEC 2022 terms below) which does not.
>
> If by "meaning" ISO 8879 denotes the name of the character and nothing
> more, then the SC18 definition would correspond operationally to the SC2
> definition. However, as it currently stands, the SC18 use of "meaning"
> is unclear. If by "meaning" is meant "function" or "usage", then the
> SC18 definition is clearly wrong since neither function nor usage is
> defined or limited for a particular character. For example, LATIN
> CAPITAL LETTER A may mean an element of the English alphabet in an
> English-language text, or it may mean a hexadecimal digit in a printed form
> of binary data, etc. In general, the SC2 position on the "meaning of a
> character" in the broader sense is that it can only be determined by
> the application(s) which employ the character.
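
The FULLWIDTH LATIN CAPITAL LETTER A example in comment 2 can be checked
mechanically. This is only a sketch, assuming Python 3: it has no "ucs-2"
or "ucs-4" codec, so utf-16-be and utf-32-be stand in for them (they agree
for this character), and the UTF-8 octets are taken as EF BC A1, which is
what Python's codec produces for it.

```python
import unicodedata

# Octet-serialized forms of one character. utf-16-be and utf-32-be are
# stand-ins for UCS-2 and UCS-4; they coincide for characters in the
# Basic Multilingual Plane such as this one.
forms = {
    "utf-16-be": bytes.fromhex("FF21"),
    "utf-32-be": bytes.fromhex("0000FF21"),
    "utf-8": bytes.fromhex("EFBCA1"),
}

# Map each coded representation back to a character, then to its name.
names = {codec: unicodedata.name(octets.decode(codec))
         for codec, octets in forms.items()}
other = unicodedata.name(bytes.fromhex("0041").decode("utf-16-be"))

all_same = len(set(names.values())) == 1
print(names["utf-8"], "| all forms agree:", all_same)
print(other)
```

Sameness here is decided by name alone: the three forms agree with each
other, and none of them agrees with 00 41.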
>
> ---------------------------------------------------------------------
> PROPOSED REVISIONS TO ROUGH DRAFT OF RFC ON CHARACTER SET TERMINOLOGY
> ---------------------------------------------------------------------
>
> In Dan's rough draft of the RFC mentioned above, the various character
> set terms are currently presented in the following order:
>
> character
> coded character set
> code position
> character repertoire
> character encoding
>
> I would suggest these terms be presented in the following order and
> revised using the definitions which I've provided below. I arrange
> these in four categories related to characters, codes, coded characters,
> and coded character encoding schemes, respectively.
>
> Category 1
>
> character - use modified SC2 definition
> character repertoire - new definition
>
> Category 2
>
> bit combination - use SC18 definition
> code set - use SC18 definition
> code set position - use SC18 definition
>
> Category 3
>
> coded character set - new definition
> coded representation - new definition
> principal coded representation - new definition
> alternate coded representation - new definition
>
> Category 4
>
> character encoding - new definition
>
> Category 1
>
> (a) character : a unit of information used for the organisation,
> control, or representation of data.
>
> NOTE - a character is associated with a name and, optionally, with
> a representative image (rendering).
>
> The SC2 definition (from 10646) is a good foundation and has been
> accepted by both SC2 and SC18 in recent discussions on unifying
> terminology. However, as Dan points out, the SC2 definition refers
> to a character as "a member of a set of elements". I've altered the
> language to remove the reference to an encompassing set.
>
> This altered definition brings it into somewhat closer concord with
> the SC18 definition as an "atom of information" but without introducing
> the term "meaning" or "repertoire".
>
> The note provided with the definition indicates the two formal
> properties of a character: (1) it is named; and (2) it may have
> a representative image.
>
> (b) character repertoire : a collection of distinct characters
>
> NOTE - two characters are distinct if and only if they have distinct
> names in the context of an identified character repertoire.
>
> NOTE - two characters that are distinct in name may have identical
> images (renderings).
>
> NOTE - SC2 standards sometimes refer to a character repertoire as
> a "character set"
>
> This definition differs from both SC2 and SC18 definitions in that
> (1) unlike the SC2 definition (in ISO/IEC 2022), it does not refer to
> bit combinations or a coded character set; and (2) unlike the SC18
> definition, it does not require its elements to either be "used together"
> or to have a defined meaning. What it does do, however, is focus on
> distinctness as a function of the distinctness of character names (as
> the only identified "meaning" specified by SC2). Further, as the
> note points out, distinctness does not imply distinct renderings.
>
> Category 2
>
> (c) bit combination : an ordered collection of bits, interpretable
> as a binary number
>
> The SC18 definition is preferable to the SC2 definition (from ISO/IEC 2022)
> since the latter refers to both characters and their representation. Further,
> the SC18 definition specifies that a bit combination may be interpreted as
> a single number.
>
> (d) code set : a set of bit combinations of equal size, ordered by
> their numeric values, which must be consecutive.
>
> As specified by ISO 8879.
>
> (e) code set position : the location of a bit combination in a code
> set; it corresponds to the numeric value of the bit combination.
>
> As specified by ISO 8879.
>
> Category 3
>
> (f) coded character set : a one-to-one mapping from a character
> repertoire to a code set.
>
> NOTE - no "onto" relationship is implied; that is, there may be
> bit combinations in a code set which do not correspond to a character
> in the character repertoire.
>
> NOTE - SC18 standards sometimes refer to a coded character set as
> a "character set"
>
> Both SC2 and SC18 definitions are problematic, so a new definition is
> provided. The SC2 definitions differ (as to whether one-to-one is specified),
> refer to "character set" (used there to mean character repertoire), and
> refer to "coded representation" for which we need to establish multiple
> representations (see below). The SC18 definition is bad because it uses
> the term "character set" instead of "coded character set" and therefore
> conflicts with the SC2 definition wherein "character set" is used to
> refer to "character repertoire". Furthermore, the SC18 definition uses
> the term "onto" which may erroneously be interpreted as defining a covering
> "onto" relationship. The new definition also avoids using the term
> "coded representation" as found in the SC2 definitions.
>
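
Definitions (d) through (f) can be modelled directly. The sketch below
assumes Python 3 and a hypothetical repertoire of 26 capital letters
mapped into a 7-bit code set; the helper names are illustrative and come
from no standard.

```python
# Code set: bit combinations of equal size with consecutive numeric
# values. Here, the 128 positions of a 7-bit code.
code_set = range(0x00, 0x80)

# Hypothetical coded character set: 26 capital Latin letters, each mapped
# to one code set position.
coded_character_set = {chr(0x41 + i): 0x41 + i for i in range(26)}


def is_one_to_one(mapping):
    """No two characters share a code set position."""
    return len(set(mapping.values())) == len(mapping)


def is_onto(mapping, positions):
    """Every position in the code set is assigned to some character."""
    return set(mapping.values()) == set(positions)


used = set(coded_character_set.values())
within = all(pos in code_set for pos in used)
unused = [pos for pos in code_set if pos not in used]

print(is_one_to_one(coded_character_set), is_onto(coded_character_set, code_set))
print(len(unused), "positions carry no character")
```

The mapping is one-to-one without being onto, which is the situation the
first NOTE under (f) describes.
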
> (g) coded representation : a sequence of one or more bit combinations which
> unambiguously represent a character in the domain of an identified
> coded character set.
>
> NOTE - in the present context, "coded representation" implies an
> object which is represented; namely, a character, and not a non-
> character unit of information.
>
> NOTE - a given character may be represented according to more than one
> coded representation. Each distinct coded representation of a character
> is referred to as a "coded representation form" or as a "form of use".
>
> SC2 standards do not explicitly define this term; on the other hand, SC18
> standards (ISO 8879) do not admit to multiple coded representation forms;
> therefore a new definition is required to permit the latter.
>
> (h) principal coded representation : a coded representation consisting of
> a single bit combination in the range of an identified coded character set.
>
> NOTE - the bit combination of a character's principal coded representation
> is identical to the bit combination of the code set position assigned to
> that character by the identified coded character set.
>
> NOTE - the principal coded representation of a character may or may not
> serve as its canonical representation according to the customary usage
> established by an identified coded character set; that is, a coded
> character set may specify that a coded representation form other than
> the principal coded representation serve as the canonical form of use.
>
> NOTE - ISO 8879:1986 refers to a principal coded representation as "coded
> representation"; that is, it does not explicitly permit more than one
> coded representation form.
>
> This is a new term required to distinguish between the single bit combinations
> which comprise the code set serving as the range of a coded character set and
> other bit combinations or sequences of bit combinations which may alternatively
> represent a given character.
>
> (i) alternate coded representation : a coded representation of a character
> other than its principal coded representation.
>
> Category 4
>
> Finally, we arrive at the raison d'etre of this entire discussion; namely,
> how to specify the overall encoding used by a particular document which
> satisfies the MIME content type of text/*. With this content type one may
> optionally specify a parameter introduced by the token "charset". A number
> of values for this parameter are currently specified and administered by
> the IANA (Internet Assigned Numbers Authority). Among these values are:
>
> us-ascii
> iso-10646-ucs-2
> iso-2022-jp-2
> iso-2022-kr
> euc-kr
> ...
>
> The first thing to notice here is that not all of these tokens denote
> entities which may be described as coded character sets or character
> repertoires. For example, "iso-2022-jp-2" is defined in RFC 1554 (by
> M. Ohta) as "a text encoding scheme." This RFC goes on to state:
>
> "The text with "ISO-2022-JP-2" starts in ASCII, and switches to other
> character sets of ISO 2022 through limited combinations of escape
> sequences."
>
> In RFC 1557 (U. Choi, et al.), "iso-2022-kr" and "euc-kr" are defined
> as "encoding methods". The description says:
>
> "It is assumed that the starting code of the message is ASCII. ASCII
> and Korean characters can be distinguished by use of the shift function.
> For example, the code SO will alert us that the upcoming bytes will be
> a Korean character as defined by KSC 5601. To return to ASCII, the SI
> code is used."
>
> "The KSC 5601 character set that includes Hangul, Hanja (Chinese
> ideographic characters), graphic and foreign characters, etc., is
> two bytes long for each character."
>
> It is clear from the above that the values expressed by the "charset"
> parameter are not simply character sets (e.g., us-ascii), but, rather,
> character encoding schemes (or coding schemes). For example, ISO-2022-JP-2
> is actually shorthand for an ISO 2022 announcer sequence
>
> ESC 2/0 4/1     Use G0 as GL
> ESC 2/0 4/6     Use C1 with ESC Fe
> ESC 2/0 5/10    Use G2 with SS2
>
> along with an implicit promise that only escape sequences to designate
> the following character sets will be used:
>
> ASCII             ESC 2/8 4/2        G0
> JIS X 0208-1978   ESC 2/4 4/0        G0
> JIS X 0208-1983   ESC 2/4 4/2        G0
> JIS X 0201-Roman  ESC 2/8 4/10       G0
> GB2312-1980       ESC 2/4 4/1        G0
> KSC5601-1987      ESC 2/4 2/8 4/3    G0
> JIS X 0212-1990   ESC 2/4 2/8 4/4    G0
> ISO8859-1         ESC 2/14 4/1       G2
> ISO8859-7(Greek)  ESC 2/14 4/6       G2
>
> As such, this "encoding scheme" makes reference to 8 distinct coded
> character sets (in one case two versions of a single coded character set).
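
The column/row notation used for these escape sequences turns into octets
as column * 16 + row. A small sketch, assuming Python 3; the helper names
are hypothetical, and the final step relies on Python's own iso2022_jp_2
codec rather than on anything in RFC 1554 itself.

```python
ESC = 0x1B


def position_to_octet(position: str) -> int:
    """Convert ISO 2022 column/row notation such as '2/4' to an octet value."""
    column, row = (int(part) for part in position.split("/"))
    return column * 16 + row


def escape_sequence(positions: str) -> bytes:
    """Build ESC followed by the octets named in column/row notation."""
    return bytes([ESC] + [position_to_octet(p) for p in positions.split()])


# A few of the designations listed above.
designations = {
    "ASCII": "2/8 4/2",
    "JIS X 0208-1983": "2/4 4/2",
    "KSC5601-1987": "2/4 2/8 4/3",
    "ISO8859-1": "2/14 4/1",
}
for name, positions in designations.items():
    print(f"{name:16} {escape_sequence(positions)!r}")

# An octet stream that designates JIS X 0208-1983, carries two two-octet
# characters, and then returns to ASCII.
stream = escape_sequence("2/4 4/2") + b"F|K\\" + escape_sequence("2/8 4/2")
text = stream.decode("iso2022_jp_2")
print(text)
```

The escape sequences select coded character sets within the stream, and
the stream as a whole is what the "charset" value describes.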
>
> In the context of SGML (and its application in HTML), it is not
> possible to treat this parameter as specifying a single (coded) character
> set in the sense that such a (coded) character set could be declared as
> a document (coded) character set. At most, one could employ either (1)
> the code extension techniques of ISO 8879 to gain access to data
> characters available through the extension technique which were not in
> the document (coded) character set, or (2) a document (coded) character
> set whose character repertoire is a superset of the invokable character
> sets in such an encoding scheme (e.g., ISO/IEC 10646).
>
> Given the above, it is clear that the use of the token "charset" by
> MIME is problematic. Even more so is descriptive text which refers
> to this parameter as specifying the "character set" or even "coded
> character set" of a given content instance. While it is too late to
> change the token "charset" to something which better suits its use,
> it is not too late to change the descriptive text of MIME or other
> Internet standards which make use of this parameter so as to avoid
> confusing terminology.
>
> This situation seems to be clearly recognized by the current editors
> of various Internet standards. To address this issue, the following
> term is provided:
>
> (j) character encoding scheme : an algorithm which specifies a unique
> mapping from a sequence of bit combinations to a sequence of
> characters each of which is a member of some identified character
> repertoire.
>
> NOTE - no unique inverse mapping is required; that is, a sequence of
> characters may be encoded as more than one possible sequence of bit
> combinations.
>
> NOTE - the sequence of bit combinations which serve as the input
> to the mapping function may be construed as a sequence of coded
> character representations possibly preceded by, interspersed with,
> or followed by escape functions, the specification of which must be
> made explicit by the character encoding scheme.
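
To make definition (j) and its first NOTE concrete, here is a toy scheme
in the spirit of the SO/SI shifting quoted from RFC 1557. It is a sketch
under stated assumptions, not ISO-2022-KR: the header escape sequence is
omitted, decode_shifted is a hypothetical name, and the two-octet
characters are looked up through Python's euc_kr codec by setting the
high bit of each octet.

```python
SO, SI = 0x0E, 0x0F  # shift out / shift in


def decode_shifted(octets: bytes) -> str:
    """Toy SO/SI scheme: starts in ASCII; SO selects two-octet KS C 5601."""
    chars, i, two_octet = [], 0, False
    while i < len(octets):
        current = octets[i]
        if current == SO:
            two_octet, i = True, i + 1
        elif current == SI:
            two_octet, i = False, i + 1
        elif two_octet:
            # Setting the high bits gives the EUC-KR form of the same pair.
            pair = bytes([octets[i] | 0x80, octets[i + 1] | 0x80])
            chars.append(pair.decode("euc_kr"))
            i += 2
        else:
            chars.append(bytes([current]).decode("ascii"))
            i += 1
    return "".join(chars)


# Seven-bit form of one Hangul syllable, derived from its EUC-KR octets.
hangul = bytes(o & 0x7F for o in "한".encode("euc_kr"))

first = decode_shifted(b"A" + bytes([SO]) + hangul + bytes([SI]) + b"B")
# Same text with a redundant SO/SI pair inserted before the syllable.
second = decode_shifted(b"A" + bytes([SO, SI, SO]) + hangul + bytes([SI]) + b"B")
print(first, first == second)
```

Each octet sequence decodes to exactly one character sequence, but two
different octet sequences yield the same characters, so no unique inverse
exists.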
>
> A matter of discussion is whether or not the range of the mapping specified
> by an encoding scheme should be a sequence of characters or a sequence
> of principal coded (character) representations (i.e., integers). In both
> cases, the identity of a character or the character denoted by an integer
> must rely on explicit identification of either the character repertoire with
> each character (in the first case), or on the coded character set with each
> integer (in the second case).
>
> If we were to express this definition in the latter case, it would look
> something like:
>
> (j') character encoding scheme : an algorithm which specifies a unique
> mapping from a sequence of bit combinations to a sequence of
> integers, each of which is interpreted as the principal coded
> representation of a character in some identified coded character set.
>
> NOTE - the sequence of integers (principal coded representations) produced
> by the algorithm is generally interpreted as the sequence of characters
> so represented rather than as a sequence of integers per se.
>
> NOTE - no unique inverse mapping is required; that is, a sequence of
> characters may be encoded as more than one possible sequence of bit
> combinations.
>
> NOTE - the sequence of bit combinations which serve as the input
> to the mapping function may be construed as a sequence of coded
> character representations possibly preceded by, interspersed with,
> or followed by escape functions, the specification of which must be
> made explicit by the character encoding scheme.
>
> Both definitions have some shortcomings. In the case of (j),
> the output is construed as a sequence of characters each identified according
> to the character repertoire of which it is a member. We are not in the habit
> of identifying character repertoires so much as coded character sets, although
> a coded character set cospecifies both a character repertoire and a
> corresponding code set. In the case of (j'), the result is expressed as a
> sequence of integers rather than characters, even though the end result is
> a sequence of characters.
>
> My current tendency is to favor (j) over (j').
>
> ---------------------------------------------------------------------
> CURRENT STANDARD DEFINITIONS
> ---------------------------------------------------------------------
>
> ISO/IEC 10646-1:1993
>
> 4.6 character : a member of a set of elements used for the organisation,
> control, or representation of data.
>
> 4.8 coded character : a character together with its coded representation.
>
> 4.9 coded character set : a set of unambiguous rules that establish a
> character set and the relationship between the characters of the set and
> their coded representation.
>
> 4.3 canonical form : the form with which characters of this coded character
> set are specified using four octets to represent each character.
>
> Other terms used but not defined by ISO/IEC 10646-1:1993
>
> coded representation form -
>
> 14 Coded representation forms of the UCS
>
> ISO/IEC 10646 provides two alternative forms of coded representation
> of characters.
>
> 14.1 Two-octet BMP form
>
> This coded representation form permits the use of characters from the
> Basic Multilingual Plane with each character represented by two octets.
>
> NOTE - a coded graphic character using the two-octet BMP form may
> be implemented by a 16-bit integer for processing.
>
> 14.2 Four-octet canonical form
>
> The canonical form permits the use of all the characters of ISO/IEC 10646,
> with each character represented by four octets.
>
> NOTE - a coded graphic character using the four-octet canonical form may
> be implemented by a 32-bit integer for processing.
>
> Annex P (Normative) UCS Transformation Format 8 (UTF-8) [in ISO/IEC
> 10646-1: 1993/AMD.2: 1995 (E)]
>
> UTF-8 is an alternative coded representation form for all the characters of
> the UCS.
>
> ISO/IEC 2022:1994 (E)
>
> 4.1 bit combination : an ordered set of bits used for the representation
> of characters.
>
> 4.5 coded character set; code : a set of unambiguous rules that establish
> a character set and the one-to-one relationship between the characters
> of the set and their coded representation.
>
> 4.19 repertoire : a specified set of characters to be represented by
> one or more bit combinations of a coded character set.
>
> ISO 8879:1986
>
> 4.31 character : an atom of information with an individual meaning,
> defined by a character repertoire.
>
> 4.38 character repertoire : a set of characters that are used together.
> Meanings are defined for each character, and can also be defined for
> control sequences of multiple characters.
>
> 4.39 character set : a mapping of a character repertoire onto a
> code set such that each character in the repertoire is represented
> by a bit combination in the code set.
>
> 4.24 bit combination : an ordered collection of bits, interpretable
> as a binary number
>
> 4.43 code set : a set of bit combinations of equal size, ordered by
> their numeric values, which must be consecutive.
>
> 4.44 code set position : the location of a bit combination in a code
> set; it corresponds to the numeric value of the bit combination.
>
> 4.45 coded representation : the representation of a character as a
> single bit combination in a code set.
>
> 4.36 character number : a number that represents the base-10
> integer equivalent of the coded representation of a character.
>
> 4.42 code extension : techniques for including in documents the coded
> representation of characters that are not in the document character set.
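
The "character number" of 4.36 is what appears in an SGML numeric
character reference. A brief sketch, assuming Python 3: ord() returns the
ISO/IEC 10646 code position, which matches the character number only when
the document character set agrees with 10646 at that position (as
ISO 8859-1 does). character_number_reference is a hypothetical helper.

```python
import html


def character_number_reference(char: str) -> str:
    """Write a character as a numeric character reference, e.g. '&#65;'."""
    # ord() gives the base-10 integer for the character's code position.
    return f"&#{ord(char)};"


ref = character_number_reference("A")
print(ref, html.unescape(ref))
```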
>
> Other terms used but not defined by ISO 8879:1986 -
>
> coding scheme - used in 4.93 & 4.244, e.g.
>
> 4.244 public text display version : an optional portion of a text
> identifier that distinguishes among public text that has a common
> public text description by describing the devices supported or the
> coding scheme used. If omitted, the public text is not device-dependent.
> ^^^^^^^^^^^^^
>
> -------------------------------------------------------------
> END OF TEXT