Latin-1 references

Jon_Bosak@Novell.COM
Thu, 15 Jun 95 01:30:52 EDT

Some references to the Latin-1 character set need changing in the HTML
2.0 draft. I mailed Dan about this a couple of weeks ago and would
have done so again but for his insistence that such comments be posted
to the list.

| The minimum character repertoire supported by all conforming
| HTML user agents is Latin Alphabet No. 1, or simply Latin-1.
| Latin-1 includes characters from most Western European
| languages, as well as a number of control characters. Latin-1
| also includes a non-breaking space, a soft hyphen indicator, 93
| graphical characters, 8 unassigned characters, and 25 control
| characters.

The description in the last sentence is incorrect. As stated in the
draft specification, Latin-1 is defined by ISO 8859-1, 8-Bit
Single-Byte Coded Graphic Character Sets -- Part 1: Latin Alphabet
No. 1. The character set defined by 8859-1 does not consist of "a
non-breaking space, a soft hyphen indicator, 93 graphical characters,
8 unassigned characters [whatever that means], and 25 control
characters". Rather, it consists of 191 graphic characters,
explicitly including NO-BREAK SPACE and SOFT HYPHEN, and no control
characters at all. Here are the applicable parts of the standard:

1 Scope

This part of ISO 8859 specifies a set of 191 graphic
characters identified as Latin alphabet No. 1.

[...]

5.5 graphic character: A character, other than a control
function, [...]

[...]

6.3.2 NO-BREAK SPACE (NBSP)

A graphic character [...]

6.3.3 SOFT HYPHEN (SHY)

A graphic character [...]

[...]

7 Specification of the coded character set

This part of ISO 8859 specifies 191 characters allocated to
the bit combinations of the code table (table 2). None of
these characters are "non-spacing".

[...]

7.2 Code table

[...] The shaded positions [decimal 0-31 and 127-159]
correspond to bit combinations that do not represent graphic
characters. Their use is outside the scope of ISO 8859; it
is specified in other International Standards, for example
ISO 646 or 6429.

I propose that the last sentence of the paragraph quoted above be
deleted and that the second to last sentence be amended to read
"Latin-1 comprises 191 graphic characters, including the alphabets of
most Western European languages."
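
The figure of 191 can be checked against the code table by simple counting. The following is only a rough sketch in Python, assuming the shaded positions are exactly decimal 0-31 and 127-159 as noted above; the function name is made up for illustration.

```python
def latin1_graphic_positions():
    """Return the code positions that ISO 8859-1 assigns to graphic characters.

    Assumes the shaded (non-graphic) positions of the code table are
    decimal 0-31 and 127-159.
    """
    return [cp for cp in range(256) if not (cp <= 31 or 127 <= cp <= 159)]


positions = latin1_graphic_positions()
count = len(positions)          # 95 positions (32-126) plus 96 (160-255)
has_nbsp = 0xA0 in positions    # NO-BREAK SPACE is a graphic character
has_shy = 0xAD in positions     # SOFT HYPHEN is a graphic character
has_tab = 9 in positions        # position 9 is not assigned by 8859-1
```

Under that assumption the count comes to 191, NO-BREAK SPACE and SOFT HYPHEN are included, and position 9 is not.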

| In SGML applications, the use of control characters is limited
| in order to maximize the chance of successful interchange over
| heterogeneous networks and operating systems. In HTML, only
| three control characters are allowed: Horizontal Tab, Carriage
| Return, and Line Feed (code positions 9, 13, and 10 in
| [ISO-8859-1]).

ISO 8859 explicitly does not define the control characters; see the
passage from 8859 quoted above. The control characters are not part
of Latin-1. Latin-1 consists only of graphic characters.

There is nothing to prevent the HTML specification from designating
the code points 9, 13, and 10 as horizontal tab, carriage return, and
line feed. But they are not part of 8859 and are not part of the
Latin-1 character set. I propose that the last sentence of this
paragraph be amended to read "In HTML, only three control characters
are allowed: Horizontal Tab, Carriage Return, and Line Feed (code
positions 9, 13, and 10)."
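
To illustrate the restriction independently of any character set standard, here is a small Python sketch that reports control positions other than 9, 10, and 13 in a byte string. Treating 0-31 and 127-159 as the control positions is my assumption, taken from the shaded area of the 8859-1 code table; the names are hypothetical.

```python
# Code positions the draft allows as control characters in HTML:
# Horizontal Tab, Line Feed, Carriage Return.
ALLOWED_CONTROLS = {9, 10, 13}


def disallowed_controls(data: bytes):
    """Return (offset, code position) pairs for control positions not allowed.

    Assumption: positions 0-31 and 127-159 are the control positions,
    i.e. the ones ISO 8859-1 leaves to other standards.
    """
    return [
        (offset, byte)
        for offset, byte in enumerate(data)
        if (byte <= 31 or 127 <= byte <= 159) and byte not in ALLOWED_CONTROLS
    ]
```

For example, `disallowed_controls(b"a\tb\r\n")` finds nothing, while a form feed (position 12) would be reported.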

| * The horizontal tab character (code position 9 in
| [ISO-8859-1]) must be interpreted as the smallest positive
| nonzero number of spaces which will leave the number of
| characters so far on the line as a multiple of 8.

ISO 8859 does not define a horizontal tab character. I propose that
the sentence above be amended to read "The horizontal tab character
(code position 9) must be interpreted ..."
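
The tab rule quoted above can be sketched as follows. This is a minimal Python illustration of the stated rule for a single line of text, not anything taken from the draft; the function name and the tab_width parameter are hypothetical.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each horizontal tab (code position 9) with spaces.

    Each tab becomes the smallest nonzero number of spaces that brings the
    number of characters so far on the line to a multiple of tab_width.
    Assumes `line` contains no line feed or carriage return.
    """
    out = []
    col = 0  # characters emitted so far on this line
    for ch in line:
        if ch == "\t":
            n = tab_width - (col % tab_width)  # always between 1 and tab_width
            out.append(" " * n)
            col += n
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

A tab after two characters expands to six spaces, and a tab at a position that is already a multiple of 8 expands to a full eight.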

|12. The ISO 8859-1 Coded Character Set
|
| This list, sorted numerically, is derived from [ISO-8859-1].
|
| REFERENCE        DESCRIPTION
| &#00; - &#08;    Unused
| &#09;            Horizontal tab
| &#10;            Line feed
| &#11; - &#12;    Unused
| &#13;            Carriage Return
| &#14; - &#31;    Unused
| &#32;            Space [etc.]

Everything below 32 on this list is derived from someplace other
than 8859 and is not part of Latin-1, which consists solely of graphic
characters.

The list can be kept exactly the way it is if the title is changed to
something like "12. The HTML Basic Coded Character Set" (or whatever
this group thinks is appropriate) and all references to it changed
accordingly. Alternatively, the part of the list from 00 through 31
can be put into a separate list entitled something like "Control
Characters for HTML". The list from 32 through 255 is correctly named
"The ISO 8859-1 Coded Character Set" and can therefore be left as is
if the section from 00 through 31 is moved into another list.

Jon

========================================================================
Jon Bosak, Novell Corporate Publishing Services jb@novell.com
2180 Fortune Drive, San Jose, CA 95131 Fax: 408 577 5020
A sponsor of the Davenport Group (ftp://ftp.ora.com/pub/davenport/)
------------------------------------------------------------------------
The Library is a sphere whose consummate center is any hexagon, and
whose circumference is inaccessible. -- Jorge Luis Borges
========================================================================