proposed changes on 'character set' issues

Larry Masinter (masinter@parc.xerox.com)
Wed, 11 Jan 95 21:19:34 EST

To address the comments made at the meeting on character sets, I
started with the text version of the HTML draft, edited it, and am
sending proposed changes as diffs.

I'm uneasy that these changes are inadequate to actually resolve the
multiple views of MIME and SGML on character sets (insofar as the DTD
still describes the BASESET, yet the charset parameter might extend
the character repertoire of the document.) I wanted to get some
feedback on this level of changes, however, before attempting anything
more aggressive.
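
As a rough illustration of the charset handling proposed in the diff
below (default to ISO-8859-1 only when the byte stream contains
non-US-ASCII data, and numeric character references for Latin-1
characters in a US-ASCII document), here is a minimal Python sketch.
The function names are hypothetical and are not part of the draft; it
assumes the body is available as a complete byte string.

```python
from typing import Optional


def effective_charset(body: bytes, declared: Optional[str] = None) -> str:
    """Return the charset a text/html body would be decoded with.

    Hypothetical helper: sketches the proposed default, not any
    existing implementation.
    """
    if declared:
        # An explicit charset parameter always wins.
        return declared.upper()
    # No parameter: stay with US-ASCII unless some byte is outside
    # the 7-bit range, in which case assume ISO-8859-1.
    if any(byte > 0x7F for byte in body):
        return "ISO-8859-1"
    return "US-ASCII"


def to_ascii_html(text: str) -> bytes:
    """Encode Latin-1 text as US-ASCII, using numeric character
    references for characters above 127.

    Markup characters (&, <, >) are not escaped here.
    """
    # Raises UnicodeEncodeError for characters outside ISO-8859-1.
    text.encode("iso-8859-1")
    return text.encode("ascii", "xmlcharrefreplace")
```

For example, to_ascii_html("caf\u00e9") produces a pure US-ASCII body,
so effective_charset() on that result stays at the US-ASCII default.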

================================================================
diff -5c html-orig.txt html-revised.txt
*** html-orig.txt Tue Jan 10 03:48:34 1995
--- html-revised.txt Tue Jan 10 03:49:29 1995
***************
*** 45,62 ****
structured documents with in-lined graphics; and
hypertext views of existing bodies of information.

HTML has been in use by the World Wide Web (WWW) global
information initiative since 1990. This specification
! corresponds to the legitimate capabilities of HTML in
common use prior to June 1994. It is defined as an
application of ISO Standard 8879:1986 Information
Processing Text and Office Systems; Standard Generalized
! Markup Language (SGML). This specificiation is proposed
! as the Internet Media Type (RFC 1590) and MIME Content
! Type (RFC 1521) called "text/html", or "text/html;
! version=2.0".

Contents

Overview of HTML Specification........................ 1

--- 45,62 ----
structured documents with in-lined graphics; and
hypertext views of existing bodies of information.

HTML has been in use by the World Wide Web (WWW) global
information initiative since 1990. This specification
! roughly corresponds to the capabilities of HTML in
common use prior to June 1994. It is defined as an
application of ISO Standard 8879:1986 Information
Processing Text and Office Systems; Standard Generalized
! Markup Language (SGML).
!
! The "text/html; version=2.0" Internet Media Type (RFC 1590) and
! MIME Content Type (RFC 1521) is defined by this specification.

Contents

Overview of HTML Specification........................ 1

***************
*** 479,490 ****
Because of the way special characters are used in
marking up HTML text, character strings are used to
represent the less than (<) and greater than (>) symbols
and the ampersand (&) as shown in Section 2.17.1.

! Representing ISO Latin-1 Characters in HTML

HTML also allows references to any of the ISO Latin-1
alphabet, using the names in the table ISO Latin-1
Character Representations, which is derived from ISO
Standard 8879:1986//ENTITIES Added Latin 1//EN. For
details, see 2.17.2.
--- 479,501 ----
Because of the way special characters are used in
marking up HTML text, character strings are used to
represent the less than (<) and greater than (>) symbols
and the ampersand (&) as shown in Section 2.17.1.

! Representing Special Characters in HTML

+ HTML inherits both from SGML and from MIME in its description
+ of characters and character sets. The result is a small
+ amount of duplication of function: there are multiple ways to
+ code characters in HTML.
+
+ HTML documents are encoded in some character encoding;
+ the character encoding may be specified, for example,
+ by the "charset" parameter associated with the "text/html"
+ media type.
+
+ Independent of the character encoding used,
HTML also allows references to any of the ISO Latin-1
alphabet, using the names in the table ISO Latin-1
Character Representations, which is derived from ISO
Standard 8879:1986//ENTITIES Added Latin 1//EN. For
details, see 2.17.2.
***************
*** 590,662 ****
specification with "deprecated" turned on. HTML user
agents generating HTML may in the spirit of
conservation, generate documents that conform to the
specification with the "recommended" sections turned on.

! 2.4 HTML and MIME

! The World Wide Web initiative (WWW) links information
! throughout the world. To do this, WWW uses the Internet
! Hypertext Transfer Protocol (HTTP), which allows
! transfer representations to be negotiated between client
! and server. Results are returned in a MIME body part.
!
! HTML is one of the representations used by WWW, and is
! proposed as a MIME content type. The definition of the
! HTML Content-Type is text/html, and has three optional
! parameters:

Level

The level parameter specifies the feature set used in
the document. The level is an integer number, implying
that any features of same or lower level may be present
! in the document. Levels are defined by this
specification.

Version

To help avoid future compatibility problems, the version
parameter may be used to give the version number of the
specification to which the document conforms. The
version number appears at the front of this document and
! within the public identifier for the SGML DTD.

! Character sets

! The charset parameter is reserved for future use. See
! Section 2.16 for a discussion of character sets and
! encodings in HTML.
!
! The actual character set used in the representation of
! an HTML document may be ISO 8859/1, or its 7-bit subset
! which is ISO 646. There is no obligation for an HTML
! document to contain any characters above decimal 127. It
! is possible that a transport medium such as electronic
! mail imposes constraints on the number of bits in a
! representation of a document, though the HTTP access
! protocol used by WWW always allows 8 bit transfer.

! When an HTML document is encoded using 7-bit characters,
! then the mechanisms of numeric character references (see
Section 2.16.2) and character entity references (see
! Section 2.16.3) may be used to encode characters in the
! upper half of the ISO 8859/1 Latin-1 set. In this way,
! documents may be prepared which are suitable for mailing
! through 7-bit limited systems.
!
! NOTE: ISO 646 is, for all intents and purposes,
! equivalent to the ANSI standard for ASCII (American
! Standard Code for Information Interchange). The only
! notable differences between the two standards are the
! names assigned to the control characters that occupy
! positions 00 through 31 and position 127 (decimal) in
! that encoding. For encoding HTML documents, only three
! control characters in ISO 646 or ASCII are relevant (see
! Section 2.16.2). These are Carriage Return (CR) at
! position 13, Line Feed (LF) at position 10, and
! Horizontal Tab (HT) at position 11.

2.5 Understanding HTML and SGML

HTML is an application of ISO Standard 8879:1986 -
Standard Generalized Markup Language (SGML). SGML is a
--- 601,661 ----
specification with "deprecated" turned on. HTML user
agents generating HTML may in the spirit of
conservation, generate documents that conform to the
specification with the "recommended" sections turned on.

! 2.4 HTML as an Internet Media Type

! This specification (and upward compatible specifications) defines
! the Internet Media Type (RFC 1590) and MIME Content Type (RFC 1521)
! called "text/html".

+ The type "text/html" accepts the following parameters:
+
Level

The level parameter specifies the feature set used in
the document. The level is an integer number, implying
that any features of same or lower level may be present
! in the document. Levels 0, 1 and 2 are defined by this
specification.

Version

To help avoid future compatibility problems, the version
parameter may be used to give the version number of the
specification to which the document conforms. The
version number appears at the front of this document and
! within the public identifier for the SGML DTD. This
! specification defines version 2.0.

! Charset

! The charset parameter (as defined in section 7.1.1 of
! RFC 1521) may be used with the text/html media type to
! specify the encoding used to represent the HTML document as
! a sequence of bytes. Normally, text/* media types
! specify a default value of US-ASCII for the charset
! parameter. However, for text/html, if the byte stream
! contains data that is not in the 7-bit US-ASCII set, the
! HTML interpreting agent should assume a default charset of
! ISO-8859-1.

! When an HTML document is encoded using US-ASCII,
! the mechanisms of numeric character references (see
Section 2.16.2) and character entity references (see
! Section 2.16.3) may be used to encode additional characters
! from ISO-8859-1.
!
! Other values for the charset parameter are not defined
! in this specification, but may be specified in future
! levels or versions of HTML.
!
! It is envisioned that HTML will use the charset parameter
! to allow support for non-Latin characters such as
! Greek, Arabic, Hebrew, Japanese, rather than relying on
! any SGML mechanism for doing so.

2.5 Understanding HTML and SGML

HTML is an application of ISO Standard 8879:1986 -
Standard Generalized Markup Language (SGML). SGML is a
***************
*** 827,841 ****
NOTE: Some non-SGML implementations only understand the
minimized syntax.

2.6.4 Special Characters

! The characters between the tags represent text in the
! ISO-Latin-1 character set, which is a superset of ASCII.
! Because certain characters will be interpreted as
! markup, they should be represented by markup - entity or
! numeric character references. For more information, see
Section 2.16.

2.6.5 Comments

To include comments in an HTML document that will be
--- 826,839 ----
NOTE: Some non-SGML implementations only understand the
minimized syntax.

2.6.4 Special Characters

! Characters that are used to represent markup (such as
! ampersand (&), less than (<) and greater than (>)) should
! themselves be represented by markup, using either entity or
! numeric character references. For more information, see
Section 2.16.

2.6.5 Comments

To include comments in an HTML document that will be
***************
*** 1258,1268 ****
causes a paragraph break, and typically provides space
above and below the quote.

Single-font rendition may reflect the quotation style of
Internet mail by putting a vertical line of graphic
! characters , such as the greater than symbol (>), in the
left margin.

Example of use:

I think the poem ends
--- 1256,1266 ----
causes a paragraph break, and typically provides space
above and below the quote.

Single-font rendition may reflect the quotation style of
Internet mail by putting a vertical line of graphic
! characters, such as the greater than symbol (>), in the
left margin.

Example of use:

I think the poem ends
***************
*** 1713,1723 ****
may be used.

- Elements that define paragraph formatting
(headings, address, etc.) must not be used.

! - The ASCII horizontal tab character must be
interpreted as the smallest positive nonzero number of
spaces which will leave the number of characters so far
on the line as a multiple of 8. Its use is not
recommended however.

--- 1711,1722 ----
may be used.

- Elements that define paragraph formatting
(headings, address, etc.) must not be used.

! - The horizontal tab character (encoded in US-ASCII
! and ISO-8859-1 as decimal 9) must be
interpreted as the smallest positive nonzero number of
spaces which will leave the number of characters so far
on the line as a multiple of 8. Its use is not
recommended however.

***************
*** 2156,2192 ****

2.16 Character Data

Level 0

! The characters between HTML tags represent text encoded
! according to ISO 8859/1 8-bit single-byte coded graphic
! character set known as Latin Alphabet No. 1, or simply
! Latin-1. There are 256 character positions in the Latin-
! 1 encoding. Latin-1 includes characters from most
! Western European languages. It consists of the space
! character, 186 characters that form a subset of the
! graphic characters in ISO 6937/2 (1983), and four
! additional characters that are intended for inclusion in
! ISO 6937/2. Also see Section 2.4.
!
! The lower 128 character positions include a space, 33
! control characters, the 26 upper- and lowercase letters
! of the english alphabet, 10 numerals and 32 other
! printing characters This subset, functionally identical
! to ASCII, is defined by ISO 646 7-bit coded character
! set for information interchange, also known as the
! International Reference Version. ISO 646 is identical in
! most respect to the ANSI standard for ASCII (American
! Standard Code for Information Interchange). The only
! significant difference between ISO 646 and ASCII is the
! specific names assigned to the control characters in
! positions 00-31 and 127.
!
! The upper 128 positions include a non-breaking space, a
! soft hyphen indicator, 93 graphical characters, 8
! unassigned characters, and 25 control characters.
Because non-breaking space and soft hyphen indicator are
not recognized and interpreted by all HTML user agents,
their use is discouraged.

There are 58 character positions occupied by control
--- 2155,2176 ----

2.16 Character Data

Level 0

! The characters between HTML tags represent text. An HTML document
! (including tags and text) is encoded using the coded character
! set specified by the "charset" parameter of the "text/html"
! media type. For levels defined in this specification, the
! "charset" parameter is restricted to "US-ASCII" or "ISO-8859-1".
! ISO-8859-1 encodes a set of characters known as Latin Alphabet
! No. 1, or simply Latin-1. Latin-1 includes characters from most
! Western European languages, as well as a number of control
! characters. Latin-1 also includes a non-breaking space, a soft
! hyphen indicator, 93 graphical characters, 8 unassigned
! characters, and 25 control characters.
!
Because non-breaking space and soft hyphen indicator are
not recognized and interpreted by all HTML user agents,
their use is discouraged.

There are 58 character positions occupied by control
***************
*** 2196,2216 ****
Because certain special characters are subject to
interpretation and special processing, information
providers and HTML user agent implementors should follow
the guidelines in Section 2.16.1.

! Certain characters may not be accessible from your
! keyboard, or some part of your system (i.e. translation
! software) may not be equipped to deal with 8-bit
! character codes. HTML and many HTML user agents provide
character entity references (see Section 2.17.2) and
numerical character references (see Section 2.17.3) to
facilitate the entry and interpretation of characters by
name and by numerical position.

Because certain characters will be interpreted as
! markup, they must be represented by markup as described
in Section 2.16.3 and Section 2.16.4.

2.16.1 Special Characters

Certain characters have special meaning in HTML
--- 2180,2197 ----
Because certain special characters are subject to
interpretation and special processing, information
providers and HTML user agent implementors should follow
the guidelines in Section 2.16.1.

! In addition, HTML provides
character entity references (see Section 2.17.2) and
numerical character references (see Section 2.17.3) to
facilitate the entry and interpretation of characters by
name and by numerical position.

Because certain characters will be interpreted as
! markup, they must be represented by entity references as described
in Section 2.16.3 and Section 2.16.4.

2.16.1 Special Characters

Certain characters have special meaning in HTML
***************
*** 2242,2284 ****

In SGML applications, the use of control characters is
limited in order to maximize the chance of successful
interchange over heterogenous networks and operating
systems. In HTML, only three control characters are
! used. The valid control characters and their
! interpretation are:
!
! Horizontal Tab (HT - 9 dec)
!
! - Interpreted as a word space in all contexts except
! preformatted text.
!
! - Within preformatted text, the tab should be
! interpreted to shift the horizontal column position to
! the next position which is a multiple of 8 on the same
! line; that is, col := (col+8) mod 8
!
! Line Feed (LF - 10 dec)
!
! - Interpreted as a word space in all contexts except
! preformatted text.
!
! - Within the Preformatted Text element, the tab
! should be interpreted as a shift to the start of a new
! line; that is, col := 0; row := row+1
!
! Carriage Return (CR - 13 dec)
!
! - Interpreted as a word space in all contexts.

2.16.3 Numeric Character References

! Any printing character within the 8-bit character
! encoding of ISO 8859/1 (256 character positions) or the
! 7-bit character encoding of ISO 646 (128 character
! positions) may be represented within the text of an HTML
! document by a numeric character reference. See Section
2.17.1 for a list of the characters, their names and
input syntax.

Two reasons for using a numeric character reference:

--- 2223,2263 ----

In SGML applications, the use of control characters is
limited in order to maximize the chance of successful
interchange over heterogenous networks and operating
systems. In HTML, only three control characters are
! used: Horizontal Tab (HT, encoded as 9 decimal
! in US-ASCII and ISO-8859-1), Carriage Return, and
! Line Feed.
!
! Horizontal Tab is interpreted as a word space in all contexts
! except preformatted text. Within preformatted text, the tab
! should be interpreted to shift the horizontal column position
! to the next position which is a multiple of 8 on the same
! line; that is, col := col + 8 - (col mod 8).
!
! Carriage Return and Line Feed are conventionally used
! to represent end of line. For Internet Media Types defined as
! "text/*", the sequence CR LF is used to represent an end of
! line. In practice, text/html documents are frequently
! represented and transmitted using an end of line convention
! that depends on the conventions of the source of the
! document; frequently, that representation consists of CR
! only, LF only, or CR LF combination. In HTML, end of line in
! any of its variations is interpreted as a word space in all
! contexts except preformatted text. Within preformatted text,
! HTML interpreting agents should expect to treat any of the
! three common representations of end-of-line as starting
! a new line.

2.16.3 Numeric Character References

! In addition to any mechanism by which characters may be
! represented by the encoding of the HTML document, it is
! possible to explicitly reference the printing characters of
! the ISO-8859-1 character encoding using a numeric character
! reference. See Section
2.17.1 for a list of the characters, their names and
input syntax.

Two reasons for using a numeric character reference:

***************
*** 2292,2311 ****

Numeric character references are represented in an HTML
document as SGML entities whose name is number sign (#)
followed by a numeral from 32-126 and 161-255. The HTML
DTD includes a numeric character for each of the
! printing characters in Latin-1, so that one may
! reference them by number if it is inconvenient to enter
them directly:

the ampersand (&#38;), double quotes (&#34;),
lesser (&#60;) and greater (&#62;) characters

2.16.4 Character Entities

! Many of the Latin alphabet No. 1 set of printing
characters may be represented within the text of an HTML
document by a character entity. See 2.17.2 for a list of
the characters, names, input syntax, and descriptions.
See 5.2.1 for the SGML entity definitions of "Added
Latin 1 for HTML".
--- 2271,2290 ----

Numeric character references are represented in an HTML
document as SGML entities whose name is number sign (#)
followed by a numeral from 32-126 and 161-255. The HTML
DTD includes a numeric character for each of the
! printing characters of the ISO-8859-1 encoding, so that one
! may reference them by number if it is inconvenient to enter
them directly:

the ampersand (&#38;), double quotes (&#34;),
lesser (&#60;) and greater (&#62;) characters

2.16.4 Character Entities

! In addition, many of the Latin alphabet No. 1 set of printing
characters may be represented within the text of an HTML
document by a character entity. See 2.17.2 for a list of
the characters, names, input syntax, and descriptions.
See 5.2.1 for the SGML entity definitions of "Added
Latin 1 for HTML".
***************
*** 2451,2462 ****
thorn &thorn; Small thorn, Icelandic
yuml &yuml; Small y, dieresis or umlaut mark

2.17.3 Numerical Character References

! This list, sorted numerically, is derived from ISO
! 8859/1 8-bit single-byte coded graphic character set:

REFERENCE DESCRIPTION

&#00; - &#08; Unused
&#09; Horizontal tab
--- 2430,2441 ----
thorn &thorn; Small thorn, Icelandic
yuml &yuml; Small y, dieresis or umlaut mark

2.17.3 Numerical Character References

! This list, sorted numerically, is derived from ISO-8859-1
! 8-bit single-byte coded graphic character set:

REFERENCE DESCRIPTION

&#00; - &#08; Unused
&#09; Horizontal tab
***************
*** 2711,2721 ****
- Line boundaries within the text are rendered as a
move to the beginning of the next line, except for one
immediately following a start tag or immediately
preceding an end tag.

! - The ASCII horizontal tab character must be
interpreted as the smallest positive nonzero number of
spaces which will leave the number of characters so far
on the line as a multiple of 8. Its use is not
recommended.

--- 2690,2700 ----
- Line boundaries within the text are rendered as a
move to the beginning of the next line, except for one
immediately following a start tag or immediately
preceding an end tag.

! - The horizontal tab character must be
interpreted as the smallest positive nonzero number of
spaces which will leave the number of characters so far
on the line as a multiple of 8. Its use is not
recommended.

***************
*** 2739,2749 ****
bold italic. This element is not widely supported.

4.2.2 Special Characters

To indicate special characters, HTML uses entity or
! numeric representations. Two additional character
presentations are proposed:

CHARACTER REPRESENTATION

Non-breaking space &nbsp;
--- 2718,2728 ----
bold italic. This element is not widely supported.

4.2.2 Special Characters

To indicate special characters, HTML uses entity or
! numeric representations. Additional character
presentations are proposed:

CHARACTER REPRESENTATION

Non-breaking space &nbsp;