Comment about draft-ietf-html-spec-01.txt

Kari E. Hurtta (
Fri, 17 Feb 95 15:15:22 EST

Comments about draft-ietf-html-spec-01.txt:

| 1.1.1 Document Structure Elements

| Body

| Example of Document Structure Elements
| <HTML>
| <HEAD>
| <TITLE>The Document's Title</TITLE>
| </HEAD>
| <BODY>
| The document's text.
| </BODY>

The closing </HTML> tag is missing. I think that in this example it should not
be omitted.


About the charset parameter in Content-type: text/html; chars=

| 1.1.8 Character Data in HTML

| HTML documents are encoded in some character encodi=
| the character encoding may be specified, for exampl=
| by the "charset" parameter associated with the "tex=
| media type.
| Independent of the character encoding used,
| HTML also allows references to any of the ISO Latin=
| alphabet, using the names in the table ISO Latin-1
| Character Representations, which is derived from IS=
| Standard 8879:1986//ENTITIES Added Latin 1//EN. For
| details, see 2.17.2.

| 2.4 HTML as an Internet Media Type

| Charset

| The charset parameter (as defined in section 7.1.1 =
| RFC 1521) may be used with the text/html to specify
| the encoding used to represent the HTML document as
| a sequence of bytes. Normally, text/* media types
| specify a default value of US-ASCII for the charset
| parameter. However, for text/html, if the byte stre=
| contains data that is not in the 7-bit US-ASCII set, the
| HTML interpreting agent should assume a default charset of
| ISO-8859-1.

| When an HTML document is encoded using US-ASCII,
| the mechanisms of numeric character references (see
| Section 2.16.2) and character entity references (se=
| Section 2.16.3) may be used to encode additional ch=
| from ISO-8859-1.

| Other values for the charset parameter are not defi=
| in this specification, but may be specified in futu=
| levels or versions of HTML.

| It is envisioned that HTML will use the charset par=
| to allow support for non-Latin characters such as
| Greek, Arabic, Hebrew, Japanese, rather than relying on
| any SGML mechanism for doing so.
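
The default rule quoted from section 2.4 can be sketched as follows. This is
only my reading of the draft text, not a definitive implementation: the
function name effective_charset and its signature are hypothetical, and the
handling of an explicitly declared charset is an assumption, since the draft
defines no values beyond US-ASCII and ISO-8859-1.

```python
from typing import Optional


def effective_charset(declared: Optional[str], body: bytes) -> str:
    """Pick the charset to decode a text/html body with.

    Hypothetical helper: name and signature are not taken from the draft.
    """
    if declared:
        # Assumption: an explicit charset parameter is used as given.
        # The draft does not say how other values are to be handled.
        return declared.strip().lower()
    # No charset parameter: any octet above 0x7F means the data is not
    # 7-bit US-ASCII, so assume ISO-8859-1 as the draft suggests.
    if any(octet > 0x7F for octet in body):
        return "iso-8859-1"
    return "us-ascii"


print(effective_charset(None, b"<TITLE>plain</TITLE>"))
print(effective_charset(None, b"El\xe4m\xe4"))
```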

This document doesn't specify what to do when the charset is not US-ASCII or
ISO-8859-1. I think that two issues should be resolved:


Are HTML tags interpreted as US-ASCII or ISO-8859-1 even when the charset isn't
a superset of US-ASCII? I think that they should be. Compare what RFC 1563 says
about text/enriched:

! Non-ASCII character sets
! If the character set specified by the charset parameter on the
! Content-type line is anything other than "US-ASCII", this means that
! the text being described by text/enriched formatting commands is in a
! non-ASCII character set. However, the commands themselves are still
! the same ASCII commands that are defined in this document. This
! creates an ambiguity only with reference to the "<" character, the
! octet with numeric value 60. In single byte character sets, such as
! the ISO-8859 family, this is not a problem; the octet 60 can be
! quoted by including it twice, just as for ASCII. The problem is more
! complicated, however, in the case of multi-byte character sets, where
! the octet 60 might appear at any point in the byte sequence for any
! of several characters.
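
To illustrate the multi-byte problem the RFC describes, here is a small sketch
that checks whether text containing no "<" still produces octet 60 once it is
encoded. The helper name contains_octet_60 is hypothetical, and ISO-2022-JP is
just one example charset I picked; the RFC passage does not name one.

```python
def contains_octet_60(text: str, charset: str) -> bool:
    """Return True if encoding text in charset yields an octet of value 60."""
    encoded = text.encode(charset)
    return 60 in encoded


# HIRAGANA LETTER ZE: the text itself contains no "<" character.
ZE = "\u305c"

# In ISO-2022-JP this character is the octet pair 0x24 0x3C, so a scanner
# that looks for octet 60 without tracking the shift state sees a "<".
print(contains_octet_60(ZE, "iso2022_jp"))

# In a single byte set such as ISO-8859-1, octet 60 only ever means "<".
print(contains_octet_60("caf\u00e9", "iso8859_1"))
```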

Both text/enriched and text/html are the same kind of markup language for MIME,
so I think that they should behave the same way in this respect.


How are numeric character references interpreted when the charset parameter
is not ISO-8859-1 (or US-ASCII)? I think that they should still be interpreted
according to ISO-8859-1.
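
A minimal sketch of what I am proposing, assuming the body has already been
decoded with whatever charset was declared (ISO-8859-7 here, chosen only as an
example). The function name expand_numeric_refs is hypothetical. The point is
that the number in the reference is looked up in ISO-8859-1, not in the
declared charset.

```python
import re

_NUMERIC_REF = re.compile(r"&#(\d+);")


def expand_numeric_refs(text: str) -> str:
    """Expand &#NNN; references as ISO-8859-1 codes, whatever the charset."""

    def replace(match: "re.Match[str]") -> str:
        code = int(match.group(1))
        if code > 255:
            # Outside ISO-8859-1: leave the reference untouched.
            return match.group(0)
        return bytes([code]).decode("iso8859_1")

    return _NUMERIC_REF.sub(replace, text)


# Octet 0xE9 is Greek small iota in ISO-8859-7, but &#233; should still
# give the Latin-1 character 233 (e with acute accent).
greek_body = b"\xe9 &#233;"
decoded = greek_body.decode("iso8859_7")
print(expand_numeric_refs(decoded))
```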


If we say that numeric character references are interpreted according to the
charset named in the charset parameter, we get a conflict when
charset=US-ASCII: this document says that they should be interpreted
according to Latin-1 (and gives a table for them).

It would then also conflict with this text:

| 2.16.3 Numeric Character References
| In addition to any mechanism by which characters may be
| represented by the encoding of the HTML document, it is
| possible to explicitly reference the printing characters of
| the ISO-8859-1 character encoding using a numeric c=
| reference. See Section
| 2.17.1 for a list of the characters, their names an=
| input syntax.


| 2.17.3 Numerical Character References

| &#127; - &#160; Unused
| &#161; Inverted exclamation

| &#172; Not sign
| &#173; Soft hyphen
| &#174; Registered trademark

160 isn't unused. It is the non-breaking space. Either the non-breaking space
should be in the table, or the soft hyphen should also be omitted from the
table.
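
This can be checked against the ISO-8859-1 code positions. The sketch below
uses a hypothetical helper latin1_name; the label printed for the unnamed
positions is my own placeholder, not wording from the draft.

```python
import unicodedata


def latin1_name(code: int) -> str:
    """Return the character name for an ISO-8859-1 code position."""
    char = bytes([code]).decode("iso8859_1")
    # In the range 0-255 only the control characters have no name, so the
    # default value below is only ever used for those positions.
    return unicodedata.name(char, "(control character)")


for code in (127, 159, 160, 161, 173):
    print(code, latin1_name(code))
```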

Compare the earlier text:

| 2.16 Character Data

| No. 1, or simply Latin-1. Latin-1 includes characters from most
| Western European languages, as well as a number of con=
| characters. Latin-1 also includes a non-breaking space, a soft
| hyphen indicator, 93 graphical characters, 8 unassigne=
| characters, and 25 control characters.
| Because non-breaking space and soft hyphen indicator a=
| not recognized and interpreted by all HTML user agents=
| their use is discouraged.

So either both the non-breaking space and the soft hyphen should be in the
numeric character reference table, or both should be omitted.

- Kari E. Hurtta / Elämä on
Kari.Hurtta@Fmi.FI    puh. (90) 1929 658