Comment about draft-ietf-html-spec-01.txt

Kari E. Hurtta (Kari.Hurtta@fmi.fi)
Fri, 17 Feb 95 15:15:22 EST

Comments about draft-ietf-html-spec-01.txt:

#1:
| 1.1.1 Document Structure Elements

| Body

| Example of Document Structure Elements
|
| <HTML>
| <HEAD>
| <TITLE>The Document's Title</TITLE>
| </HEAD>
| <BODY>
| The document's text.
| </BODY>
|

The closing </HTML> tag is missing. I think it should not be omitted
in this example.
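A minimal sketch of how the missing end tag could be detected
mechanically. It uses Python's standard html.parser module; the class
OpenTagTracker and the function unclosed_tags are hypothetical names
chosen for illustration, and the check is deliberately naive (it only
tracks simple start/end tag nesting).

```python
from html.parser import HTMLParser


class OpenTagTracker(HTMLParser):
    """Keeps a stack of start tags that have not yet been closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the end tag matches the innermost open element.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()


def unclosed_tags(document):
    """Return the (lowercased) tags still open at the end of the document."""
    tracker = OpenTagTracker()
    tracker.feed(document)
    tracker.close()
    return tracker.open_tags


DRAFT_EXAMPLE = (
    "<HTML>\n<HEAD>\n<TITLE>The Document's Title</TITLE>\n</HEAD>\n"
    "<BODY>\nThe document's text.\n</BODY>\n"
)

print(unclosed_tags(DRAFT_EXAMPLE))              # the draft's example
print(unclosed_tags(DRAFT_EXAMPLE + "</HTML>"))  # with the end tag added
```

With the example as printed in the draft, the html element is reported
as still open; appending </HTML> leaves nothing open.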

#2:

About the charset parameter in
        Content-type: text/html; charset=something

| 1.1.8 Character Data in HTML

| HTML documents are encoded in some character encoding;
| the character encoding may be specified, for example,
| by the "charset" parameter associated with the "text/html"
| media type.

| Independent of the character encoding used,
| HTML also allows references to any of the ISO Latin-1
| alphabet, using the names in the table ISO Latin-1
| Character Representations, which is derived from ISO
| Standard 8879:1986//ENTITIES Added Latin 1//EN. For
| details, see 2.17.2.

| 2.4 HTML as an Internet Media Type

| Charset

| The charset parameter (as defined in section 7.1.1 of
| RFC 1521) may be used with the text/html to specify
| the encoding used to represent the HTML document as
| a sequence of bytes. Normally, text/* media types
| specify a default value of US-ASCII for the charset
| parameter. However, for text/html, if the byte stream
| contains data that is not in the 7-bit US-ASCII set, the
| HTML interpreting agent should assume a default charset of
| ISO-8859-1.

| When an HTML document is encoded using US-ASCII,
| the mechanisms of numeric character references (see
| Section 2.16.2) and character entity references (see
| Section 2.16.3) may be used to encode additional characters
| from ISO-8859-1.

| Other values for the charset parameter are not defined
| in this specification, but may be specified in future
| levels or versions of HTML.

| It is envisioned that HTML will use the charset parameter
| to allow support for non-Latin characters such as
| Greek, Arabic, Hebrew, Japanese, rather than relying on
| any SGML mechanism for doing so.

This document does not specify what to do when the charset is neither
US-ASCII nor ISO-8859-1. I think two issues should be resolved:

#2.1:

Are HTML tags interpreted as US-ASCII or ISO-8859-1 even when the
charset is not a superset of US-ASCII? I think they should be. Compare
what RFC 1563 says about text/enriched:

! Non-ASCII character sets
!
! If the character set specified by the charset parameter on the
! Content-type line is anything other than "US-ASCII", this means that
! the text being described by text/enriched formatting commands is in a
! non-ASCII character set. However, the commands themselves are still
! the same ASCII commands that are defined in this document. This
! creates an ambiguity only with reference to the "<" character, the
! octet with numeric value 60. In single byte character sets, such as
! the ISO-8859 family, this is not a problem; the octet 60 can be
! quoted by including it twice, just as for ASCII. The problem is more
! complicated, however, in the case of multi-byte character sets, where
! the octet 60 might appear at any point in the byte sequence for any
! of several characters.

Both text/enriched and text/html are the same kind of markup language
for MIME, so I think they should behave the same way in this respect.
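The multi-byte problem that RFC 1563 describes can be illustrated with
a short sketch. The helper name encoded_octets_contain_lt is
hypothetical, and ISO-2022-JP is used only as one example of a
multi-byte charset; neither comes from the draft.

```python
def encoded_octets_contain_lt(text, charset):
    """Return True if encoding `text` in `charset` yields an octet 60 ("<")."""
    return 0x3C in text.encode(charset)


# Single-byte charset: U+00E9 (e with acute) is the single octet 233,
# so no octet 60 appears.
print(encoded_octets_contain_lt("\u00e9", "iso-8859-1"))

# Multi-byte charset: HIRAGANA LETTER ZE (U+305C) contains no "<"
# character, yet its two-octet ISO-2022-JP code includes the octet 60.
print(encoded_octets_contain_lt("\u305c", "iso2022_jp"))
```

A user agent that scans the raw octets for "<" without regard to the
charset would therefore see a spurious tag opener in the second case.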

#2.2:

How are numeric character references interpreted when the charset
parameter is not ISO-8859-1 (or US-ASCII)? I think they should still be
interpreted according to ISO-8859-1.

Reasons?

If we say that numeric character references are interpreted according
to the charset named in the charset parameter, we get a conflict when
charset=US-ASCII: this document says that they should be interpreted
according to Latin-1 (and gives a table for them).

It would also conflict with this text:

| 2.16.3 Numeric Character References
|
| In addition to any mechanism by which characters may be
| represented by the encoding of the HTML document, it is
| possible to explicitly reference the printing characters of
| the ISO-8859-1 character encoding using a numeric character
| reference. See Section
| 2.17.1 for a list of the characters, their names and
| input syntax.
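The interpretation proposed in #2.2 can be sketched as follows: a
numeric character reference always names an ISO-8859-1 code,
whatever the document's charset parameter says. The function name
resolve_numeric_refs and the choice to leave codes above 255 untouched
are assumptions made for illustration, not part of the draft.

```python
import re

# Matches decimal numeric character references such as "&#233;".
_NUMERIC_REF = re.compile(r"&#([0-9]+);")


def resolve_numeric_refs(text):
    """Replace &#N; with the ISO-8859-1 character whose code is N."""

    def replace(match):
        code = int(match.group(1))
        if code > 255:
            # Outside ISO-8859-1: leave the reference as written.
            return match.group(0)
        return bytes([code]).decode("iso-8859-1")

    return _NUMERIC_REF.sub(replace, text)


print(resolve_numeric_refs("caf&#233;"))
print(resolve_numeric_refs("&#161;Hola!"))
```

Note that the function takes no charset argument at all: the result
does not depend on how the rest of the document is encoded.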

#3:

| 2.17.3 Numerical Character References

| &#127; - &#160; Unused
|
| &#161; Inverted exclamation

| &#172; Not sign
| &#173; Soft hyphen
| &#174; Registered trademark

160 is not unused. It is the non-breaking space. Either the
non-breaking space should be in the table, or the soft hyphen should
also be omitted from the table.

Compare the earlier text:

| 2.16 Character Data

| No. 1, or simply Latin-1. Latin-1 includes characters from most
| Western European languages, as well as a number of control
| characters. Latin-1 also includes a non-breaking space, a soft
| hyphen indicator, 93 graphical characters, 8 unassigned
| characters, and 25 control characters.
|
| Because non-breaking space and soft hyphen indicator are
| not recognized and interpreted by all HTML user agents,
| their use is discouraged.

So either both the non-breaking space and the soft hyphen should be in
the numeric character reference table, or both should be omitted.
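The assignments of codes 160 and 173 can be checked with a short
sketch. It relies on the fact that the first 256 Unicode code points
coincide with ISO-8859-1, so the names printed are Unicode character
names; the helper latin1_name and its placeholder string are
hypothetical.

```python
import unicodedata


def latin1_name(code):
    """Return the character name for an ISO-8859-1 code (0-255)."""
    char = bytes([code]).decode("iso-8859-1")
    # Control characters have no name; return a placeholder for them.
    return unicodedata.name(char, "<control or unnamed>")


for code in (127, 160, 161, 172, 173, 174):
    print(code, latin1_name(code))
```

Code 160 is reported as NO-BREAK SPACE and code 173 as SOFT HYPHEN,
while 127 has no name because it is a control character.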

-- 
- Kari E. Hurtta                 /  Elämä on monimutkaista
Kari.Hurtta@Fmi.FI                  puh. (90) 1929 658
{hurtta,root,Postmaster}@dionysos.fmi.fi