Last call: Intro, SGML, MIME sections

Dan Connolly (connolly@w3.org)
Thu, 4 May 95 08:32:52 EDT

I've finally got some reasonable tools in place, and my latest
edits integrated into the HTML 2.0 spec.

I've updated the W3C HTML stuff[1] to point to the latest
hypertext draft[2], plus postscript and text versions, and
a tar file of the whole shebang, also available via ftp[3].

[1] http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html

[2] Hypertext Markup Language - 2.0 - Table of Contents
Thu May 4 11:46:54 1995
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_toc.html

[3] ftp://ftp.w3.org/pub/www/doc/html-spec-19950504.tar.gz

I'd like to conduct a somewhat focused, section by section review.

For the first part, I'll take comments on general stuff:

* IETF formatting violations (I know I've got some)
in the text and postscript.

* Hypertext structure. How do you like the glossary?
Are the links sprinkled throughout too much noise?

But I'm most concerned with the first three sections right now:

Introduction
    Scope
    Conformance
        Documents
        User Agents
HTML as an Application of SGML
    SGML Documents
    HTML Lexical Syntax
        Data Characters
        Tags
        Names
        Attributes
        Comments
        Example HTML Document
HTML as an Internet Media Type
    text/html media type
    HTML Document Representation
        Undeclared Markup Error Handling
        Conventional Representation of Newlines
    Security Considerations

To facilitate blow-by-blow responses, diffs, and the like, here's the
text of the first three sections. A light 17 pages of reading :-}

I'm really struggling with what to say about levels and conformance.
What does it really mean for a language construct to be level 0 vs
level 1 vs level 2? What is useful about a "level 0" document? What
about a "level 1 user agent"? I suppose the distinction about forms
support is worth making: the level 1 features just disappear on old
browsers, but forms stuff appears broken. For tables and math and
format negotiation, I suppose it's worth keeping levels around.

I did add some discussion about HTML.Recommended and HTML.Deprecated,
since I got a certain amount of feedback about that stuff.

I'm also struggling with nroff... :-{ Folks with nifty SGML
typesetting systems are encouraged to grab the sgml source
(it's in a LaTeX-like DTD called snafu, by Gary Houston)
and make nice internet-draft style text versions for me.
Or perhaps you can take the TeXinfo version and do better
than texi2roff does...

HTML Working Group                                        T. Berners-Lee
INTERNET-DRAFT                                               D. Connolly
<draft-ietf-html-spec-03.txt>                                    MIT/W3C
Expires in six months                                        May 4, 1995

Hypertext Markup Language 2.0

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents at
any time. It is inappropriate to use Internet-Drafts as reference mate-
rial or to cite them other than as ``work in progress.''

To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow Direc-
tories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au
(Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).

Distribution of this document is unlimited. Please send comments
to the HTML working group (HTML-WG) of the Internet Engineering Task
Force (IETF) at html-wg@oclc.org. Discussions of the group are archived
at http://www.acl.lanl.gov/HTML_WG/archives.html.

Abstract

The Hypertext Markup Language (HTML) is a simple markup language
used to create hypertext documents that are platform independent. HTML
documents are SGML documents with generic semantics that are appropriate
for representing information from a wide range of domains. HTML markup
can represent hypertext news, mail, documentation, and hypermedia; menus
of options; database query results; simple structured documents with in-
lined graphics; and hypertext views of existing bodies of information.

HTML has been in use by the World Wide Web (WWW) global information
initiative since 1990. This specification roughly corresponds to the
capabilities of HTML in common use prior to June 1994. HTML is an appli-
cation of ISO Standard 8879:1986 Information Processing Text and Office
Systems; Standard Generalized Markup Language (SGML).

Berners-Lee, Connolly FORMFEED[Page 1]

INTERNET DRAFT May 1995

The "text/html; version=2.0" Internet Media Type (RFC 1590)
and MIME Content Type (RFC 1521) is defined by this specification.

1. Introduction

The HyperText Markup Language (HTML) is a simple data
format used to create hypertext documents that are portable from one
platform to another. HTML documents are SGML documents with generic
semantics that are appropriate for representing information from a wide
range of domains.

1.1. Scope

HTML has been in use by the World-Wide Web (WWW) global
information initiative since 1990. This specification corresponds to
the capabilities of HTML in common use prior to June 1994 and referred
to as ``HTML 2.0''.

HTML is an application of ISO Standard 8879:1986 Information Pro-
cessing Text and Office Systems; Standard Generalized Markup Language
(SGML). The HTML Document Type Definition (DTD) is a formal definition
of the HTML syntax in terms of SGML.

This specification also defines HTML as an Internet Media
Type[IMEDIA] and MIME Content Type[MIME] called `text/html', or
`text/html; version=2.0'. As such, it defines the semantics of the HTML
syntax and how that syntax should be interpreted by user agents.

1.2. Conformance

This specification governs the syntax of HTML documents and the
behaviour of HTML user agents.

Documents

A document is a conforming HTML document only if:

o It is a conforming SGML document.

o It conforms to the application conventions in this specification.
For example, the value of the `HREF' attribute of the `A' element
must conform to the URI syntax.

    Note: There are a number of syntactic idioms that are not
    supported or are supported inconsistently in some historical user
    agent implementations. These idioms are called out in notes like
    this throughout this specification. HTML documents should not
    contain these idioms, at least until such time as support for
    them is widely deployed.

The HTML DTD defines a standard HTML document type and several
variations, based on feature test entities:

HTML.Recommended
Certain features of the language are necessary for compatibility
with widespread usage, but they may compromise the structural
integrity of a document. This feature test entity enables a more
prescriptive document type definition that eliminates those
features.

For example, in order to preserve the structure of a document, an
editing user agent may translate HTML documents to the recommended
subset, or it may require that the documents be in the recommended
subset for import.

HTML.Deprecated
Certain features of the language are necessary for compatibility
with earlier versions of the specification, but they tend to be
used and implemented inconsistently, and their use is deprecated.
This feature test entity enables a document type definition that
eliminates these features.

Documents generated by translation software or editing software
should not contain these idioms.

User Agents

An HTML user agent conforms to this specification if:

o It parses the characters of an HTML document into data characters
and markup as per [SGML].

o It behaves identically for documents whose parsed token sequences
are identical.

For example, comments and the whitespace in tags disappear during
tokenization, and hence they do not influence the behaviour of
conforming user agents.

o It allows the user to traverse (or at least attempt to traverse,
resources permitting) all hyperlinks in an HTML document.

o It allows the user to express all form field values specified in an
HTML document and to (attempt to) submit the values as requests to
information services.

    Note: In the interest of robustness and extensibility, there are a
    number of widely deployed conventions for handling non-conforming
    documents. See `Undeclared Markup Error Handling' for details.

@@Levels?

2. HTML as an Application of SGML

HTML is an application of ISO Standard 8879:1986 - Standard Gener-
alized Markup Language (SGML). SGML is a system for defining structured
document types and markup languages to represent instances of those doc-
ument types[SGML]. The public text -- DTD and SGML declaration -- of
the HTML document type definition are provided in `HTML Public Text'.

The term HTML refers to both the document type defined here and the
markup language for representing instances of this document type.

2.1. SGML Documents

An HTML document is an SGML document; that is, a
set of entities, including the document entity, which is the text entity in
which parsing begins. The first production of the SGML grammar sepa-
rates an SGML document into three parts: an SGML declaration, a pro-
logue, and an instance.

For the purposes of this specification, the prologue is a DTD.
This DTD describes another grammar: the start symbol is given in the
doctype declaration; the terminals are data characters and tags, and
the productions are determined by the element declarations. The
instance must conform to the DTD, that is, it must be in the language
defined by this grammar.

The SGML declaration determines the lexicon of the grammar. It
specifies the document character set, which determines a character
repertoire that contains all characters that occur in all text entities
in the document, and the character numbers associated with those charac-
ters.

The SGML declaration also specifies the syntax character set of the
document, and a few other parameters that bind the abstract syntax of
SGML to a concrete syntax. This concrete syntax determines how each
text entity is mapped to a sequence of terminals in the grammar of the
prologue.

For example, consider the following document:

<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<title>Parsing Example</title>
<p>Some text. <em>&#42;wow&#42;</em>

An HTML user agent should use the SGML declaration that is given in
`SGML Declaration for HTML'. It specifies ISO-8859-1 as the document
character set, so that the markup `&#42;' represents an asterisk charac-
ter.

The instance above is regarded as the following sequence of termi-
nals:

1. TITLE start tag

2. data characters: ``Parsing Example''

3. TITLE end tag

4. P start tag

5. data characters ``Some text. ''

6. EM start tag

7. ``*wow*''

8. EM end tag
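
The terminal sequence above can be reproduced with a small sketch. This is not an SGML parser: the `tokenize` function and its regular expression are illustrative assumptions that handle only the document type declaration, plain start and end tags, decimal numeric character references, and data. It also assumes character numbers map directly to Unicode code points, which holds for ISO-8859-1.

```python
import re

# Hypothetical, simplified token pattern (an assumption for illustration,
# not the SGML delimiter recognition rules).
TOKEN = re.compile(
    r"<!DOCTYPE[^>]*>"                    # document type declaration (skipped)
    r"|</([A-Za-z][A-Za-z0-9.-]*)\s*>"    # end tag
    r"|<([A-Za-z][A-Za-z0-9.-]*)[^>]*>"   # start tag (attributes ignored)
    r"|([^<]+)"                           # data characters
)

def tokenize(text):
    """Return a list of (kind, value) terminals for a very simple instance."""
    tokens = []
    for match in TOKEN.finditer(text):
        end_name, start_name, data = match.groups()
        if end_name:
            tokens.append(("end", end_name.upper()))
        elif start_name:
            tokens.append(("start", start_name.upper()))
        elif data and data.strip():
            # Resolve decimal numeric character references such as &#42;
            data = re.sub(r"&#(\d+);?", lambda r: chr(int(r.group(1))), data)
            tokens.append(("data", data))
    return tokens

document = (
    '<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">\n'
    "<title>Parsing Example</title>\n"
    "<p>Some text. <em>&#42;wow&#42;</em>\n"
)
for token in tokenize(document):
    print(token)
```

Element names are folded to upper case because they are not case sensitive; white space between tags is dropped here only to keep the output aligned with the list above.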

The start symbol of the DTD grammar is HTML, and the productions
are given in the public text identified by `-//IETF//DTD HTML 2.0//EN'
(`HTML DTD'). Hence the terminals above parse as:

HTML
|
\-HEAD, BODY
| |
\-TITLE \-P
| |
| \-<P>,"Some text. ",EM
| |
| \-<EM>,"*wow*",</EM>
\-<TITLE>,"Parsing Example",</TITLE>

2.2. HTML Lexical Syntax

The syntax character set for all HTML documents is ISO-646-IRV. A
minimally conforming HTML user agent must support the SGML declaration
in `SGML Declaration for HTML', which specifies ISO Latin 1 (@@full
name) as the document character set; it may support other SGML
declarations, in particular, SGML declarations with other document
character sets.

A complete discussion of SGML parsing, e.g. the mapping of a
sequence of characters to a sequence of tags and data is left to the
SGML standard[SGML]. This section is only a summary.

Data Characters

Any sequence of characters that do not constitute markup
(see ``Delimiter Recognition,'' section @@@ of [SGML]) are mapped
directly to strings of data characters. Some markup also maps to data
character strings. Numeric character references also map to single-
character strings, via the document character set. Each reference to
one of the general entities defined in the HTML DTD also maps to a sin-
gle-character string.

For example,

abc&lt;def => "abc","<","def"
abc&#60;def => "abc","<","def"

Note that the terminating semicolon is only necessary when the
character following the reference would otherwise be recognized as
markup:

abc &lt def => "abc ","<"," def"
abc &#60 def => "abc ","<"," def"

And note that an ampersand is only recognized as markup when it is
followed by a letter or number:

abc & lt def => "abc & lt def"
abc & 60 def => "abc & 60 def"
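
The three sets of examples above can be checked with a rough sketch. It assumes that `&` opens a reference only when followed by a letter, or by `#` and digits, and it knows only four entity names; the `ENTITIES` table, the `REFERENCE` pattern, and `decode_references` are hypothetical names, not part of any DTD or library. The function returns one joined string rather than the separate data character strings shown above.

```python
import re

# Only a few entities, for illustration; the HTML DTD declares more.
# Entity names are case sensitive, so "AMP" is deliberately absent.
ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}

# Assumed recognition rule: "&" + name, or "&#" + digits, optional ";".
REFERENCE = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9.-]*);?")

def decode_references(text):
    """Replace entity and numeric character references with data characters."""
    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            # ISO-8859-1 character numbers coincide with Unicode code points.
            return chr(int(ref[1:]))
        # References to undeclared entities are left as data characters.
        return ENTITIES.get(ref, match.group(0))
    return REFERENCE.sub(replace, text)

print(decode_references("abc&lt;def"))     # terminated reference
print(decode_references("abc &#60 def"))   # semicolon omitted
print(decode_references("abc & lt def"))   # not markup, left unchanged
```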

A useful technique for translating plain text to HTML is to replace
each '<', '&', and '>' by an entity reference or numeric character ref-
erence as follows:

ENTITY NUMERIC
CHARACTER REFERENCE CHAR REF CHARACTER DESCRIPTION
& &amp; &#38; Ampersand
< &lt; &#60; Less than
> &gt; &#62; Greater than
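
As a minimal sketch of this plain-text translation (the function name `escape_text` is an assumption made for illustration), the ampersand has to be replaced first so that the ampersands introduced by the other two replacements are not escaped again:

```python
def escape_text(text):
    """Escape '&', '<' and '>' using the entity references from the table."""
    # '&' must go first; otherwise "&lt;" would become "&amp;lt;".
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))

print(escape_text("if a < b && b > c"))
```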

Tags

Tags delimit elements such as headings, paragraphs, lists, character
highlighting and links. Most HTML elements are identified in a
document as a start tag, which gives the element name and attributes,
followed by the content, followed by the end tag. Start tags are
delimited by `<' and `>'; end tags are delimited by `</' and `>'. An
example is:

<H1>This is a Heading</H1>

Some elements only have a start tag without an end tag. For exam-
ple, to create a line break, you use the `<BR>' tag. Additionally, the
end tags of some other elements, such as Paragraph (`</P>'), List Item
(`</LI>'), Definition Term (`</DT>'), and Definition Description
(`</DD>') elements, may be omitted.

The content of an element is a sequence of data character strings
and nested elements. Some elements, such as anchors, cannot be nested.
Anchors and character highlighting may be put inside other constructs.
See the HTML DTD, `HTML DTD', for full details.

    Note: There are SGML mechanisms, CDATA and RCDATA, to allow most
    `<', `>', and `&' characters to be entered without the use of
    entity references. Because these features tend to be used and
    implemented inconsistently, and because they conflict with
    techniques for reducing HTML to 7 bit ASCII for transport, they
    are not used in this version of the HTML DTD.


Names

A name consists of a letter followed by up to 71 letters, digits,
periods, or hyphens. Element names are not case sensitive, but entity
names are. For example, `<BLOCKQUOTE>', `<BlockQuote>', and
`<blockquote>' are equivalent, whereas `&amp;' is different from
`&AMP;'.

In a start tag, the element name must immediately follow the tag
open delimiter `<'.
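
The name rule can be expressed as a short check. This is a sketch only: `is_name` and `same_element_name` are hypothetical helpers, and "letter" is assumed to mean the ASCII letters A-Z and a-z.

```python
import re

# A letter followed by up to 71 letters, digits, periods, or hyphens
# (72 characters at most).
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,71}")

def is_name(candidate):
    """True if the whole string satisfies the name syntax."""
    return NAME.fullmatch(candidate) is not None

def same_element_name(first, second):
    """Element names compare without regard to case; entity names do not."""
    return first.upper() == second.upper()

print(is_name("BLOCKQUOTE"), is_name("1abc"))
print(same_element_name("BlockQuote", "blockquote"))
```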

Attributes

In a start tag, white space and attributes are allowed
between the element name and the closing delimiter. An attribute typi-
cally consists of an attribute name, an equal sign, and a value, though
some attributes may be just a value. White space is allowed around the
equal sign.

The value of the attribute may be either:

o A string literal, delimited by single quotes or double quotes and
not containing any occurrences of the delimiting character.

o A name token (a sequence of letters, digits, periods, or hyphens)

In this example, img is the element name, `src' is the attribute
name, and `http://host/dir/file.gif' is the attribute value:

<img src="http://host/dir/file.gif">

    Note: The SGML declaration for HTML specifies SHORTTAG YES, which
    means that there are other valid syntaxes for tags, such as NET
    tags, `<EM/.../'; empty start tags, `<>'; and empty end tags,
    `</>'. Until support for these idioms is widely deployed, their
    use is strongly discouraged.

    Note: Some historical implementations consider any occurrence of
    the `>' character to signal the end of a tag. For compatibility
    with such implementations, when `>' appears in an attribute value,
    it should be represented with a numeric character reference, such
    as in: `<IMG SRC="eq1.jpg" alt="a&#62;b">'.

A useful technique for computing an attribute value literal for a
given string is to replace each quote, ampersand, and white space
character by an entity reference or numeric character reference as
follows:

           ENTITY      NUMERIC
CHARACTER  REFERENCE   CHAR REF   CHARACTER DESCRIPTION
TAB                    &#9;       Tab
LF                     &#10;      Line Feed
CR                     &#13;      Carriage Return
SPACE                  &#32;      Space
"          &quot;      &#34;      Quotation mark
&          &amp;       &#38;      Ampersand

For example:

<IMG SRC="image.jpg" alt="First &quot;real&quot; example">

    Note: Some historical implementations allow any character except
    space or `>' in a name token. Attribute values must be quoted only
    if they don't satisfy the syntax for a name token.

Note that the SGML declaration in section 13.3 limits the length of
an attribute value to 1024 characters.

Attributes such as ISMAP and COMPACT may be written using a
minimized syntax. The markup:

<UL COMPACT="compact">

can be written using a minimized syntax:

<UL COMPACT>
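
Returning to attribute value literals: the replacement technique described in this section can be sketched as follows. The function name `attribute_value_literal` and its `escape_space` flag are assumptions made for illustration. Plain spaces are left alone by default, as in the `alt` example in this section, and `>` is written as `&#62;` for the benefit of historical implementations.

```python
# Replacements taken from the table in this section, plus '>' as a
# numeric character reference for historical implementations.
REPLACEMENTS = {
    "&": "&amp;",
    '"': "&quot;",
    ">": "&#62;",
    "\t": "&#9;",
    "\n": "&#10;",
    "\r": "&#13;",
}

def attribute_value_literal(value, escape_space=False):
    """Return a double-quoted attribute value literal for the given string."""
    table = dict(REPLACEMENTS)
    if escape_space:
        table[" "] = "&#32;"
    # Each character is mapped independently, so nothing is escaped twice.
    return '"' + "".join(table.get(ch, ch) for ch in value) + '"'

print("<IMG SRC=%s alt=%s>" % (
    attribute_value_literal("image.jpg"),
    attribute_value_literal('First "real" example'),
))
```

The sketch does not enforce the 1024 character limit on attribute values; a caller would need to check the length of the result separately.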

    Note: Some historical implementations only understand the
    minimized syntax.

Comments

To include comments in an HTML document that will be eliminated in
the mapping to terminals, surround them with `<!--' and `-->'. After
the comment delimiter, all text up to the next occurrence of `-->' is
ignored. Hence comments cannot be nested. White space is allowed
between the closing `--' and `>', but not between the opening `<!' and
`--'.

For example:

<HEAD>
<TITLE>HTML Guide: Recommended Usage</TITLE>
<!-- $Id: HTML.txt,v 1.14 1995/04/26 16:07:52 connolly Exp $ -->
</HEAD>

    Note: Some historical HTML implementations incorrectly consider
    any `>' character to be the termination of a comment.

Example HTML Document

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<!-- Here's a good place to put a comment. -->
<HEAD>
<TITLE>Structural Example</TITLE>
</HEAD><BODY>
<H1>First Header</H1>
<P>This is a paragraph in the example HTML file. Keep in mind
that the title does not appear in the document text, but that
the header (defined by H1) does.</P>
<OL>
<LI>First item in an ordered list.
<LI>Second item in an ordered list.
<UL COMPACT>
<LI> Note that lists can be nested;
<LI> Whitespace may be used to assist in reading the
HTML source.
</UL>
<LI>Third item in an ordered list.
</OL>
<P>This is an additional paragraph. Technically, end tags are
not required for paragraphs, although they are allowed. You can
include character highlighting in a paragraph. <EM>This sentence
of the paragraph is emphasized.</EM> Note that the &lt;/P&gt;
end tag has been omitted.
<P>
<IMG SRC ="triangle.xbm" alt="Warning:">
Be sure to read these <b>bold instructions</b>.
</BODY></HTML>

3. HTML as an Internet Media Type

An HTML user agent allows users to interact with resources which have
HTML representations. At a minimum, it must allow users to examine and
navigate the content of HTML documents. HTML user agents should be
able to preserve all formatting distinctions represented in an HTML
document, and be able to simultaneously present resources referred to
by IMG elements (they may ignore some formatting distinctions or IMG
resources at the request of the user). Conforming HTML user agents
should support form entry and submission.

3.1. text/html media type

This specification defines the Internet Media Type[IMEDIA] (for-
merly referred to as the Content Type[MIME]) called `text/html'. The
following is to be registered with [IANA].

Media Type name
text

Media subtype name
html

Required parameters
none

Optional parameters
version, charset

Encoding considerations
any encoding is allowed

Security considerations
see `Security Considerations'

The optional parameters are defined as follows:

Version
To help avoid future compatibility problems, the version parameter
may be used to give the version number of the specification to
which the document conforms. The version number appears at the
front of this document and within the public identifier of the HTML
DTD. This specification defines version 2.0.

Charset
The charset parameter (as defined in section 7.1.1 of RFC
1521[MIME]) may be given to specify the character encoding scheme
used to represent the HTML document as a sequence of octets.
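
As an illustration of how a receiver might read the two optional parameters, the sketch below uses the `email.message.Message` class from the Python standard library. The helper name `media_type_parameters` and the sample header value are assumptions, not part of this specification.

```python
from email.message import Message

def media_type_parameters(content_type):
    """Return (media type, version, charset) from a Content-Type value."""
    message = Message()
    message["Content-Type"] = content_type
    return (
        message.get_content_type(),     # type/subtype, lower-cased
        message.get_param("version"),   # None when the parameter is absent
        message.get_content_charset(),  # lower-cased, None when absent
    )

print(media_type_parameters("text/html; version=2.0; charset=ISO-8859-1"))
print(media_type_parameters("text/html"))
```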

3.2. HTML Document Representation

A message entity with a content type
of `text/html' represents an HTML document, consisting of a single text
entity. The `charset' parameter (whether implicit or explicit) identi-
fies a character encoding scheme. The text entity consists of the char-
acters determined by this character encoding and the octets of the body
of the message entity.

HTML user agents must support the ISO-8859-1 character encoding
scheme, and hence the US-ASCII character encoding scheme.

    Note: HTML user agents are encouraged to support ISO10646 as a
    document character set, and Unicode-1-1-UTF-8 and Unicode-1-1-UCS-2
    as character encoding schemes. Other encoding schemes such as
    ISO-2022-JP may be supported as well.

Undeclared Markup Error Handling

To facilitate experimentation and interoperability between
implementations of various versions of HTML, the installed base of
HTML user agents supports a superset of the HTML 2.0 language by
reducing it to HTML 2.0: markup in the form of a start tag or end tag
whose generic identifier is not declared is mapped to nothing during
tokenization. Undeclared attributes are treated similarly. The entire
attribute specification of an unknown attribute (i.e., the unknown
attribute and its value, if any) should be ignored. On the other
hand, references to undeclared entities should be treated as data
characters.

For example:

<div class=chapter><h1>foo</h1><p>...</div>
=> <H1>,"foo",</H1>,<P>,"..."
xxx <P ID=z23> yyy
=> "xxx ",<P>," yyy"
Let &alpha; and &beta; be finite sets.
=> "Let &alpha; and &beta; be finite sets."

Support for notifying the user of such errors is encouraged.

Information providers should keep in mind that this convention is
not binding: unspecified behavior may result, as such markup is not
conforming to this specification.
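
The reduction described in this section can be sketched as follows. The `DECLARED` table is a deliberately tiny, assumed subset of element and attribute names chosen to match the examples; it is not the HTML DTD. `reduce_markup` and the two regular expressions are likewise hypothetical and much simpler than SGML tag recognition.

```python
import re

# Assumed subset of declared elements and their declared attributes.
DECLARED = {"H1": set(), "P": set(), "EM": set(), "A": {"HREF", "NAME"}}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)([^>]*)>")
ATTRIBUTE = re.compile(
    r"([A-Za-z][A-Za-z0-9.-]*)\s*=\s*(\"[^\"]*\"|'[^']*'|[A-Za-z0-9.-]+)"
)

def reduce_markup(text):
    """Map undeclared tags and attributes to nothing; keep everything else."""
    tokens, position = [], 0
    for match in TAG.finditer(text):
        if match.start() > position:
            tokens.append(text[position:match.start()])  # data characters
        position = match.end()
        slash, name, rest = match.group(1), match.group(2).upper(), match.group(3)
        if name not in DECLARED:
            continue  # undeclared element: the tag disappears
        if slash:
            tokens.append("</%s>" % name)
            continue
        kept = [(attr.upper(), value)
                for attr, value in ATTRIBUTE.findall(rest)
                if attr.upper() in DECLARED[name]]
        tokens.append("<%s%s>" % (name, "".join(" %s=%s" % pair for pair in kept)))
    if position < len(text):
        tokens.append(text[position:])
    return tokens

print(reduce_markup("<div class=chapter><h1>foo</h1><p>...</div>"))
print(reduce_markup("xxx <P ID=z23> yyy"))
```

References to undeclared entities such as `&alpha;` pass through untouched because the sketch never interprets `&` at all.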

Conventional Representation of Newlines

SGML specifies that a text
entity is a sequence of records, each beginning with a record start
character and ending with a record end character (characters numbered 10
and 13 respectively). (@@cite a section)

MIME specifies that a body of type `text/*' is a sequence of lines,
each terminated by CRLF, that is, octets 13, 10.

In practice, HTML documents are frequently represented and trans-
mitted using an end of line convention that depends on the conventions
of the source of the document; frequently, that representation consists
of CR only, LF only, or a CR LF combination. Hence the decoding of the
octets will often result in a text entity with some missing record start
and record end characters.

Since there is no ambiguity, HTML user agents are encouraged to
infer the missing record start and end characters.

An HTML user agent should treat end of line in any of its varia-
tions as a word space in all contexts except preformatted text. Within
preformatted text, an HTML user agent should expect to treat any of the
three common representations of end-of-line as starting a new line.
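
A minimal sketch of this end-of-line handling follows. The function names are assumptions, and the choice of a single LF as the internal line break is arbitrary. The pattern lists CR LF first so that the pair counts as one line end rather than two.

```python
import re

# CR LF must be tried before CR alone or LF alone.
END_OF_LINE = re.compile(r"\r\n|\r|\n")

def normalize_newlines(text):
    """For preformatted text: every end-of-line variant starts a new line."""
    return END_OF_LINE.sub("\n", text)

def newlines_as_word_spaces(text):
    """Outside preformatted text: every end-of-line variant is a word space."""
    return END_OF_LINE.sub(" ", text)

print(repr(normalize_newlines("a\r\nb\rc\nd")))
print(repr(newlines_as_word_spaces("one\r\ntwo")))
```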

3.3. Security Considerations

Anchors, embedded images, and all other
elements which contain URIs as parameters may cause the URI to be deref-
erenced in response to user input. In this case, the security consider-
ations of the URI specification apply.

Documents may be constructed whose visible contents mislead the
reader to follow a link to unsuitable or offensive material.
