Re: New draft: charset, conformance cleanup

Francois Yergeau (yergeau@alis.ca)
Tue, 4 Apr 95 09:27:00 EDT

>Date: Fri, 31 Mar 95 15:08:36 EST
>From: connolly@w3.org (Dan Connolly)
>
>2.1 SGML Documents
>
> [5 paragraphs down]
>
> By application convention, the SGML declaration is the one given in
> section 13.2. Hence the document character set is ISO-8859-1(@@)
> and the markup "&#42;" represents an asterisk character.

I'm not sure I agree with that. If it means "by default", it's OK,
but otherwise it excludes anything but ISO-8859-1, which is
unacceptable.

>3.1 text/html media type
>
> Charset
> The charset parameter (as defined in section 7.1.1 of RFC
> 1521 [4]) may be given to specify the encoding used to represent
> the HTML document as a sequence of octets. The default value is
> out of scope of this specification; but for example, it is
> US-ASCII in the context of MIME mail, and ISO-8859-1 in the
> context of HTTP.

Is it a good thing to have different defaults depending on the mode of
transmission? What if I store an HTML doc. on disk and forget how I
got it? I think the default for HTML has been ISO-8859-1 since the
beginning, and that the spec should simply say so. It can be
MIME-encoded in mail if necessary.
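
To make the concern concrete, here is a minimal Python sketch of the
two policies. The function names and the "mail"/"http" transport labels
are my own illustration, not anything from the draft. It assumes the
charset parameter arrives in a Content-Type header value.

```python
from email.message import Message

# Transport-dependent defaults, as stated in the quoted 3.1 draft text.
# The keys "mail" and "http" are illustrative labels only.
TRANSPORT_DEFAULTS = {"mail": "US-ASCII", "http": "ISO-8859-1"}


def charset_param(content_type):
    """Return the explicit charset parameter (upper-cased), or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")
    # Plain parameters come back as str; anything else is treated as absent.
    return value.upper() if isinstance(value, str) else None


def draft_charset(content_type, transport=None):
    """Policy in the draft: the default depends on how the document arrived."""
    explicit = charset_param(content_type)
    if explicit:
        return explicit
    # A document read back from disk has no transport, so no default at all.
    return TRANSPORT_DEFAULTS.get(transport)


def fixed_default_charset(content_type):
    """Policy suggested here: one default, regardless of transport."""
    return charset_param(content_type) or "ISO-8859-1"
```

Under the draft policy, draft_charset("text/html") with no known
transport returns None, which is the "forgot how I got it" case; the
fixed-default policy always yields an answer.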

>3.2 HTML Document Representation
>
> A MIME entity with a content type of "text/html" represents an HTML
> document, consisting of a single text entity. The charset parameter
> (whether implicit or explicit) identifies a character encoding. The
> text entity consists of the characters determined by this character
> encoding and the octets of the body of the MIME entity.
>
> The SGML declaration of the document is a function of the charset
> parameter. If the charset parameter is US-ASCII or ISO-8859-1, the
> SGML declaration in section 13@@ applies. Other charset parameter
^^^^^^^^^^^^^^^^^^^^^^^
> values are reserved for future use.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I can't believe we're coming back to such language. It's too late to
reserve other charset values: they're in wide use already, and have
been for a while.

This sentence needs to go, and be replaced by the former language that
said that for other charsets, the SGML declaration was to be minimally modified.
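
As a small illustration of why the charset value cannot simply be
ignored or reserved, the following Python sketch decodes the same octet
under two charsets. The helper name text_entity is hypothetical; it
only mirrors the quoted 3.2 wording that the text entity is determined
by the character encoding and the octets of the body.

```python
def text_entity(octets, charset):
    """Characters determined by the charset and the octets of the body."""
    return octets.decode(charset)


body = b"\xb1"  # a single octet, value 0xB1

latin1 = text_entity(body, "ISO-8859-1")  # U+00B1 PLUS-MINUS SIGN
latin2 = text_entity(body, "ISO-8859-2")  # U+0105 LATIN SMALL LETTER A WITH OGONEK

# Same octets, different characters: the charset label carries real information.
assert latin1 != latin2
```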

> NOTE: A generalized convention for mapping charset parameter values
> to SGML declarations is expected to be specified in a future
> version of this specification.

Good.

BTW, there are two subsections numbered 3.2.1.

>6.1 The ISO Latin 1 Character Repertoire
>
> Conforming HTML user agents are required to support the US-ASCII
> [10] or ISO-8859-1 [11] character encodings, and the @@fullname ISO
> Latin 1 document character set.

Are *minimally* conformant UAs required to support Latin-1, or just
any single charset?

Perhaps charset requirements should be spelled out in section 1.3
(Terminology) for "conforming HTML user agent" and "minimally
conformant...". Surely we don't want a conforming UA to be forced to
support all charsets.

>12.3 SGML Declaration for HTML
>
> This is the SGML Declaration for HyperText Markup Language (HTML)
> as used by the World Wide Web (WWW) application:

...for documents encoded in ISO-8859-1. Documents encoded in other
character sets should use an SGML declaration as close as possible to
this one, in order to preserve SGML conformance.

-- 
François Yergeau <yergeau@alis.ca>