Bytes and Characters [Was: HTML 2.0 comments (Second of two) ]

Daniel W. Connolly (connolly@hal.com)
Wed, 23 Nov 94 16:38:04 EST

In message <199411231857.OAA14615@postman.osf.org>,
"Sandra Martin O'Donnell" writes:
>
>Section 2.6
>What units do values for attributes like MAXLENGTH and SIZE
>use? Are they numbers of bytes? The spec needs to provide that
>information. Actually, I suspect you currently assume these
>attributes are for numbers of characters, but this is incorrect
>because characters are variable (they can consume varying numbers
>of bytes), while bytes are static.
>
>Section 3.3
>The description of an SGML declaration says that names are a
>maximum of 72 characters, but that should be 72 bytes. As noted,
>characters are variable; bytes aren't.

I don't understand the conjecture "characters are variable, while
bytes are static."

Perhaps you mean, e.g., that the byte length of the UTF-8
encoding of a string doesn't vary linearly with the number of
characters in the string. That doesn't make it any less precise to
specify lengths in characters.
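
To make that concrete, here is a minimal Python sketch. The language,
the helper name lengths(), and the sample strings are all illustrative
assumptions, not anything taken from the HTML or SGML specs. It shows
that the character count of a string is well defined even though the
UTF-8 byte count per character varies:

```python
# Illustrative sketch only: lengths() and the sample strings are
# hypothetical, chosen to show 1-, 2-, and 3-byte UTF-8 characters.

def lengths(s):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(s), len(s.encode("utf-8"))

samples = [
    "abc",           # three ASCII characters, 1 byte each
    "n\u00e9",       # 'n' plus e-acute (2 bytes in UTF-8)
    "\u65e5\u672c",  # two CJK characters (3 bytes each in UTF-8)
]

for s in samples:
    chars, octets = lengths(s)
    print(f"{s!r}: {chars} characters, {octets} bytes in UTF-8")
```

A limit such as MAXLENGTH stated in characters means the same thing
for all three strings, whatever the encoding does to the byte count.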

This is an interesting issue, and the spec doesn't really make it
clear. HTML is two abstractions at once: an SGML application, defined
in terms of characters, and a MIME content type, defined in terms of
bytes.

The link is the assumed/missing/controversial "charset" parameter,
which specifies how you take a MIME body of type text/html, that is, a
sequence of bytes, and translate it into an SGML entity, that is, a
sequence of characters.

In HTML 2.0, the charset parameter is (implicitly) "iso-latin-1" which
has a well-defined meaning in both the MIME and SGML camps.
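
As a rough sketch of that bytes-to-characters step, assuming Python
and its built-in codecs: the function name body_to_entity() is
hypothetical, and "iso-8859-1" is Python's codec name for what is
called iso-latin-1 here. The same bytes give a different character
sequence, and a different character count, depending on the charset:

```python
# Illustrative sketch only: body_to_entity() is a hypothetical helper,
# not part of any MIME or SGML library.

def body_to_entity(body: bytes, charset: str = "iso-8859-1") -> str:
    """Turn a MIME body (bytes) into an SGML entity (characters)."""
    return body.decode(charset)

latin1_body = b"caf\xe9"      # 4 bytes, e-acute as one Latin-1 byte
utf8_body = b"caf\xc3\xa9"    # 5 bytes, e-acute as two UTF-8 bytes

# Decoded with the matching charset, both bodies are the same 4 characters.
assert body_to_entity(latin1_body) == body_to_entity(utf8_body, "utf-8")

# Decoded with the wrong charset, the UTF-8 body becomes 5 characters.
print(len(body_to_entity(utf8_body)))
```

Without an agreed charset there is no single answer to how many
characters a given body contains.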

The "HTML and MIME" and/or "HTML and SGML" sections should make this
clear, I suppose.

If I had my druthers, though, we would cite the MIME and SGML specs as
normative references, provide the DTD and the MIME Content-Type
registration info, and be done with it. These terms are defined quite
nicely in the respective documents. It's really painful to reproduce the
SGML specification and the MIME specification in this HTML document.

Call me a minimalist.

Dan