Re: Displaying Control Characters

Daniel W. Connolly (connolly@beach.w3.org)
Tue, 18 Jul 95 22:31:13 EDT

In message <9507181655.ZM4531@dmg.west.ora.com>, "Terry Allen" writes:
>| Terry writes:
>|
>| >Anything not in the document charset should be ignored (try parsing
^^^^^^^^ any character, that is. Some octets --or sequences of
octets-- can't be meaningfully interpreted as any character, in some
character encoding schemes.
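
To make that concrete, here is a minimal sketch in Python. The choice of
UTF-8 and Latin-1 as encoding schemes, the sample octets, and the helper
name is_decodable are all assumptions made for illustration; none of them
come from this thread.

```python
def is_decodable(octets: bytes, encoding: str) -> bool:
    """Return True if the octets can be read as characters in `encoding`."""
    try:
        octets.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample inputs, chosen only to illustrate the point.
samples = {
    "plain ASCII": b"abc",
    "lone octet 0xFF": b"\xff",
    "truncated two-octet sequence": b"\xc3",
}

results = {label: is_decodable(data, "utf-8") for label, data in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'characters' if ok else 'no character interpretation'}")
```

The same octets can be fine under another scheme: Latin-1 assigns a
character to every octet, so is_decodable(b"\xff", "latin-1") is true.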

>| >with sgmls and see what the SGML parsing model produces).
^^^
more like "an SGML parsing model." There are a number of degrees
of freedom in the SGML spec which must be bound to concrete choices
in any implementation. sgmls makes one set of choices; web browsers
may make others without violating the SGML spec.

>| I think an error should be indicated in some manner.

Yea, verily! I hope that HTML user agents change in this direction:
report errors; don't let them slide.

>I have to deplore attempts to give HTML a different parsing model
>than SGML, because I want to use SGML tools to model the behavior
>of user agents in general.

I agree in principle, but let's be careful about exactly which
language idioms and browser behaviors are:

* objectionable, though conforming to the SGML and HTML specs
(e.g. using <dd> to indent paragraphs)
* in violation of the recommendations of the HTML spec, but conforming
to the HTML and SGML specs
(e.g. marked sections, internal declaration subset)
* in violation of the HTML spec, but conforming to SGML
(e.g. <input type=text> with no NAME attribute)
* in violation of the HTML spec, orthogonal to SGML
(e.g. octet 161, 162, ... in a text/html body)
* in violation of the spirit of SGML, i.e. the conventional
wisdom in document management, while still
conforming to SGML
(e.g. <b>, <i>)
* in violation of SGML
(e.g. the way most browsers handle single
quotes and >'s, a la <img src='foo' alt=">gotcha!<"> )
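
For comparison, a parser that tracks attribute quoting reads that last
example as one tag with two attributes. A minimal sketch using Python's
standard html.parser module, which is a stand-in chosen for illustration
and not an SGML parser; the class name TagCollector is hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record each start tag together with its parsed attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


collector = TagCollector()
# The example from the list above: a single-quoted value, then a
# double-quoted value that contains both > and <.
collector.feed("<img src='foo' alt=\">gotcha!<\">")
collector.close()

for tag, attrs in collector.tags:
    print(tag, attrs)
```

A browser that ends the tag at the first > instead would cut the alt
value short and treat the rest as text.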

Disregarding octets in a text/html body can be construed as an
error handling technique. So can aborting with an error in that case.
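
The two techniques can be sketched side by side. This is only an
illustration in Python: treating the body as US-ASCII is an assumption
made for the example, and the sample body is made up.

```python
# Hypothetical body: ASCII text with octets 161 and 162 embedded in it.
body = b"price: \xa1\xa2 100"

# Technique 1: disregard any octet outside the assumed character set.
lenient = body.decode("ascii", errors="ignore")

# Technique 2: abort with an error at the first such octet.
try:
    body.decode("ascii", errors="strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

print(repr(lenient))
if strict_error is not None:
    bad_octet = strict_error.object[strict_error.start]
    print(f"error: octet {bad_octet} at offset {strict_error.start}")
```

The first technique silently loses data; the second tells the author
where the problem is.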

I don't have the SGML spec handy, but I gather it specifies behavior
of characters that don't appear in the document character set (though
I have a hard time understanding what it means for a character to
be in an SGML document and not in the document character set.)

Daniel W. Connolly "We believe in the interconnectedness of all things"
Research Associate, MIT/W3C PGP: EDF8 A8E4 F3BB 0F3C FD1B 7BE0 716C FF21
<connolly@w3.org> http://www.w3.org/hypertext/WWW/People/Connolly