Re: HTML Display - Text Formatting, Background, and Graphics

James D Mason (MASONJD@oax.a1.ornl.gov)
Tue, 15 Nov 94 17:20:25 EST

I'm a newcomer to this mailing list, so I'm a bit hesitant to open my mouth
with what amounts to an essay, but I'm not known for staying silent for long,
and particularly not when it's about something I've experienced.

When I see the messages about associating formatting with HTML tags, I break
out in cold sweats: it's like reliving the very early days of SGML, a decade
or more ago.

One of the cardinal principles we evolved with SGML was to keep a separation
between content and form. That is, we expected SGML files to deal only in
structural relationships, with processing semantics applied externally (it's
the old separation between "essence" and "accident" in Greek philosophy).

In practical applications, particularly ones that don't have a rich structural
repertoire, it's hard to keep a clean distinction. (HTML is an example of
this, with its deliberately, and wisely, _very_ small tag repertoire.) But at
least it's a goal--that's why we should lean towards generic "emphasis" tags
and shun those with hard-coded semantics like "bold" or "italic". Things also
get messy when we get into areas where we have become accustomed to confusing
visual presentation with underlying semantics, such as tables and equations.
It's hard to do a good job on those _and_ have a tagging scheme that humans
can use.
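
To illustrate the point about generic "emphasis" tags, here is a minimal
sketch in Python (not part of the original message). It assumes a single
generic <em> tag and two hypothetical output targets, "terminal" and
"typeset"; the renderer table and function name are invented for the
example. The marked-up text never changes; only the externally applied
rules do.

```python
import re

# Hypothetical presentation rules, kept outside the document itself.
# Each target decides on its own how generic emphasis should look.
RENDERERS = {
    "terminal": {"em": ("_", "_")},          # plain-text convention
    "typeset": {"em": ("\\textit{", "}")},   # LaTeX-style italics
}

def render(text, target):
    """Replace each <em>...</em> span using the rules for `target`."""
    rules = RENDERERS[target]

    def repl(match):
        opening, closing = rules[match.group(1)]
        return opening + match.group(2) + closing

    # Only the generic "em" tag is handled in this sketch.
    return re.sub(r"<(em)>(.*?)</\1>", repl, text)

source = "a <em>very</em> small tag repertoire"
print(render(source, "terminal"))
print(render(source, "typeset"))
```

Had the source said "italic" instead, the terminal target would have had
no sensible way to honor it.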

I've spent a lot of my life setting type, so I am very concerned with
controlling physical presentation of data. But I also recognize that with WWW
clients like Mosaic, we have entered a new era in which the user, rather than
the creator, of data has the power to control presentation. The change in
relationships between generator and consumer of information is going to take
some getting used to. (Actually it's not a new relationship. Some of the
hardcore in the SGML world have opposed _any_ means of communicating
processing semantics to the consumer of data, but that's bordering on the
absurd. At the very least, we've tried to say that we shouldn't create data in
ways that constrain unforeseen future applications, such as the WWW
[impractical a decade ago, but not completely out of the picture, as I have
committee papers to show]. The best thinking in the SGML community has always
been that information should be captured so that it could be reused by
multiple applications.) Nonetheless, I'm uncomfortable with suggestions that
we do much more than is already in HTML to incorporate presentation semantics.

There may be some merit in (1) adding further kinds of information
identification (read: element types/structures) in future versions of HTML and
(2) developing a means for associating style sheets -- either from the
authoring or the reading end -- with HTML documents.

That said, I'll comment that style sheets constitute a wormhole into
unspeakable universes. People start thinking they'll just set up a little file
in SGML or something else, and soon it grows uncontrollable. My own experience
with the FOSI process in the CALS program is adequate evidence. I started the
FOSI committee just to shut up some folks who were demanding print from an
application that should have been about electronic information interchange.
They fell into the "little SGML file" hole and haven't escaped yet (some of
the "little" files are 10,000+ lines). I promptly bailed out. They've kept on
adding all sorts of junk to FOSIs, and the software for interpreting FOSIs
gets more complex, and people keep doing so many ad hoc things in FOSIs that
the mess is uncontrollable. One of my associates, who is a planner for a major
program that is stuck with FOSIs, keeps feeding me an unending tale of woe
about how things have to be patched together by hand because FOSIs don't work.

On a subject on which I should perhaps defer to Yuri or Lee: the beta version
of SoftQuad's Panorama uses "a little SGML file", but SoftQuad doesn't want it
to stay that way. They'd rather go with DSSSL. That standard is truly complex,
but at least it has several years of thought in it (and the benefit of watching
the sad history of FOSIs). I don't think I'd like to edit a DSSSL specification in
EMACS, but that's as much a reflection of how complex real-world documents are
as it is a reflection of how complex DSSSL is. I don't like to edit HyTime
code directly either. I don't even like to edit SGML code, and I've been doing
that since 1981. We need cheap tools to help with all such things. HTML, even
at the 2.0 level, is editable in EMACS only because it's a restricted
application. Moving from a couple of dozen tags to the 100+ tags of most
mature applications (and I expect that in future versions of HTML) makes
assisted editing almost mandatory for all but masochists.

Much of the complexity of DSSSL is in the mechanisms for walking the
structures of SGML instance files (e.g., finding all the contexts in which a
<title> tag can appear); the rest is mapping the results of the queries of
document structure onto processing semantics. For the present, the only
processing semantics we're standardizing are for formatting. The query
language, developed by James Clark of SGMLS fame, is based on the Scheme
dialect of LISP. DSSSL is out for a second DIS (Draft International Standard)
ballot until January. If anyone is really eager to read the text (as if this
mailing list doesn't provide enough reading matter), the sources and a
PostScript rendition are at ftp://infosrv1.ctd.ornl.gov/pub/sgml/WG8/DSSSL/;
the DTD is in /pub/sgml/WG8/DTD/standard.dtd. We welcome comments on the DIS.
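
The two halves described above (querying document structure, then mapping
the results onto formatting) can be sketched roughly as follows. This is
not DSSSL and does not use its Scheme-based query language; it is a small
Python illustration using the standard xml.etree.ElementTree module on a
well-formed sample. The sample document, the function names, and the
formatting strings are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: <title> appears in three different contexts.
SAMPLE = (
    "<book><title>Style Sheets</title>"
    "<chapter><title>FOSIs</title>"
    "<section><title>History</title><para>Text.</para></section>"
    "</chapter></book>"
)

def contexts(root, tag):
    """Return the ancestor path of every element named `tag`."""
    found = []

    def walk(elem, path):
        if elem.tag == tag:
            found.append("/".join(path))
        for child in elem:
            walk(child, path + [elem.tag])

    walk(root, [])
    return found

# Hypothetical formatting rules, keyed by the immediate parent element.
STYLE_BY_PARENT = {
    "book": "24pt bold centered",
    "chapter": "18pt bold",
    "section": "14pt italic",
}

def style_titles(root):
    """Pair each <title> context with the formatting chosen for it."""
    result = []
    for ctx in contexts(root, "title"):
        parent = ctx.split("/")[-1]
        result.append((ctx, STYLE_BY_PARENT.get(parent, "default")))
    return result

root = ET.fromstring(SAMPLE)
for ctx, style in style_titles(root):
    print(ctx, "->", style)
```

Even in this toy form, most of the code goes into walking the tree; the
mapping onto formatting is a table lookup. Real documents, with 100+
element types and many more contexts, are where the complexity grows.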

Dr. James D. Mason
ISO/IEC JTC1/SC18/WG8 Convenor
Oak Ridge National Laboratory
Information Management Services
Bldg. 2506, M.S. 6302, P.O. Box 2008
Oak Ridge, TN 37831-6302 U.S.A.
Telephone: +1 615 574-6973
Facsimile: +1 615 574-6983
Network: masonjd@ornl.gov