Any opinions?
I started working on the problem of recognizing structure from HTML documents
after having implemented a system that did the same for TeX/LaTeX. I was
hoping that HTML would be easier to extract structure from.
Far from it; it's been a struggle. What's worse, most of the SGML tools seem
to be totally incomprehensible. Every DTD or specification document I read is
liberally littered with ISO standard numbers (which make no sense to me), and
though I know I should not complain about surface syntax, I find the syntactic
presentation of DTDs extremely difficult to absorb.
Considering that an HTML document represents a fairly simple hierarchical
structure, why not start describing it as such?
This would make the task of writing parsers easier, and also
encourage good HTML.
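To make "describing it as such" concrete, here is a minimal sketch in Python
of reading a document into a plain tree of elements and text. It assumes the
standard library's html.parser module; the names Node, TreeBuilder, parse,
outline, and the short VOID list are my own illustrative choices, not taken
from any HTML spec or DTD.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (hypothetical, incomplete list).
VOID = {"br", "hr", "img"}

class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # child Node objects or text strings

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name, if any;
        # stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(text)

def parse(text):
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root

def outline(node, depth=0):
    # Render the tree as indented lines, one per element or text run.
    lines = ["  " * depth + node.tag]
    for child in node.children:
        if isinstance(child, Node):
            lines.extend(outline(child, depth + 1))
        else:
            lines.append("  " * (depth + 1) + repr(child))
    return lines

if __name__ == "__main__":
    doc = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print("\n".join(outline(parse(doc))))
```

The point is only that the hierarchy itself fits in a few dozen lines; what
the sketch does not capture is which nestings are legal, and that is exactly
the part a readable specification should spell out.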
Currently, the definition of valid HTML is so inaccessible even to the
practicing computer scientist, let alone the author of a document, that the
only validation being used is "Does Mosaic display this document?"
according to some subjective measure of "correct display".
At present, I have a hard time understanding, for example, what kinds of nesting
are allowed by a particular DTD. When reading the HTML spec, I just resorted
to the descriptive statements for each of the elements.
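As an illustration of the kind of description I would find readable, here is
a sketch in Python that states the nesting rules as a plain table of parent
to allowed children and checks a document against it. The ALLOWED table is
hypothetical and heavily simplified; its entries are made up for the example
and are not transcribed from any HTML DTD. NestingChecker and check are
likewise illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical, simplified nesting table: parent -> allowed child elements.
# Illustrative only; not taken from any actual HTML DTD.
ALLOWED = {
    "html": {"head", "body"},
    "head": {"title"},
    "title": set(),
    "body": {"h1", "p", "ul"},
    "h1": {"em"},
    "p": {"em"},
    "ul": {"li"},
    "li": {"em", "p", "ul"},
    "em": set(),
}

class NestingChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open = []    # names of currently open elements, innermost last
        self.errors = []  # human-readable nesting complaints

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            self.errors.append(f"unknown element <{tag}>")
        elif self.open and tag not in ALLOWED.get(self.open[-1], set()):
            self.errors.append(f"<{tag}> not allowed inside <{self.open[-1]}>")
        self.open.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore stray end tags.
        if tag in self.open:
            while self.open.pop() != tag:
                pass

def check(text):
    checker = NestingChecker()
    checker.feed(text)
    checker.close()
    return checker.errors

if __name__ == "__main__":
    print(check("<html><body><ul><p>oops</p></ul></body></html>"))
```

A real DTD says more than this (ordering, optional end tags, and so on), so
the table is a simplification, but a reader can at least see at a glance
what may appear inside what.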
I spent considerable time installing and understanding SGMLS a couple of months
ago, and after a hard fight even managed to find a DTD for HTML on the net, as
well as the other files necessary to make SGMLS parse simple HTML
documents. But the whole process of running SGMLS is so obfuscated that I can't
now remember all the things I needed to do. After retrieving the latest
DTD, I gave up on trying to validate my documents with it and SGMLS as a
waste of time.
This state of affairs is frightening!
I may be wrong, but I somehow get the impression that the whole story
regarding SGML/HTML has been made more complicated and obfuscated than it needs
to be.
I know these are radical statements, but something needs to be done to make
SGML/HTML validation and processing more palatable, or we'll have to spend the
rest of our careers retrofitting our documents to kluges like Mosaic.
--Raman