Re: new HTML spec, sample implementation

Dan Connolly <connolly@pixel.convex.com>
Message-id: <9301081957.AA07944@pixel.convex.com>
To: timbl@nxoc01.cern.ch
Cc: www-talk@nxoc01.cern.ch
Subject: Re: new HTML spec, sample implementation 
In-reply-to: Your message of "Fri, 08 Jan 93 16:14:15 +0100."
             <9301081514.AA02989@www3.cern.ch> 
Date: Fri, 08 Jan 93 13:57:32 CST
From: Dan Connolly <connolly@pixel.convex.com>

[Tim: could you limit your lines to 72 chars like the
rest of the world? It's a pain to deal with lines that
have been split by weird MTAs.]

>	The spec available as:
>	http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/930106/HTML.html
>	
>	The older versions are all available as well.. link to a list of them
>	from the Markup futures page.

Unfortunately, the most recent version of the spec is the hardest to
find. I'd like most of the pointers to the older versions to be
updated. And I'd like to see the instructions to data providers
brought up to date. Currently, most of the documentation (especially
the examples) doesn't say to quote HREFs, for instance.

>>WHERE DO WE GO FROM HERE?
>>  * registering HTML with the IANA
>>  much of the spec is "by example," that is, tolerated.html
>>  demonstrates the tolerated techniques much better than
>>  it explains them.
>
>I agree that this complicates it.  The explanation is in large part in
>files which are likely to be unreadable by many browsers!

The "explanation" is nothing more than an implementor's guide. I
thought that implementors would get more from example HTML code
than any prose I could come up with in a reasonable amount of
time. I agree that much is needed in the way of real documentation.

>We would then generate a paper document out of all the hypertext and the .txt
>files, and the document generator won't fall over!
>
>I would also like to see a node on each tag.  I found from user comments that 
>the current explanations of HTML are just not good enough. We need some stuff on
>the side of the spec explaining what each tag is actually FOR, with an example
>of each, accessed from a reference section.  Maybe that can wait till after
>IANA registration, though.

My point exactly: for the IANA, all we need is a correct spec,
not necessarily an easy to use spec. That's what I meant
by "workable."

I don't have time to do a good job explaining how to use HTML.
I just wanted to explore all the technical issues and clear
them up.

>>  * bringing implementations into compliance
>
>It sounds as though (1) HTML.c should be turned into a structured hypertext
>object built on top of -- in the line mode case only -- the HText.c object.

Agreed.

>I feel that the SGML parsing object should create the textobject, because
>one SGML file could cause many displayable objects to be created -- in the  
>future at least.  Therefore, it is not enough to pass the SGML object the ID
>of an already-created text object. Does that make sense?

At first it didn't, but upon reading it again, it's exactly what
I had in mind.

>Otherwise I think our architectures are converging ... do you want to make
>a new WWW/Library and I'll try it out?  I can release it in parallel with
>the current one for a while for safety.

If I find time, I'd like to.

>	Are the ISO latin 1 characters in  
>http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/921203/ISOlat1.html
>	in order? it would be useful to have their character code there.

This question seems to confuse two things: the ISOlat1 entity
set, and the ISO Latin 1 character set. The first is a mapping
of names to glyphs, and the second is a mapping from the numbers
128-255 to glyphs. I think they're in alphabetical order
by name, but not in order by the ISO Latin 1 character set.
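
To make the distinction concrete, here is a rough sketch in Python.
The table is a hand-picked excerpt that I am assuming for
illustration (it is not copied from ISOlat1.html), and by_name and
by_code are hypothetical helper names. It only shows that sorting
the same entities by name and by Latin 1 code position gives two
different orders.

```python
# Hypothetical excerpt: a few ISOlat1 entity names paired with the
# ISO Latin 1 code positions of the characters they name.
ISOLAT1_SAMPLE = {
    "Aacute": 0xC1,
    "aacute": 0xE1,
    "Ouml": 0xD6,
    "ouml": 0xF6,
    "szlig": 0xDF,
    "eacute": 0xE9,
}


def by_name(table):
    """Entity names in alphabetical order (capital before small)."""
    return sorted(table, key=lambda name: (name.lower(), name))


def by_code(table):
    """Entity names ordered by their Latin 1 code position."""
    return sorted(table, key=table.get)


if __name__ == "__main__":
    print("by name:", by_name(ISOLAT1_SAMPLE))
    print("by code:", by_code(ISOLAT1_SAMPLE))
```

A listing in the first order is easy to read as a reference; one in
the second order is what you would want if the codes were printed
next to the names.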

>I would have thought the SGML engine ought to be able to resolve all entities.

The SGML parser is responsible for resolving text entities. There
are also external data entities, which are ultimately resolved by
the application. I was leaning towards treating ISO Latin 1 characters
as external data entities, rather than text entities. If we treat
ISO Latin 1 characters as SGML characters, we eliminate the need for
entity references in HTML altogether (we could still support &ouml,
&lt, etc., however).

Here is the crux of the matter:

>The communication between it and the text object would have to be defined in  
>terms of a particular character set

And this character set is stated in the SGML declaration at
the top of html.dtd.

If we define HTML in terms of the
full ISO Latin 1 character set, then the parser can deal with
&ouml, and pass it to the text object as a data character, just
like an 'A' character. For X displays using iso8859 fonts, that's
cool.

But on a PC or a Mac, that means the text object will have to
scan all the data it gets and convert the Latin1 encoding to
its own. Yuck.
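
A rough sketch of that conversion step, in Python for brevity. The
platform names and the choice of mac_roman and cp437 as the native
encodings are my assumptions for illustration, and to_native is a
hypothetical function, not anything in wwwlib.

```python
# Assumed native encodings per platform, chosen for illustration only.
NATIVE_ENCODINGS = {
    "x11": "latin-1",    # Latin 1 fonts: nothing to convert
    "mac": "mac_roman",
    "pc": "cp437",
}


def to_native(latin1_data: bytes, platform: str) -> bytes:
    """Re-encode Latin 1 data bytes for the platform's own character set.

    Characters the target set lacks are replaced with '?'.
    """
    text = latin1_data.decode("latin-1")
    return text.encode(NATIVE_ENCODINGS[platform], errors="replace")


if __name__ == "__main__":
    data = b"K\xf6ln"  # "Koln" with o-umlaut, as Latin 1 data characters
    for platform in NATIVE_ENCODINGS:
        print(platform, to_native(data, platform))
```

On the X11 side the bytes pass through untouched; on the other two
every byte of data has to be looked at, which is the scan I was
complaining about.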

>... and perhaps if there is more than one  
>contender the SGML engine could have a compilation option.

Hmmm... One might argue that as long as we support conversion inside
the SGML parser for EBCDIC machines, we might as well support PC and
Mac character sets while we're at it.

My original plan was that the core of wwwlib would support
ASCII only, and application developers would deal with latin
characters by name. Moving Latin-1 characters into the core
complicates it somewhat, and I'm still against that.
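
For comparison, a minimal sketch of the by-name arrangement, again
in Python. Everything here is hypothetical (parse, render, the
callback names, the little table); it is not the wwwlib interface,
and the pattern is far looser than real SGML entity recognition.
The core handles only ASCII and reports each entity reference by
name; the application decides what to display.

```python
import re

# Loose pattern for an entity reference; the trailing ';' is optional.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def parse(text, on_data, on_entity):
    """ASCII-only core: pass data through, report entities by name."""
    pos = 0
    for match in ENTITY_REF.finditer(text):
        if match.start() > pos:
            on_data(text[pos:match.start()])
        on_entity(match.group(1))
        pos = match.end()
    if pos < len(text):
        on_data(text[pos:])


# Hypothetical application-side table for a display with Latin 1 fonts.
LATIN1_DISPLAY = {"ouml": "\u00f6", "lt": "<", "amp": "&"}


def render(text, table=LATIN1_DISPLAY):
    """Application side: turn entity names into displayable characters."""
    out = []
    parse(text, out.append, lambda name: out.append(table.get(name, "?")))
    return "".join(out)


if __name__ == "__main__":
    print(render("K&ouml;ln &lt; Bonn"))
```

A PC or Mac application would supply its own table instead of
LATIN1_DISPLAY, and the core would not need to know about either.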

Dan