Re: new HTML spec, sample implementation

Tim Berners-Lee <timbl@www3.cern.ch>
Date: Fri, 8 Jan 93 16:14:15 +0100
From: Tim Berners-Lee <timbl@www3.cern.ch>
Message-id: <9301081514.AA02989@www3.cern.ch>
To: Dan Connolly <connolly@pixel.convex.com>
Subject: Re: new HTML spec, sample implementation
Cc: www-talk@nxoc01.cern.ch
Reply-To: timbl@nxoc01.cern.ch


>  Date: Wed, 06 Jan 93 19:23:43 CST
>  From: Dan Connolly <connolly@pixel.convex.com>
>  

>  I just uploaded the following to info.cern.ch:/pub/incoming
>  libHTML-930106.tar.Z
>  html_spec-930106.tar.Z

	Transferred to /pub/www/dev
	
	The spec available as:
	http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/930106/HTML.html
	
	The older versions are all available as well.. link to a list of them
	from the Markup futures page.
	
	Are the ISO latin 1 characters in  
http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/921203/ISOlat1.html
	in order? It would be useful to have their character codes there.
	
	I think it is a very wise move to define latin 1 as the base
	character set for HTML.  Non-anglophones in Europe really can't get
	 very far without it.  I remember their urgent pleas for gopher
	 to go latin1 retrospectively.
	 

	 I feel that it would be friendly to allow EITHER the names OR the
	 numbers for funny characters, including &lt; .
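
Accepting either form could look roughly like the following. This is only a minimal sketch in C, not code from libHTML or the WWW library: the function name resolve_char_ref and the small table are made up for illustration, and the table holds just a handful of entries with their ISO Latin 1 codes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative subset only: lt/gt/amp plus two "Added Latin 1" names.
 * The values are ISO Latin 1 character codes. */
static const struct { const char *name; int code; } named_refs[] = {
    { "lt", 60 }, { "gt", 62 }, { "amp", 38 },
    { "eacute", 233 }, { "yuml", 255 },
};

/* Hypothetical helper: resolve the text between '&' and ';'
 * (e.g. "lt", "#60", "yuml"). Returns the Latin 1 code, or -1 if the
 * reference is not recognised. */
static int resolve_char_ref(const char *ref)
{
    size_t i;

    assert(ref != NULL);
    if (ref[0] == '#') {                      /* numeric form: &#60; */
        char *end;
        long n;

        if (ref[1] < '0' || ref[1] > '9')     /* need at least one digit */
            return -1;
        n = strtol(ref + 1, &end, 10);
        if (*end != '\0' || n < 0 || n > 255) /* trailing junk or out of range */
            return -1;
        return (int)n;
    }
    for (i = 0; i < sizeof named_refs / sizeof named_refs[0]; i++)
        if (strcmp(ref, named_refs[i].name) == 0)   /* named form: &lt; */
            return named_refs[i].code;
    return -1;
}
```

With this, both "lt" and "#60" come back as 60, so a document author can use whichever form is more convenient.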
	
>  WHERE DO WE GO FROM HERE?
>  

>  * registering HTML with the IANA
>  

>  The spec is a hypertext. We need a plain text document
>  for the IANA. This is complicated by the fact that
>  much of the spec is "by example," that is, tolerated.html
>  demonstrates the tolerated techniques much better than
>  it explains them.

I agree that this complicates it.  The explanation is in large part in
files which are likely to be unreadable by many browsers!  I would
like to see the whole of the content in "proper" HTML. The most useful way
to put in examples is to make little files Example_nn.html and put in soft
links Example_nn.txt to the same files. This allows people to look simply at
the source of the example with a browser, and also to try their browser out on
it as a hypertext document.

We would then generate a paper document out of all the hypertext and the .txt
files, and the document generator won't fall over!

I would also like to see a node on each tag.  I found from user comments that  
the current explanations of HTML are just not good enough. We need some stuff on
the side of the spec explaining what each tag is actually FOR, with an example
of each, accessed from a reference section.  Maybe that can wait till after
IANA registration, though.

>  But I think the files HTML.html, Text.html, and html.dtd
>  make a workable spec. html.dtd has all the information,
>  HTML.html motivates it, and Text.html gives enough background
>  to read it.

>  * bringing implementations into compliance
>  

>  LineMode -- Tim: I'd like to use SGML_read to do the lexical
>  stuff in the linemode browser. I haven't thought much about
>  EBCDIC support, but it shouldn't be too difficult. I think
>  SGML_read will fit neatly between HTParseFormat() and
>  HTGetCharacter().

It sounds as though HTML.c should be turned into a structured hypertext
object built on top of -- in the line mode case only -- the HText.c object.

I feel that the SGML parsing object should create the text object, because
one SGML file could cause many displayable objects to be created -- in the
future at least.  Therefore, it is not enough to pass the SGML object the ID
of an already-created text object. Does that make sense?
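
To make the idea concrete, here is a rough sketch in C of a parser that is handed a factory rather than a ready-made text object. Everything in it is hypothetical and is not the HText.c or SGML_read interface: the TextObject and Pool types, the names sgml_parse and pool_make, and above all the rule that a form feed starts a new displayable object, which is used only to have some boundary to demonstrate with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical text object: it just accumulates characters. */
typedef struct {
    char   buf[256];
    size_t len;
} TextObject;

/* Factory supplied by the application; the parser calls it on demand. */
typedef TextObject *(*TextFactory)(void *ctx);

/* Illustration-only rule: a form feed ends the current object, and the
 * next character starts a new one. Returns the number of objects
 * created, or -1 if the factory fails. */
static int sgml_parse(const char *src, TextFactory make, void *ctx)
{
    TextObject *cur = NULL;
    int created = 0;

    assert(src != NULL && make != NULL);
    for (; *src != '\0'; src++) {
        if (*src == '\f') {          /* boundary: drop the current object */
            cur = NULL;
            continue;
        }
        if (cur == NULL) {           /* the parser, not the caller, creates it */
            cur = make(ctx);
            if (cur == NULL)
                return -1;
            created++;
        }
        if (cur->len + 1 < sizeof cur->buf) {   /* silently truncate if full */
            cur->buf[cur->len++] = *src;
            cur->buf[cur->len] = '\0';
        }
    }
    return created;
}

/* Example factory: hands out objects from a small fixed pool. */
typedef struct {
    TextObject pool[4];
    int        used;
} Pool;

static TextObject *pool_make(void *ctx)
{
    Pool *p = ctx;

    if (p->used >= 4)
        return NULL;
    p->pool[p->used].len = 0;
    p->pool[p->used].buf[0] = '\0';
    return &p->pool[p->used++];
}
```

The caller never has to know in advance how many objects a file will produce; it only supplies the means of making one.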

Otherwise I think our architectures are converging ... do you want to make
a new WWW/Library and I'll try it out?  I can release it in parallel with
the current one for a while for safety.

>  NeXT browser -- Tim: I'd like to see the stuff on info.cern.ch
>  use the HEAD/BODY elements and &#60 instead of &lt. If you
>  use the NeXT browser to maintain this stuff, or if anybody else
>  uses the NeXT browser, I'd like to see it brought up to date.

HEAD/BODY elements are in all new text, but not numeric character refs.
It's on the list.
...
>  Now about that last few changes to the HTML spec...
>  

>  INLINE ELEMENTS
>  

>  I added <em>, <samp>, <code>, and several other elements,
>  inspired by TeXinfo. We need these to support conventional
>  technical documentation. The list is not exhaustive, but
>  I think it's pretty good.

The need for a reference section is highlighted by my inability to find these  
in the spec.
  

>  NUMERIC CHARACTER REFERENCES
>  

>  I have learned a few things about SGML and made a few
>  decisions biased toward simplicity. As a result, I think
>  the spec is a little smaller and the sample implementation
>  is a little cleaner.
>  

>  Most notably, I have introduced numeric character references
>  to the HTML spec. These were in SGML all along, but I didn't
>  understand them fully.
>  

>  This raises the issue of character sets. The character set
>  in html.dtd is ISO646, i.e. ASCII. Everybody using html.dtd
>  agrees on the correspondence between the numerals 0-127
>  and the ASCII characters they represent. So to represent
>  a '<' character, we'll write "&#60;". This obsoletes
>  the lt, gt, and amp entities.
>  

>  On the other hand, I did not include an 8-bit character set.
>  So the meaning of "&#255" is not defined. The HTML DTD references
>  "ISO 8879:1986//ENTITIES Added Latin 1//EN" instead of
>  "ISO Registration Number 109//CHARSET ECMA-94
>  Right Part of Latin Alphabet Nr. 3//ESC 2/13 4/3". So we'll write
>  "&yuml;" for the character that corresponds to position 255
>  in the ISO-8859 encoding.

I would like to see an 8-bit basic character set as I said in some other  
message I think.  (For those Finns  and p{op]e li^e th{t.  :-)

>  In the sample implementation, numeric character references
>  are invisible to the application: the translation from
>  "&#60;" to '<' happens inside the SGML_read routine. On
>  the other hand, entity references like "&yuml;" are
>  handed back to the application for processing.

I would have thought the SGML engine ought to be able to resolve all entities.
The communication between it and the text object would have to be defined in  
terms of a particular character set, and perhaps if there is more than one  
contender the SGML engine could have a compilation option.

That is the easiest to implement.  It makes the implementation outside SGML the
least SGML-aware.
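
A compilation option of that kind might be sketched in C as below. This assumes the engine has already resolved each reference to an ISO Latin 1 code; the switch SGML_TARGET_ASCII, the macro SGML_EMIT and both function names are invented for the example, and the ASCII fallbacks are deliberately crude.

```c
#include <assert.h>

/* Latin 1 target: every code 0..255 is passed through unchanged. */
static int sgml_to_latin1(int code)
{
    assert(code >= 0 && code <= 255);
    return code;
}

/* ASCII target: codes above 127 get a rough 7-bit fallback. */
static int sgml_to_ascii(int code)
{
    assert(code >= 0 && code <= 255);
    if (code < 128)
        return code;
    switch (code) {
    case 233: return 'e';   /* e acute */
    case 255: return 'y';   /* y diaeresis */
    default:  return '?';   /* no fallback in this small sketch */
    }
}

/* Hypothetical build switch: compile with -DSGML_TARGET_ASCII for an
 * engine that talks 7-bit ASCII to the text object; otherwise Latin 1. */
#ifdef SGML_TARGET_ASCII
#define SGML_EMIT(code) sgml_to_ascii(code)
#else
#define SGML_EMIT(code) sgml_to_latin1(code)
#endif
```

The text object then only ever sees characters in the one set chosen at build time, and needs no knowledge of entity names at all.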


>  Dan
>  


Tim