Re: new HTML spec, sample implementation

Tim Berners-Lee <timbl@www3.cern.ch>
Date: Tue, 12 Jan 93 09:39:27 +0100
From: Tim Berners-Lee <timbl@www3.cern.ch>
Message-id: <9301120839.AA00817@www3.cern.ch>
To: Dan Connolly <connolly@pixel.convex.com>
Subject: Re: new HTML spec, sample implementation 
Cc: www-talk@nxoc01.cern.ch
Reply-To: timbl@nxoc01.cern.ch

>  Date: Fri, 08 Jan 93 13:57:32 CST
>  From: Dan Connolly <connolly@pixel.convex.com>
>  

>  This question seems to confuse two things: the ISOlat1 entity
>  set, and the ISO Latin 1 character set. The first is a mapping
>  of names to glyphs, and the second is a mapping from the numbers
>  128-255 to glyphs. I think they're in alphabetical order
>  by name, but not in order by the ISO Latin 1 character set.

I think we should specify ISO Latin 1 as the base set.  I think that
a lot of people in the Nordic countries use it routinely and they
will go crazy if they have to overload the curly brackets again
as they have to with mail.

Therefore, we should allow those people who have 8-bit capability to
just stick in 8-bit codes.  Admittedly I thought the ISO world kept to
the codes 21-7E and A1-FE hex for G0 and G1 graphics sets, using the
others for control sets (C0 and C1). Maybe ISO Latin 1 has nothing
to do with ISO 8 bit extensions. Sorry I can't quote ISO numbers.
But whatever is common usage, let us have an 8 bit set.

(Anybody illuminate us on this?  Anybody got the ISO Latin 1  
character set listing by number?)
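
A minimal sketch of the code ranges mentioned above, and of a listing
by number. The function names are hypothetical; the listing relies on
Python's built-in "latin-1" codec. Note that in ISO 8859-1 as published
the codes A0 and FF hex are also assigned (no-break space and y with
diaeresis), so the A1-FE range recalled above is treated here as an
assumption and those two codes are reported separately.

```python
def classify(code: int) -> str:
    """Classify an 8-bit code using the ranges recalled in the message."""
    if not 0 <= code <= 0xFF:
        raise ValueError("not an 8-bit code")
    if code <= 0x1F:
        return "C0"          # first control set
    if 0x21 <= code <= 0x7E:
        return "G0"          # 7-bit graphic characters
    if 0x80 <= code <= 0x9F:
        return "C1"          # second control set
    if 0xA1 <= code <= 0xFE:
        return "G1"          # 8-bit graphic characters
    return "other"           # 0x20, 0x7F, 0xA0, 0xFF


def latin1_listing(start: int = 0xA0, end: int = 0xFF):
    """Return (number, character) pairs for the given code range."""
    return [(n, bytes([n]).decode("latin-1")) for n in range(start, end + 1)]


if __name__ == "__main__":
    for number, char in latin1_listing():
        print(f"{number:3d}  {number:02X}  {classify(number):5s}  {char}")
```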

Now for dyed-in-the-wool 7-bit hackers, is it fair to require them to
remember numbers, or would it be nicer to allow them to put in
codes using entity names?  Some people would, I am sure, like the
latter, but it is NOT important because we are aiming for wysiwyg
editors and so would regard human-readable character names as a
temporary thing anyway.


>  Here is the crux of the matter:
>  

>  >The communication between it and the text object would have to be
>  >defined in terms of a particular character set
>  

>  And this character set is stated in the SGML declaration at
>  the top of html.dtd.

No - that is something different. In the top of the DTD is specified  
the reference base set for the DTD itself and SGML documents.
The interface between two software modules is something else and can  
be independent of that.

>  If we define HTML in terms of the
>  full ISO Latin 1 character set, then the parser can deal with
>  &ouml, and pass it to the text object as a data character, just
>  like an 'A' character. For X displays using iso8859 fonts, that's
>  cool.


Sorry, is iso8859 = ISO Latin 1?  (I have no head for numbers >1 :-)

Yes, it is cool. Use Midas or Viola to look at the Hyper-G stuff and
it works very nicely.

>  But on a PC or a Mac, that means the text object will have to
>  scan all the data it gets and convert the Latin1 encoding to
>  its own. Yuck.

Yup. Big deal?  Not really. Just a set of parallel tables.  Peter  
Flynn of the CURIA project is producing a lot of archived Gaelic and
is currently dealing with a requirement for a line-mode browser which
can switch its character set depending on the terminal emulator the
reader is using.
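
As a rough sketch of the "parallel tables" idea: one 256-entry table
per local character set, selected at run time by the terminal type.
The names build_table, TABLES and to_local are hypothetical, and the
tables are derived from Python's standard "mac_roman" and "cp437"
codecs rather than written out by hand. Characters with no single-byte
equivalent in the local set fall back to "?" here, which is an
assumption of this sketch and not something the message specifies.

```python
def build_table(local_codec: str, fallback: bytes = b"?") -> bytes:
    """Build a 256-byte table mapping Latin 1 codes to a local charset."""
    table = bytearray(256)
    for code in range(256):
        char = bytes([code]).decode("latin-1")
        try:
            local = char.encode(local_codec)
        except UnicodeEncodeError:
            local = fallback          # no equivalent in the local set
        table[code] = local[0] if len(local) == 1 else fallback[0]
    return bytes(table)


# One table per supported terminal character set, built once at start-up.
TABLES = {name: build_table(name) for name in ("latin-1", "mac_roman", "cp437")}


def to_local(data: bytes, terminal: str) -> bytes:
    """Convert Latin 1 data to the character set of the given terminal."""
    return data.translate(TABLES[terminal])
```

Switching character set is then just a matter of picking a different
table, for example to_local(text, "cp437") for a PC terminal emulator.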

Problems only occur if there are characters which can't be mapped 1-1
to the local set, and must be represented by more than one character
(like u-umlaut -> ue, ae diphthong -> ae etc) AND you can edit, in
which case the original form must be preserved. In this case, passing
on the entity is essential.  But doing it for every character >127
would be a pain memory-wise. So I would suggest that a configurable
table define which entities can be crunched down to a single
character in the local representation, and the rest be passed on from
the SGML parser to the SGML app as external entities.
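
A minimal sketch of that suggestion, not a real SGML parser: entities
listed in a configurable table are crunched to a single data
character, and all others are handed on by name so that an editor can
preserve the original form. CRUNCHABLE, parse and the token shapes are
hypothetical, and the entity syntax is simplified to "&name" with an
optional ";".

```python
import re

# Hypothetical per-platform configuration: entity name -> one local character.
CRUNCHABLE = {"ouml": "\u00f6", "eacute": "\u00e9"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def _emit(tokens, kind, value):
    """Append a token, merging it into a preceding run of data."""
    if kind == "data" and tokens and tokens[-1][0] == "data":
        tokens[-1] = ("data", tokens[-1][1] + value)
    else:
        tokens.append((kind, value))


def parse(text, crunchable=CRUNCHABLE):
    """Split text into ("data", str) and ("entity", name) tokens."""
    tokens = []
    pos = 0
    for match in ENTITY.finditer(text):
        if match.start() > pos:
            _emit(tokens, "data", text[pos:match.start()])
        name = match.group(1)
        if name in crunchable:
            _emit(tokens, "data", crunchable[name])   # single local character
        else:
            _emit(tokens, "entity", name)             # passed on unchanged
        pos = match.end()
    if pos < len(text):
        _emit(tokens, "data", text[pos:])
    return tokens
```

Only the entities that are not in the table cost anything extra in the
application; everything else arrives as ordinary data characters.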

>  >... and perhaps if there is more than one
>  >contender the SGML engine could have a compilation option.
>  

>  Hmmm... One might argue that as long as we support conversion inside
>  the SGML parser for EBCDIC machines, we might as well support PC and
>  Mac character sets while we're at it.

Yes.

Tim