Re: Characters in range 128-159, incl.

Murray Maloney (murray@sco.COM)
Wed, 25 Jan 95 13:54:53 EST

Stan Newton asks some good questions below.
I'll answer as best I can, but I hope that
others will jump in and straighten me out
on any opinions that may not jibe with practice
in various browsers, or on proposed future directions....
>
> Dear Murray,
>
> Since you are very knowledgeable about character sets and have indicated a
> willingness to help those more in the dark....
>
> Truetype fonts in Windows use the character range of 128-159 for several
> 'publishing' type characters, particularly the various single and double
> smart quotes. As a result, I am receiving files from Windows users for
> conversion into HTML documents which contain characters in this range and I
> don't know what to do about them.

This is a big problem because of differing definitions
for the characters in that range on various platforms.
This is why I originally got interested in this topic.

>
> Should I...
> Disallow them entirely since the HTML 2.0 spec says they are unused?
> Disallow would have to mean either substitute some other acceptable
> character on a case by case basis or remove entirely.

What you should do will depend a lot on who you are trying to serve.
If you can reliably predict that only Windows clients will be
using your HTML pages, then I think that you could use these
characters -- notwithstanding the fact that a validating
SGML parser may reject your document if these characters
are defined as SHUNCHAR (I don't recall and don't have the
SGML declaration handy, but I did recommend that a while ago).
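
To see whether a given file would trip over this, a quick check for
bytes in the 128-159 range can help. What follows is only a sketch in
Python; the function name find_c1_bytes and the sample text are made up
for illustration, and it makes no attempt to be an SGML parser.

```python
def find_c1_bytes(data: bytes):
    """Return a list of (line, column, byte) for each byte in 128-159."""
    hits = []
    line, col = 1, 0
    for b in data:
        if b == 0x0A:          # newline: start counting a new line
            line += 1
            col = 0
            continue
        col += 1
        if 128 <= b <= 159:    # the range the question is about
            hits.append((line, col, b))
    return hits


# Hypothetical sample: Windows-style smart quotes embedded in plain text.
sample = b"It\x92s a \x93test\x94\nplain line\n"
for line, col, b in find_c1_bytes(sample):
    print(f"line {line}, column {col}: byte {b}")
```

An empty result means the file contains nothing in the disputed range.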

>
> Escape them using a numeric escape sequence like &#146;? Since the spec
> doesn't specifically allow this numeric value (UNUSED remember), it would
> not seem to be valid. But I tried it in SpyGlass and NCSA Mosaics and it
> worked (i.e. was converted to the proper character) in both cases. This may
> not be true of other browsers.

Did you try it on non-Windows browsers? I suspect -- maybe Corprew
could verify -- that Mac browsers have a slightly different
mapping and that most, if not all, UNIX browsers do not handle these codes.

SCO, for example, has two browsers -- Mosaic and scohelp --
which behave a bit differently with respect to characters,
inasmuch as scohelp has support for a custom entity set
which capitalizes on the presence of the Adobe symbol
font on most (if not all) X Window System font servers.
We've accepted the fact that non-scohelp browsers will not
be able to read these characters, but we had to have them
for SCO's online doc and context-sensitive help.

So, like us, if you decide to pass these characters through,
you could end up with the wrong character showing up on some
browsers, and no character showing up on others.

I think that it would be a good idea for browsers to agree on
a convention for displaying a consistent symbol -- TBD -- when
a character code is found in an HTML document for which there
is no corresponding glyph available for display. In the olden
days, we used to use the DEL glyph -- a gray or solid box --
for this purpose on character terminals. But that doesn't help
you right now.

Anyway, the bottom line on character codes is that there is
no agreement among the various OS environments as to the
character code to glyph mapping in the upper 128 characters.
So, no matter what we agree to do for HTML user agents,
people creating HTML on these platforms will not be able
to get what they expect consistently. By that I mean
that someone using an editor on Windows and someone using
one on the Mac will each insert a different code for the
same glyph, and you won't be able to determine which is
which unless you know the source. That will probably renew
usage of, and add new meaning to, the epithet
"Consider the source". ;-}
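
The mismatch is easy to demonstrate. The sketch below uses Python's
standard codecs, on the assumption that "cp1252" and "mac_roman" are
fair stand-ins for the Windows and Mac mappings under discussion. It
decodes byte 146 three ways, then encodes one glyph on both platforms.

```python
raw = bytes([146])  # one byte from the disputed 128-159 range

as_windows = raw.decode("cp1252")    # Windows code page 1252
as_mac = raw.decode("mac_roman")     # Mac OS Roman
as_latin1 = raw.decode("latin-1")    # ISO 8859-1: a control code, no glyph

print(repr(as_windows), repr(as_mac), repr(as_latin1))

# Going the other way: the same glyph gets a different code per platform.
quote = "\u2019"  # right single quotation mark
print(quote.encode("cp1252"), quote.encode("mac_roman"))
```

Under those assumptions the same byte is a right single quote on
Windows and an accented "i" on the Mac, while the quote itself lives at
146 on Windows and at 213 on the Mac.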
>
> I'm trying to produce compliant HTML 2.0 documents but I'm struggling with
> this one.

For HTML 2.0, I am sad to tell you that there is no reliable
alternative to simply not using character codes or numeric entity
references in the range 128-159. You've got ASCII and Latin 1
character codes, and you've got the various numeric entities
that go beyond what is in the Latin 1 character set.
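
If you do go the case-by-case substitution route, it might look
something like the following. This is a sketch only: the table
ASCII_SUBSTITUTES and the function strip_c1_range are hypothetical
names, the table assumes the files really are in Windows code page
1252, and it covers just a handful of the publishing characters.

```python
# Hypothetical table: Windows-1252 bytes mapped to plain ASCII stand-ins.
ASCII_SUBSTITUTES = {
    0x85: b"...",  # horizontal ellipsis
    0x91: b"'",    # left single quote
    0x92: b"'",    # right single quote
    0x93: b'"',    # left double quote
    0x94: b'"',    # right double quote
    0x96: b"-",    # en dash
    0x97: b"--",   # em dash
}


def strip_c1_range(data: bytes, unknown: bytes = b"?") -> bytes:
    """Replace every byte in 128-159; leave all other bytes untouched."""
    out = bytearray()
    for b in data:
        if 128 <= b <= 159:
            out += ASCII_SUBSTITUTES.get(b, unknown)
        else:
            out.append(b)
    return bytes(out)


print(strip_c1_range(b"\x93It\x92s done\x94 \x97 mostly\x85"))
```

Anything in the range that is not in the table comes out as "?", so it
can be found and dealt with by hand; Latin 1 codes from 160 up pass
through unchanged.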
>
> Can you please help?

I could try to help you match up your character codes with
existing mechanisms, but I suspect that you have already
done that. I'll definitely help work with everyone to
define more extensive mechanisms and syntax for HTML 2.N
and HTML 3.0. I have already suggested that we add as
many as possible/practical of the entities defined in
the appendix of the SGML spec (ISO 8879:1986), but there
has not been a lot of discussion on that for a while.
I don't think that anyone is opposed, in principle,
but there are practical issues -- like agreeing on
a set of fonts for each platform/browser combination
that contain the required character repertoires.

Can you think of anything else that I can do to help you out?
Can anyone else add to or comment on my reply?
>
> Stan Newton
> Newton Computing Solutions
>
===========================================================================
---------------------------------------------------------------------------
Murray C. Maloney Internet: murray@sco.com
Technical Publications Writer/Architect Uucp: ...uunet!sco!murray
SCO Canada, Inc. My Phone: (416) 960-4031
130 Bloor Street West, 10th Floor Fax: (416) 922-2704
Toronto, Ontario, Canada M5S 1N5 SCO Phone: (416) 922-1937
===========================================================================
Disclaimer: I'm speaking for myself. 'T ain't nobody else to blame but me.
---------------------------------------------------------------------------
Sponsor member of Davenport Group (ftp://ftp.ora.com/pub/davenport/)
Member of IETF HTML Working Group (http://www.hal.com/%7Econnolly/html-spec/)
Member of SGML Open Internet and WWW Technical Committee
===========================================================================