Re: Unicode browsers (was: Re: Comments on: "Character Set" Considered Harmful)

Bob Jung (bobj@netscape.com)
Thu, 27 Apr 95 19:21:15 EDT

At 10:15 PM 4/26/95, Gavin Nicol wrote:
>>>There are many small issues, but from my experience, and Amanda and
>>>others will verify this, implementing a Unicode based application is
>>>*far* easier than trying to support even a small number of coded
>>>character sets and encodings.
>>
>>I disagree.
>>
>>Supporting canonical Unicode will require major changes to parser and
>>layout engines. Supporting ASCII-superset encodings is relatively easy
>>and in many cases more efficient. UTF8 is an ASCII-superset and would
>>fall in the easy to support bucket.
>
>Only because you designed your system with Latin-1 as a basic
>assumption.

Partly true. But there are other issues: performance, resource usage,
round-trip conversions and extensibility.

Performance: In the current state of the world there are no Unicode fonts on
the systems that we ship our products on and there is no Unicode Web data.
A Unicode-based design therefore requires at least 2 conversions of the data
(e.g., Latin1 -> Unicode and Unicode -> Latin1, SJIS -> Unicode and
Unicode -> SJIS). With Netscape's current architecture it often needs NO
conversions.
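
To make the two-conversion point concrete, here is a rough sketch. It is not
Netscape code: Python's built-in codecs stand in for the converters, and the
names via_unicode and pass_through are made up for illustration.

```python
# Sketch only: Python's built-in codecs stand in for a browser's converters.

def via_unicode(data: bytes, encoding: str) -> bytes:
    """Unicode-internal design: two conversion passes over the data."""
    text = data.decode(encoding)   # pass 1: native encoding -> Unicode
    return text.encode(encoding)   # pass 2: Unicode -> native encoding (font)

def pass_through(data: bytes) -> bytes:
    """Native-encoding design: hand the bytes to the font as they arrived."""
    return data

latin1_page = "caf\u00e9".encode("latin-1")
sjis_page = "\u65e5\u672c\u8a9e".encode("shift_jis")

for page, enc in ((latin1_page, "latin-1"), (sjis_page, "shift_jis")):
    # Both routes end with the same bytes; the Unicode route simply costs
    # two extra passes (plus the tables behind decode/encode).
    assert via_unicode(page, enc) == pass_through(page)
```

For data that is fully covered by the converter tables the result is
identical, so the extra work buys nothing in the single-encoding case.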

Resource Usage: The converters needed would require large tables and probably
extra buffers. Users who do not need multi-lingual support (the majority)
would be impacted for no benefit. There's interest in putting browsers on
smaller and smaller devices, so memory is still an issue.

Round-trips: User-defined areas will not survive the round-trip conversions
to-and-from Unicode. Without the conversion, we have a chance that it will
work for the intended target (e.g., Acme Corp's SJIS data displayed on
Acme Corp's system).
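
A small sketch of the round-trip problem, under some assumptions: Python's
"shift_jis" codec stands in for a converter whose tables have no entry for
the user-defined area, the byte pair 0xF0 0x40 stands in for one of Acme
Corp's user-defined characters, and survives_round_trip is a made-up helper.
Other converters may behave differently (e.g., map the area to private-use
code points).

```python
def survives_round_trip(data: bytes, encoding: str) -> bool:
    """True if data comes back byte-for-byte after going through Unicode."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # The converter has no mapping for some byte sequence in the data.
        return False

standard = "\u65e5\u672c\u8a9e".encode("shift_jis")  # ordinary JIS X 0208 text
user_defined = b"\xf0\x40"  # first cell of the SJIS user-defined area

assert survives_round_trip(standard, "shift_jis")
assert not survives_round_trip(user_defined, "shift_jis")
```

Without the conversion step the two bytes are passed along untouched, which
is what gives them a chance of displaying correctly on the intended system.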

Extensibility: If we commit to a UCS-2 internal representation, it could
possibly restrict the support of some encodings. I may need to handle UCS-4
data. Periodically I hear that Chinese will someday exceed the 2-byte
limitation. Is this real? Do we care? I suppose we could use UCS-4 for
internal representation...
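
For reference, the 2-byte limitation in sketch form. The helper names are
made up, and the value 0x20000 is used purely as an example of a code
position above the 16-bit range, not as a claim about any particular
character.

```python
UCS2_MAX = 0xFFFF  # largest value a single 16-bit (2-byte) unit can hold

def fits_in_ucs2(code_point: int) -> bool:
    """True if the code position fits in one 2-byte UCS-2 unit."""
    return 0 <= code_point <= UCS2_MAX

def ucs4_bytes(code_point: int) -> bytes:
    """Big-endian 4-byte (UCS-4) form of a code position."""
    return code_point.to_bytes(4, "big")

assert fits_in_ucs2(0x4E2D)        # a CJK ideograph within the 2-byte range
assert not fits_in_ucs2(0x20000)   # example value above the 2-byte range
assert ucs4_bytes(0x20000) == b"\x00\x02\x00\x00"
```

A UCS-4 internal form removes the limit at the cost of doubling the memory
per character, which ties back to the resource usage concern above.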

>Shoehorning Unicode support onto this may be hard *in the
>short term*, but I think you'll find that as you add support for more
>and more coded character sets and encodings, you will eventually
>produce a system that does exactly this.

Depends where we are headed with HTML. Widechar (e.g., canonical Unicode)
has great advantages when a lot of text manipulation is performed. Currently
browsers perform very little text manipulation, so for the few cases
where they do, it's not terribly difficult to deal with.

>>But this is implementation detail, and should be of secondary importance
>>in defining our direction. Web content requirements are what should
>>be of primary importance.
>
>Quite.

I've just tried to point out that the implementation issues are not as
straightforward as you might think.

Let's put aside these implementation issues for now. They are important,
but I think the labelling issue is the more important issue for us to
grapple with now.

Regards,
Bob

--
Bob Jung        bobj@netscape.com       +1 415 528-2688, fax +1 415 528-4122
Netscape Communications Corp.   501 E. Middlefield      Mtn View, CA   94041