I would appreciate any and all comments. All errors of language or
fact are solely my responsibility (i.e. if I am making a fool of
anyone, I am making it of myself ;-)), and I would appreciate
correction.
An HTML version of this paper is available upon request.
Merry Christmas to all!
-------------------------
HANDLING MULTILINGUAL DOCUMENTS IN THE WWW
Gavin T. Nicol
Electronic Book Technologies, Japan
1-29-9 Tsurumaki, Setagaya-ku,
Tokyo 154,
Japan
+81-3-3706-7351
gtn@ebt.com
ABSTRACT.
The World Wide Web has enjoyed explosive growth in recent years, and
there are now millions of people using it all around the world.
Despite the fact that the Internet, and the World Wide Web, span
the globe, there is, as yet, no well-defined way of handling
documents that contain multiple languages, character sets, or
encodings thereof. In this document, a method is proposed for
cleanly handling such multilingual text on the WWW.
1. Requirements for multilingual applications
There are many issues facing a system claiming to be multilingual,
though all fall into one of three categories:
1. Data representation issues
2. Data manipulation issues
3. Data display issues
This document is primarily concerned with data representation, though
display issues are also discussed in some detail. Data manipulation is
discussed only briefly.
1.1 DATA REPRESENTATION ISSUES
In general, the major data representation issues are character set
selection, and character set encoding. The biggest problem with
character set selection is that most character sets are standards, and as Andy
Tanenbaum once noted:
The nice thing about standards is that there are so many to choose
from.
Character set encodings suffer in much the same way. There are a large
number of character set encodings, and the number is *not* decreasing.
Any application that claims to be multilingual must obviously
support the character sets, and encodings, used to represent the
information it is processing. It should be noted that multilingual
data could conceivably contain multiple languages, character sets, and
encodings, further complicating the problem.
1.2 DATA DISPLAY ISSUES
Quite obviously, for a viewing application to be considered
multilingual, it must be able to present multilingual data to the
reader in a sensible manner. The most common problem to overcome here is
font mapping; however, languages around the world have
different writing directions as well, and some languages have mixed
writing directions, which should also be handled "correctly."
Note that in the above paragraph "in a sensible manner" does not
necessarily mean "rendered in its native format." One other
possibility would be to render the multilingual document as a
phonetic transcription in ASCII. For example, if some Japanese text were
sent to a person who cannot read Kanji, Hiragana, or Katakana, the
browser could conceivably map the Japanese text into something like
the following:
Nihongo, tokuni Kanji, wa totemo muzukashii desu.
possibly with some extra text indicating that this is Japanese.
Another possibility is machine translation (which is becoming more
viable year by year).
1.3 DATA MANIPULATION ISSUES
In order to be multilingual, the application must be able to
manipulate multilingual data (including such issues as collation,
though this is not currently needed for browsing the WWW). One major
issue here is the representation of a character within the
application, and the representation of strings. In some applications,
multibyte formats are used throughout; in others, fixed-width, wide
characters are used; in still others, a combination of the two is used.
1.4 SUMMARY OF REQUIREMENTS
A multilingual application must be able to:
1. Support the character sets and encodings used to represent the
information being manipulated.
2. Present the data meaningfully if the application is required to
display the data.
3. Manipulate multilingual data internally.
2. Bringing multilingual capabilities to the WWW
Having established some basic requirements, it is now time to look at
how the above fits into the World Wide Web.
2.1 MIME ISSUES
One of the problems with representing multilingual documents in the
WWW is that MIME explicitly merges the character set and character set
encodings together. In fact, it is probably more accurate to say that
MIME specifies only the character set encoding, which in turn defines
a character set by implication. For example:
charset=unicode-1-1-utf-7
actually specifies the UTF-7 encoding of Unicode explicitly, but only
specifies Unicode implicitly.
In addition, the MIME specification states that for the text/* data
types, all line breaks must be indicated by a CRLF pair. This implies
that certain encodings cannot be used within the text/* data types if
the WWW is to be strictly MIME conformant.
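To make the line-break conflict concrete, the following small
demonstration (in Python, with the utf-16-be codec standing in for
UCS-2) shows a bare LF byte appearing inside an ordinary character,
where strict MIME text/* rules permit LF only as part of a CRLF pair:

    ch = "\u4e0a"                    # the single character U+4E0A
    raw = ch.encode("utf-16-be")     # b'\x4e\x0a' in a UCS-2-style encoding
    print(b"\n" in raw)              # True: an LF byte that is not a line break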
2.2 SGML ISSUES
SGML does not define the representation of characters. ISO 8879
defines a code set as a set of bit combinations of equal size, a
character as an atom of information, a character repertoire as a group
of characters, and a character set as a mapping of a character
repertoire onto a code set such that each character is represented by a
unique bit combination within the code set. As such, an SGML parser is
independent of the physical representation of the data, and there is
often an internal representation of characters that could be quite
different to that used in data storage.
ISO 8879 also defines some methods for handling things like ISO-2022,
but some encodings for languages such as Thai cannot be handled by
SGML, even if the SGML declaration is altered (though, it is possible
for the application to deal with this within, or before, the entity
manager).
2.3 BROWSER ISSUES
Basic requirements for a multilingual WWW browser were listed in
section 1.4. Let's now look at each as it applies to a WWW browser like
Mosaic.
2.3.1 Support the character sets and encodings
As noted above, there are a huge number of character sets, and
encodings. If the above requirement is taken literally, it means that
in order to have a multilingual WWW, each browser must potentially be
able to understand this huge number of character sets and encodings.
Taken at face value it also means that the SGML parser would need to
handle a large number of character sets, and one would not be able to
have multiple character sets within a document (or to be precise,
SGML provides no way for the parser to handle a given bit combination
representing more than one character within a given document). As
such, the brute force approach would almost certainly have to map the
multiple character sets and encodings to a single internal
representation (though it is conceivable that character sets could be
paged in and out, this approach would be very complicated to
implement). This internal representation would almost certainly have
to be 16 bits wide, or wider. SGML working group 8 has stated that
multilingual systems should map data storage encodings to wide
characters before, or in, the entity manager (and the sp parser from
James Clark serves as a good model of their recommendations).
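As a rough illustration of that recommendation (the function name and
the encodings chosen are purely illustrative, not taken from the
working group's text), an entity manager might decode the stored bytes
into the single internal wide-character form before the parser ever
sees them:

    def open_entity(raw_bytes, storage_encoding):
        # storage_encoding might be "euc-jp", "iso-2022-jp", "utf-8", ...
        # the parser downstream sees only uniform character codes
        return raw_bytes.decode(storage_encoding)

    text = open_entity(b"<P>Hello</P>", "iso-8859-1")
    # parse(text)   # a (hypothetical) parser working on wide characters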
2.3.2 Present the data meaningfully
One of the great benefits of SGML (and, to a lesser degree, HTML) is
that it is independent of the display technology. As such, the
presentation issue depends very much on the application (for example,
rendering multilingual documents on a TTY might require the
aforementioned phonetic rendering, while GUI based systems require
font mapping).
A common thread running through all display technologies is the need
for some mapping from the character codes to some output
representation. While not required, using 16 bit (or greater) codes
will probably simplify the mapping task as it requires only a single
lookup table.
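A toy sketch of that single-table lookup follows; the table contents
and glyph indices are invented, and a real system would map ranges of
codes to fonts rather than individual characters:

    font_for_code = {
        0x0041: ("latin-font", 0x41),      # 'A'
        0x3042: ("japanese-font", 0x01),   # Hiragana 'a' (index is made up)
    }

    def glyph(ch):
        # one lookup, keyed directly by the 16-bit character code
        return font_for_code.get(ord(ch), ("fallback-font", 0))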
2.3.3 Internal multilingual data manipulation
There are a large number of issues here, but in general, having a
fixed-width character representation eases the task of manipulating
the data, especially in languages such as C, because it makes memory
management and indexing easier. In addition, fixed-width codes require
no synchronisation to find character boundaries.
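The contrast with a multibyte encoding is easy to see (the snippet
below uses Python, with the utf-16-be codec standing in for a 16-bit
fixed-width internal form):

    text = "日本語"                      # three characters of Japanese

    utf8 = text.encode("utf-8")          # multibyte: 9 bytes, boundaries need scanning
    wide = text.encode("utf-16-be")      # fixed width: 6 bytes, 2 per character

    # With fixed-width codes, character i starts at byte offset 2*i, so
    # no scanning or synchronisation is needed to find it.
    second = wide[2:4].decode("utf-16-be")
    print(len(utf8), len(wide), second)  # 9 6 本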
3. Is Unicode the answer?
In a word, YES! There are, however, a number of issues that need to be
resolved in order for it to be used effectively.
3.1 WHAT ARE THE BENEFITS?
First, let's step back and look at the overall architecture.
The desired features in a multilingual WWW system are:
1. Allow publishers to create and manipulate documents in the local
character set and character encoding scheme
2. Allow readers to follow a URL to a document, and expect to be able
to read it irrespective of the character set and encoding used for
document creation.
3. Allow readers to save the document in their preferred local
encoding.
4. Scalability. Will the solution work even when large numbers of
languages, character sets, and encodings are used within a single
document?
5. Keep implementations as simple as possible.
Unicode solves most of these items quite nicely.
The second and fourth points are covered because Unicode provides
codes for most languages of the world. It does not cover all languages
completely, but it certainly contains enough for most common uses, and
there are ways for handling the uncommon cases.
The last point is covered because the UCS-2 encoding of Unicode is a
16-bit, fixed-width encoding. As pointed out above, using such
fixed-width characters simplifies SGML parsing, display, and data
manipulation.
The first and third points are not covered directly, but simple
translation techniques can be used to achieve them as outlined below.
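A hedged illustration of such a translation follows; EUC-JP is used as
an example local encoding, and any encoding with a round trip through
Unicode would serve equally well:

    stored = "日本語".encode("euc-jp")              # publisher works in a local encoding

    text = stored.decode("euc-jp")                  # server: local encoding -> Unicode
    wire = text.encode("utf-7")                     # server: one of UTF-7/UTF-8/UCS-2

    saved = wire.decode("utf-7").encode("euc-jp")   # reader: back to a preferred
    print(saved == stored)                          # local encoding; prints True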
3.2 INCORPORATING UNICODE INTO THE WWW
The following outlines a proposal for incorporating Unicode into the
WWW in such a way that all of the above points are solved. Where the
word server is used, it should be taken to mean a HyperText Transfer
Protocol (HTTP) server unless specifically qualified otherwise.
3.2.1 Unicode incorporation architecture
In order to make multilingual support as painless as possible, it is
proposed that all HTTP servers for multilingual documents *should* be
able to convert documents from the local character set encoding to
UCS-2, UTF-8, and UTF-7 (16, 8 and 7 bit encodings of Unicode). It is
also proposed that all HTTP clients *should* be able to parse UCS-2,
UTF-8 and UTF-7. It is *recommended* that browsers allow the data to be
saved as UTF-7, UTF-8, or UCS-2 (similar to the current ftp
interface). If possible, a browser *should* also allow the data to be
saved in the local character set encoding, but that might not always
be possible (for example, saving a document containing Arabic on an
ASCII based system). Documents sent from servers would then use a
content type of:
Content-Type: text/...; charset=UNICODE-1-1-UTF-7
Content-Type: text/...; charset=UNICODE-1-1-UTF-8
Content-Type: text/...; charset=UNICODE-1-1-UCS-2
Note that UTF-8 and UCS-2 will need some additional encoding applied
to them in order to be strictly MIME compliant. An alternative is to use
an application/* type specifier instead.
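A minimal server-side sketch of the conversion is given below; the
helper name and charset labels are illustrative only, and Python's
utf-16-be codec again stands in for UCS-2:

    def to_unicode_form(local_bytes, local_charset, target="utf-7"):
        labels = {"utf-7": "UNICODE-1-1-UTF-7",
                  "utf-8": "UNICODE-1-1-UTF-8",
                  "utf-16-be": "UNICODE-1-1-UCS-2"}
        body = local_bytes.decode(local_charset).encode(target)
        return "Content-Type: text/html; charset=%s" % labels[target], body

    header, body = to_unicode_form("<P>日本語</P>".encode("euc-jp"), "euc-jp")
    print(header)      # Content-Type: text/html; charset=UNICODE-1-1-UTF-7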
This architecture has the following benefits:
Conceptual Simplicity
The model is very clear conceptually: a document is created in
the local encoding, which is then converted into a commonly
understood form. Client-side processing occurs using this
representation.
Ease of client implementation
The number of clients far exceeds that of servers. Thus,
forcing clients to deal with a large number of character sets
and encodings requires more effort, and the lag in update
period will be greater. Imagine, for example, that someone wants
to add support for some new character set encoding they have
invented. In order for support to be added to clients, one
would have to update all clients that could potentially access
such data. Compare this to having to update only the servers
working with that encoding directly. In addition, because
Unicode and UCS-2 are already well defined, it is possible to
write an SGML parser, display subsystem, and data manipulation
functions optimised for that representation, realising
significant performance gains.
While initial implementations will need to write the code for
handling Unicode, and the encodings thereof, it is expected
that freely distributable libraries for such things will appear.
MIME and ASCII compatibility
UTF-7 and UTF-8 are largely ASCII compatible. In addition, UTF-7
was designed as a method for encoding Unicode in MIME, so it is
perfect for WWW and MIME uses.
HTML compatibility
Recent drafts of the HTML specification state that a MIME
charset parameter will override the default ISO-8859-1, so this
presents no problem. In addition, it should not require large
changes to current HTML parsers.
Truly multilingual
As noted, this proposal uses Unicode. All languages defined
within ISO-10646 can be used within a single document. SGML
(HTML) parsers will work with UCS-2 directly, so different
languages would be treated identically at the parsing level.
While the cost of performing translations from the local encoding to
one of the Unicode encodings might appear prohibitive, it is believed
that intelligent servers will cache translated documents in a manner
similar to current proxy caches. Between this, and the fact that most
WWW documents are small, the performance hit should not be overly
significant.
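Such a cache might look roughly like the following deliberately naive
sketch (no expiry, no size limit, and file-based documents are
assumed):

    cache = {}

    def serve(path, local_charset, target):
        key = (path, target)
        if key not in cache:                # translate once, reuse thereafter
            with open(path, "rb") as f:
                cache[key] = f.read().decode(local_charset).encode(target)
        return cache[key]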
3.2.2 Extension to the basic architecture
Requiring that *all* non-ASCII documents be converted to Unicode is
probably a very poor idea as it would incur significant overhead.
Instead, HTTP clients can indicate encoding preferences via the
Accept: field in the request header. For example:
Accept: text/html; charset=iso-2022-jp
Accept: text/html; charset=unicode-1-1-ucs-2
Accept: text/*; charset=unicode-1-1-utf-7
If a server is able to deliver the document in one of the preferred
encodings, it should do so. This will allow clients and servers
sharing a common local encoding to transfer documents without the
overhead of Unicode translation. Note that most encodings will need
additional encoding to strictly conform to the text/* MIME types.
Assuming that all clients are indeed able to parse UTF-7, UTF-8, and
UCS-2, the server should default to delivering multilingual documents
in one of these encodings as it will provide the greatest probability
that the client receives something it can meaningfully process.
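The negotiation itself can be quite simple. The sketch below invents a
helper for illustration and omits the parsing of the Accept: fields; it
prefers a shared local encoding and otherwise falls back to a Unicode
encoding:

    def choose_charset(client_charsets, server_charsets,
                       fallback="unicode-1-1-utf-7"):
        for preferred in client_charsets:   # charsets named in the Accept: fields
            if preferred in server_charsets:
                return preferred            # shared local encoding: no translation
        return fallback                     # otherwise one of the Unicode encodings

    print(choose_charset(["iso-2022-jp"], {"iso-2022-jp", "euc-jp"}))
    # iso-2022-jp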
3.2.3 Accept-charset
While the charset parameter is sufficient for implementing the
extended architecture, it requires that for each MIME type in which
character set encoding negotiation is desired, a charset parameter
must be defined. Also, the server must be able to parse all the type
specifications, and make meaningful decisions based upon them. As the
number of deliverable types increases, so does the complexity of the
server format negotiation subsystem.
It seems desirable to be able to say to the server "for all data in
which charset is meaningful, send it to me encoded as xxxxx", as this
would tend to simplify decoding the data into the application's
internal character representation. This can be accomplished by
wildcarding the Accept: field, but a somewhat cleaner alternative
would be to have an additional field, Accept-charset:, which sets the
default encoding; requests like:
Accept: text/html; charset=iso-2022-jp
could be used to decide the next best encoding if that of
Accept-charset: cannot be delivered. Such cases would probably be
uncommon if one assumes that multilingual data will be sent as Unicode.
3.2.4 Presentational hints for Unicode
While Unicode certainly serves as an excellent lowest common
denominator for multilingual documents, systems using Unicode require
more information than that contained in the character codes
themselves. Probably the most well known example of this is in the Han
unification used in Unicode. Unicode defines codes for characters that
are shared between Chinese, Korean, and Japanese, but the glyph images
used in each language are different. Hence, we need to know the
language in which the character code occurred in order to display it
"correctly". Another interesting case occurs in corporate names in
Japan. It is common for Japanese corporations to use a slightly
different glyph image for the characters that make up the company name
as a way of distinguishing themselves. Again, we need extra
information to map the base code onto the correct glyph image. Such
data are referred to as presentational hints within this document.
Given that a conversion from the local character set to Unicode is
being performed by the server, and that this conversion is automatic,
it seems possible for the conversion process to automatically include
presentation hints in the converted output. Applications that
understand the hints can use them to improve the conversion resolution
where necessary, while other applications can simply ignore them, or
remove them from the data stream. (Strictly speaking, presentational
hints are not necessary as in most (perhaps all) cases, the text will
be legible, even if the glyph image is not quite correct. Rather, they
are desirable for top quality, and especially typographic quality,
output.)
The problem of representing presentational hints is a difficult one.
Obviously, it is better to represent such data as tags, rather than as
codes, and HTML 3.0 includes a <LANG> tag in the DTD specifically for
this.
However, high-level tag use (e.g. defining them in a DTD) fails for
the following reasons:
1. It is not transparent. The application processing the data stream
must be able to parse the tags, even if it cannot do anything
with them. This necessarily complicates the parser.
2. There are probably a huge number of presentation hints that could
be used, and the list is dynamic as societal trends tend to alter
languages. Good examples can be found by comparing almost any
current written form of a language to that used 100 years ago.
Some languages have even changed dramatically in the last 50
years.
This argues for a low-level tag which is basically transparent to
anything parsing the input data stream. This in turn implies that the
presentational hints either take effect before the parser, or that
they can be manipulated unambiguously as data (or that they can
unambiguously be removed from the data stream).
This paper will not attempt to define a format for presentation hints.
Rather, three methods are outlined below, and it is hoped that subsequent
discussion will lead to a decision as to which is more applicable to
the WWW.
Method 1: Code-based presentation hints
Here, codes from the Private Use Area are allocated to represent
presentational hints. The advantages of this method are that hints and
data can be treated identically, and that hints can be removed
transparently. The disadvantages are that it stops other applications
from using the Private Use Area, and also, the Private Use Area has a
limited range. In addition, this "pollutes" the character set with
non-character data.
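The transparent-removal property can be sketched in a few lines; the
use of the entire Private Use Area (U+E000 to U+F8FF) as hint space is
an assumption made only for the example:

    def strip_hints(text):
        # drop anything allocated from the Private Use Area
        return "".join(ch for ch in text if not (0xE000 <= ord(ch) <= 0xF8FF))

    tagged = "\ue001" + "日本語"             # hypothetical hint code before Japanese text
    print(strip_hints(tagged) == "日本語")   # True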
Method 2: Encoding-based presentation hints
Here, an encoding is used which has space for presentation hints. An
example of this is the ICODE encoding proposed by Masataka Ohta which
uses 21 bits. Mr. Ohta also defines an encoding called IUTF which is
upwardly compatible with UTF-2. This method has all of the advantages
of method 1, but would require at least 21 bits of storage per
character.
Method 3: Tag-based presentation hints
Here, tags are defined to represent presentational hints. One tag
might potentially serve multiple purposes (for example, a LANG tag can
serve to specify any language). The key difference between this
method, and the high-level tag method is that tag interpretation here
occurs before the application proper sees the data. As such, the tags
can be removed transparently. This method would require a certain
amount of API support for handling tags, probably based on callbacks. In
addition, it would appear that one code from the Private Use Area will
be needed in order to make tag identification completely unambiguous,
for all data streams.
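A speculative sketch of such a callback interface follows; the hint
syntax (a Private Use Area code, here U+E000, bracketing a name=value
pair) and all of the names are invented purely to show the shape of the
API:

    handlers = []

    def on_hint(callback):
        handlers.append(callback)           # the application registers interest in hints

    def preprocess(stream):
        out, i = [], 0
        while i < len(stream):
            if stream[i] == "\ue000":       # opening hint marker
                j = stream.index("\ue000", i + 1)
                for cb in handlers:
                    cb(stream[i + 1:j])     # e.g. "lang=ja"
                i = j + 1                   # the hint never reaches the application proper
            else:
                out.append(stream[i])
                i += 1
        return "".join(out)

    on_hint(lambda hint: None)              # a handler that simply ignores hints
    print(preprocess("\ue000lang=ja\ue000日本語"))   # prints 日本語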
4. Summary
To summarise, this document proposes the following:
1. All servers of multilingual documents should be able to convert
documents from their local encoding into UCS-2, UTF-8, and UTF-7.
2. All clients should be able to at least parse such data.
3. Clients and servers should also be able to directly transfer data
if they share local encodings.
4. A method should be decided upon which allows presentational hints
to be inserted into the data stream to aid in glyph image
disambiguation.
It is believed that, given such an architecture, the World Wide Web
will become truly multilingual, and truly, a World Wide Web.
5. Discussion
The following are a few notes on recent developments which have some
bearing on the contents of this document.
5.1 NOTES ON RELAXED CONTENT PARSING IN HTTP
A recent development is that the HTTP Working Group has basically
decided that HTTP will not require strict MIME conformance for textual
data types. In effect, the recent decision says that the parsing of
the message data should be done in accordance with the specified
character set encoding. This thereby allows multilingual servers to
send any encoding the client claims to understand, including UCS-2,
the 16 bit encoding of Unicode. This should simplify data processing
enormously.
5.2 EXTENDED REFERENCE CONCRETE SYNTAX
Recently, a proposal for an extended reference concrete syntax for
SGML was sent to SGML Open. In this ERCS, the SGML declaration defines
that the BASESET is "ISO 10646:199?//CHARSET UCS-2//EN", and that the
DESCSET parameter gives 16 bits for each character.
5.3 ISOLATING APPLICATIONS FROM LANGUAGE AND CHARACTER REPRESENTATION ISSUES
The author believes that most applications, and especially those based
on SGML, are basically independent of the underlying language and
character representations. Obviously, characters need to be assigned
to sets for parsing purposes (like LCNMSTRT for example), but most
applications should never need to know anything more about a character
until the glyph image, or information about the glyph image, is
required. In other words, most parsers and processors needn't be
language-aware, but a text display system probably does. This proposal
tries to emphasise this as much as possible by trying to provide
a uniform character code stream for the application to process.
5.4 UNICODE DATA: DATA MANIPULATION AND FONT HANDLING
For an excellent look at the issues involved, and one possible
solution to them, it is well worth reading the papers about the Plan 9
system from Bell Laboratories.
6. Bibliography
The Plan 9 Papers
ftp://research.att.com/
East Asian Character Set Issues:
A Proposal For An Extended Reference Concrete Syntax
Rick Jelliffe
Allette Systems
Sydney, Australia
The SGML Handbook
Oxford University Press
Written by Charles Goldfarb
ISBN 0-19-853737-9
The Unicode Standard, Version 1.1
Version 1.0, Volume 1, ISBN 0-201-56788-1
Version 1.0, Volume 2, ISBN 0-201-60845-6
Using Unicode with MIME
D. Goldsmith
http://ds.internic.net/rfc/rfc1641
UTF-7: A Mail Safe Transformation Format of Unicode
D. Goldsmith and M. Davis
http://ds.internic.net/rfc/rfc1642
MIME (Multipurpose Internet Mail Extensions) Part 1
N. Borenstein and N. Freed
http://ds.internic.net/rfc/rfc1521.ps
MIME (Multipurpose Internet Mail Extensions) Part 2
K. Moore
http://ds.internic.net/rfc/rfc1522.txt
Hypertext Transfer Protocol -- HTTP/1.0
T. Berners-Lee, R. T. Fielding, H. Frystyk Nielsen
ftp://ds.internic.net/internet-drafts/draft-fielding-http-spec-01.txt
_______________________________________________________________________________
This document in no way reflects the opinions of Electronic Book
Technologies. All opinions contained herein are solely those of the
author.
_______________________________________________________________________________