Re: Globalizing URIs

Peter Svanberg (psv@nada.kth.se)
Sun, 6 Aug 95 16:23:42 EDT

On Thu, 3 Aug 1995, Jon Knight wrote:

> On Thu, 3 Aug 1995, Gavin Nicol wrote:
> However, I'd rather see something tacked onto the _end_ of the URL rather
> than at the start of the opaque data section. Maybe something like:
>
> http://www.jacme.co.jp/%B0%F5%BA%FE.html;charset=EUC
>
> It just seems more inkeeping with other things I've seen suggested in the
> past.

I can't refrain from reminding you about what we -- three
IETF:ers from Sweden -- suggested on this subject already in
May 1993 (and which nobody took any notice of...):

------
Date: Tue, 11 May 93 10:36:58 +0200
Message-Id: <9305110836.AA06718@mercutio.admin.kth.se>
From: Rickard Schoultz <schoultz@othello.admin.kth.se>,
Olle Jarnefors <ojarnef@admin.kth.se>,
Peter Svanberg <psv@nada.kth.se>
Sender: Olle Jarnefors <ojarnef@admin.kth.se>
To: uri@bunyip.com
In-Reply-To: <9305080145.AA05811@wilma.cs.utk.edu>
(Fri, 07 May 1993 21:45:02 -0400, Keith Moore <moore@cs.utk.edu>)
Subject: Re: Wrappers for URLs

:
:
As has already been pointed out in this discussion, non-ASCII
characters are used a lot in many countries outside USA. In a
language using the Latin script like Swedish almost 1/3 of all
words contain non-ASCII letters. In languages such as Greek,
Russian, Hindi and Chinese no ASCII letters at all are used.
For ordinary users in these countries it will be unacceptable to
see their everyday letters represented by %-headed hexadecimal
digit sequences.

We propose this solution:

C) In the human form of URLs non-ASCII characters may be used
provided a character set indicator is added to the URL
immediately before the closing ">". This indicator shall
have the syntax
"%:" charset
where "charset" is a value registered by IANA for MIME use.
The corresponding coded character set defines the mapping of
the non-ASCII character to a sequence of octets that can be
represented by the %-mechanism in the program form of the
URL.

Take as an example a file called

l<a">s mig

in the directory pub at host othello.admin.kth.se. Here <a">
is the character LATIN SMALL LETTER A WITH DIAERESIS, coded by
the octet E4 hex according to ISO-8859-1. (The name of the file
is Swedish for "read me".)

The human form URL for this file preferred in Sweden would be

<ftp://othello.admin.kth.se/pub/l<a">s mig%:iso-8859-1>

<a"> here would in reality be the non-ASCII character.

The program form would be

<ftp://othello.admin.kth.se/pub/l%e4s mig>

Why is it necessary to include a character set indicator in
these extended URLs containing non-ASCII characters? It's
because that makes the URL resistent to character-preserving
conversion between different coded character sets.

Say that the URL in the example is included in a text file coded
with ISO-8859-1, so the non-ASCII character is represented by
the octet E4. Then this file is transferred to a Mac and
therefore converted to the Macintosh character set. For the Mac
user it will look exactly as intended, containing a lowecase a
with diaeresis (which is necessary to form the Swedish
expression for "read me"). In the Mac file, however, this
letter will be represented by 8A instead of E4. Thanks to the
character set indicator %:iso-8859-1 a client program on the Mac
will however be able to feed the right octet E4 to the FTP
program to fetch the correct file from the FTP server (where
ISO-8859-1 is used in file names).

It could be argued that occasional non-ASCII letters is nothing
to make a fuss about: European users can be taught to read and
input %-sequences instead. But consider a Greek FTP server,
where almost the whole path of a file is written with Greek
letters using ISO-8859-7. In that case URLs will be almost
three times as long and consist of mostly a soup of "%" and
hexadecimal digits, interspersed with "/". Such URLs will be
unusable for humans, unless some way of using the real non-ASCII
letters is provided.

Another points in connection with internationalized URLs that
we would like to raise:

D) The hexadecimal %-headed representation used in the program
form is very inefficient. In countries with languages using
other scripts than the Latin URLs may be almost three times
as long as in English-speaking countries. To reduce this
unfairness we could, in addition to the %-representation.
include a &-representation: After "&" would follow a sequence
of octets encoded by a BASE64-like method into a 33 %
(instead of 300 %) longer sequence of the characters A-Z,
a-z, 0-9, + and -. This sequence would be ended by a
second "&".

--
Rickard Schoultz                        schoultz@admin.kth.se
SUNET/KTH                               +46-8-790 90 88   (voice)
S-100 44 Stockholm (SWEDEN)             +46-8-10 25 10    (fax)

Olle Jarnefors Internet: ojarnef@admin.kth.se Information Management Services UUCP: ...!uunet!mcsun!sunic!kth!ojarnef Royal Institute of Technology (KTH) BITNET: ojarnef@sekth Fax:+46 8 10 25 10 S-100 44 Stockholm, Sweden Phone: +46 8 790 71 26 (time zone +0200)

Peter Svanberg, NADA, KTH Email: psv@nada.kth.se Dept of Num An & CS, Royal Inst of Tech Phone: +46 8 790 71 40 S-100 44 Stockholm, SWEDEN Fax: +46 8 790 09 30