How to specify 10646 as document char set

Glenn Adams (glenn@stonehand.com)
Mon, 1 May 95 11:46:41 EDT

Date: Mon, 1 May 1995 01:19:30 +0500
From: connolly@w3.org (Dan Connolly)

OK, here's the real reason I don't want to put ISO10646 in the SGML decl
for 2.0: I don't know how to do it, and I don't have tools to test it.

At a minimum, all you have to do to change the SGML declaration to
support 10646 is to add the following two lines to the current SGML
declaration's document character set clause immediately after the
declaration of the Latin1 baseset (which describes character numbers
160...255):

BASESET "ISO Registration Number 176//CHARSET
ISO/IEC 10646-1:1993 UCS-2 with implementation level 3//ESC 2/5 2/15 4/5"
DESCSET 256 65280 256
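
The designating escape sequence in the BASESET line is written in ISO 2022
column/row notation, where c/r stands for the byte value 16*c + r. As a rough
illustration (the function name is hypothetical and not part of any SGML
tool), the notation can be turned into the actual bytes like this:

```python
def decode_designating_sequence(seq: str) -> bytes:
    """Convert column/row notation such as 'ESC 2/5 2/15 4/5' to raw bytes."""
    out = bytearray()
    for token in seq.split():
        if token == "ESC":
            out.append(0x1B)  # the ESCAPE control character
        else:
            column, row = (int(part) for part in token.split("/"))
            out.append(column * 16 + row)
    return bytes(out)


print(decode_designating_sequence("ESC 2/5 2/15 4/5"))  # b'\x1b%/E'
```

So the sequence for registration number 176 is simply ESC followed by the
three characters "%/E".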

Given this change, you end up with three basesets which define the following
ranges of character numbers:

0 ... 127 ISO 646:1993 (IRV) -- i.e., ASCII
128 ... 255 ISO 8859-1 GR
256 ... 65535 ISO 10646-1:1993 UCS-2
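
The arithmetic behind the new DESCSET line can be checked with a small sketch.
It takes the three ranges exactly as listed in the table above (a
simplification: the real declaration subdivides the first two ranges further)
and confirms that they are contiguous and end at 65535. The names used here
are illustrative only.

```python
# (first character number, count) for each range in the table above.
DESCRIBED_RANGES = [
    (0, 128),      # ISO 646 IRV (ASCII)
    (128, 128),    # ISO 8859-1
    (256, 65280),  # ISO 10646-1 UCS-2, from "DESCSET 256 65280 256"
]


def covered_span(ranges):
    """Return (first, last) if the ranges are contiguous, else raise ValueError."""
    expected_start = ranges[0][0]
    for start, count in ranges:
        if start != expected_start:
            raise ValueError(f"gap or overlap at character number {start}")
        expected_start = start + count
    return ranges[0][0], expected_start - 1


print(covered_span(DESCRIBED_RANGES))  # (0, 65535)
```

The count 65280 is just 65536 - 256, i.e. everything in UCS-2 above the first
256 character numbers.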

When you run a document with this declaration through SGMLS and NSGMLS, you
get a warning and an error from the former and no problems with the latter:

% sgmls -p test.html
sgmls: Warning at test.html, line 28 in declaration parameter 37:
Unrecognized designating escape sequence "ESC 2/5 2/15 4/5"
sgmls: SGML error at test.html, line 29 in declaration parameter 41:
Character number plus number of characters exceeds 256

% nsgmls -p test.html

Now, to eliminate these messages from SGMLS, apply the following patches
(and #define PERMIT_OUT_OF_RANGE_DESCRIBED_CHARS to 1 in config.h):

-------------------------------------------------------------------------

*** sgmldecl.c.orig Mon May 1 10:32:42 1995
--- sgmldecl.c Mon May 1 11:19:08 1995
***************
*** 256,261 ****
--- 256,262 ----
static struct pmap charset_map[] = {
{ "ESC 2/5 4/0", (UNIV)asciicharset }, /* ISO 646 IRV */
{ "ESC 2/8 4/2", (UNIV)asciicharset }, /* ISO Registration Number 6, ASCII */
+ { "ESC 2/5 2/15 4/5", (UNIV)systemcharset }, /* ISO Registration Number 176, ISO/IEC 10646-1:1993 UCS-2 Implementation Level 3 (Unicode) */
{ SYSTEM_CHARSET_DESIGNATING_SEQUENCE, (UNIV)systemcharset },
/* system character set */
{ 0 }
***************
*** 531,541 ****
--- 532,550 ----
sderr(E_CHARDESC, ltous(start), (UNCH *)0);
return FAIL;
}
+ #if ! PERMIT_OUT_OF_RANGE_DESCRIBED_CHARS
if (start + count > 256)
sderr(E_CHARRANGE, (UNCH *)0, (UNCH *)0);
else {
+ #else
+ {
+ #endif
int i;
int lim = (int)start + count;
+ #if PERMIT_OUT_OF_RANGE_DESCRIBED_CHARS
+ if ( lim > 256 )
+ lim = 256;
+ #endif
for (i = (int)start; i < lim; i++) {
if (status[i] != UNDESC)
sderr(E_CHARDUP, ltous((long)i), (UNCH *)0);

-------------------------------------------------------------------------
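
To make the effect of the second hunk easier to see, here is a small Python
model of the range check before and after the patch. It is only a sketch of
the logic in the diff, not sgmls code: it assumes the status table has 256
entries (as the limit in the patch implies), and the function name is made up.

```python
TABLE_SIZE = 256  # assumed size of the status table, taken from the patch's limit


def described_indices(start, count, permit_out_of_range):
    """Return the table indices that would be marked as described."""
    lim = start + count
    if not permit_out_of_range:
        # Unpatched behaviour: a range reaching past 256 is an error.
        if lim > TABLE_SIZE:
            raise ValueError(
                "Character number plus number of characters exceeds 256")
    else:
        # Patched behaviour: accept the range but clamp it to the table.
        lim = min(lim, TABLE_SIZE)
    return range(start, lim)


# "DESCSET 256 65280 256" with the patch enabled marks nothing in the table:
print(len(described_indices(256, 65280, permit_out_of_range=True)))  # 0
# The Latin1 range 160...255 is unaffected either way:
print(len(described_indices(160, 96, permit_out_of_range=False)))    # 96
```

In other words, the patched parser accepts the declaration but still only
tracks character numbers below 256.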

Once you apply the above, you will be able to parse the new SGML declaration
which supports 10646. If you use sgmls or a non-MULTIBYTE build of nsgmls to
parse a file containing a numeric charref > 255, then you will see something
like the following output (I put &#256; in the text):

% sgmls -s test.htm
sgmls: SGML error at test.htm, line 785 at ";":
Numeric character reference exceeds 255; reference ignored

% nsgmls -s test.htm
test.htm:785:29:E: `256' is not a valid character number

With the MULTIBYTE build of nsgmls you will not receive the latter
message.
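
The difference between the builds comes down to the largest character number
the parser can represent. A simplified sketch of that check follows; it
assumes decimal references terminated by ";" only, which is narrower than what
SGML allows, and the names are hypothetical.

```python
import re

CHARREF = re.compile(r"&#([0-9]+);")


def invalid_charrefs(text, max_char_number):
    """Return the numeric character references that exceed the parser's limit."""
    numbers = (int(m.group(1)) for m in CHARREF.finditer(text))
    return [n for n in numbers if n > max_char_number]


sample = "A&#256;B&#233;"
print(invalid_charrefs(sample, 255))    # [256]  (8-bit parser)
print(invalid_charrefs(sample, 65535))  # []     (16-bit parser)
```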

-------------------------------------------------------------------------

Given that few existing browsers actually parse or use the SGML Declaration
as specified, I don't think there is any problem in making this change.
Furthermore, the browsers that do parse the declaration can make the above
changes quite easily as I've shown with SGMLS.

The changes described above are the minimum needed to extend the standard
HTML document charset out to full 10646 UCS-2. Further changes can be made
in the future to facilitate greater use of 10646, e.g., the use of
characters outside of the ASCII repertoire for markup.

I would urge folks to adopt the above minimum changes as a first step towards
I18N of the Web. It has little or no impact on existing practice (other than
clearing up what numeric char refs refer to), and, at the same time, it
provides the essential ingredient for moving forward. Getting this change in
the first RFC (for 2.0) will be an excellent way to promote this movement
without impeding current implementation.

If you accept this change, I'd be happy to contribute to writing or editing
any related text in the RFC.

Regards,
Glenn

[P.S. in the current SGML declaration for HTML you should remove 255 as a
shunned character since it is a valid Latin 1 data character (LATIN SMALL
LETTER Y WITH DIAERESIS); also do not specify it as UNUSED in the descset.]