Re: case sensitivity in tags?

Paul Grosso (paul@arbortext.com)
Wed, 10 May 95 07:14:22 EDT

> From: connolly@w3.org (Dan Connolly)
>
> Paul Grosso writes:
> > > From: connolly@w3.org (Dan Connolly)
> > >
> > > > Is this an SGML problem?
> > >
> > > Yes.
> >
> > What is the (perceived) problem? That 8879 (the SGML standard) doesn't
> > cover this (which isn't true) or that what it does say needs explaining?
>
> Er... I meant "Yes, this is a problem where the HTML spec inherits
> from SGML," not "Yes, this shows some problem with the SGML spec."
>
> For example, you can experiment via the HTML validation service
> to see which attribute values are case sensitive and which are not.

> From: Tim Pierce <twpierce@midway.uchicago.edu>
> To: paul@arbortext.com
>
> > What is the (perceived) problem? That 8879 (the SGML standard) doesn't
> > cover this (which isn't true) or that what it does say needs explaining?
>
> Apparently the latter. I was the one who first raised the
> question with Dan. Now, I admit right out front that I'm
> not conversant with SGML, but since that's likely to be true
> of a large proportion of the audience of this Internet
> draft, it strikes me as important to clarify it elsewhere in
> the document.
>
> I asked if it was an "SGML problem" because I didn't know
> whether SGML would permit individual DTDs to specify such
> things as the case of the attributes. As you seem to
> observe, that isn't so.
>

I'll say a few more words in case it helps some people. I'll try
not to be too technical (though it'll be at the expense of precision).
As far as whether any such discussion is needed in the HTML 2.0 spec,
I leave that decision to others (my vote is not to bother, but I don't
feel strongly if consensus is to add something--the problem is that,
if you are going to add something to the spec, it's going to have to
be more rigorously and carefully worded than what I've got below, and
that could start another whole round of discussion).

Things that are tokenized in SGML *may* be case insensitive while
things that are treated as untokenized strings of data characters
have their case preserved (i.e., are case sensitive). Names of
elements and attributes and entities (among other things) are
tokenized; so are attribute values [technically, "attribute value
literals"] for attributes whose "type" (declared value) is not CDATA.

Tokenized names are divided into two categories for the purpose of
case sensitivity: general names (which include element and attribute
names and most other name tokens including any in tokenized attribute
values other than those that represent entity references) and names of
entities. The SGML declaration allows the case sensitivity of each of
these two classes to be indicated via the NAMECASE specification part
of the concrete syntax. The very common (almost universal in practice)
assignment is NAMECASE GENERAL YES ENTITY NO, which means that general
names are case insensitive (i.e., all lowercase letters in the name
will be converted to uppercase by the parser) and entity names are
case sensitive. This is the setting in the HTML 2.0 SGML declaration.
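
As a rough illustration of the NAMECASE rule just described, here is a
minimal Python sketch. It is not taken from any real SGML parser: the
names NAMECASE and fold_name are made up for this example, and it
assumes names contain only ASCII letters, so only a-z are folded.

```python
import string

# Hypothetical stand-in for the NAMECASE part of an SGML declaration.
# True means "YES" (fold to uppercase), False means "NO" (preserve case).
NAMECASE = {"GENERAL": True, "ENTITY": False}

# Assumption: only ASCII lowercase letters take part in the substitution.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def fold_name(name, name_class, namecase=NAMECASE):
    """Return the form of `name` that would be used for comparison.

    name_class is "GENERAL" (element names, attribute names, most other
    name tokens) or "ENTITY" (entity names).
    """
    if namecase[name_class]:
        # Substitution is on: lowercase letters are converted to uppercase,
        # so names differing only in case compare equal.
        return name.translate(_ASCII_UPPER)
    # Substitution is off: case is preserved, so comparison is case sensitive.
    return name


# Usage: general names match regardless of case, entity names do not.
print(fold_name("title", "GENERAL") == fold_name("TITLE", "GENERAL"))
print(fold_name("aacute", "ENTITY") == fold_name("Aacute", "ENTITY"))
```

Under the setting above, the first comparison holds and the second does
not, which is why an entity reference must use the exact case of its
declaration while a tag name need not.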

With NAMECASE GENERAL YES ENTITY NO, all the following are case insensitive:
element names, attribute names, attribute values for attributes whose type
is ID, IDREF(S), NAME(S), NMTOKEN(S), NOTATION, NUMBER(S), NUTOKEN(S); whereas
all of the following are case sensitive: entity names in entity declarations
and references and attribute values for attributes whose type is ENTITY or
ENTITIES. Data characters and attribute values for attributes whose type
is CDATA are always case sensitive.
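
The grouping above can be sketched as a small lookup keyed on an
attribute's declared value. This is only an illustration in Python: the
function values_match and its parameters are hypothetical, it assumes
ASCII-only values, and it ignores every other aspect of attribute value
processing.

```python
# Declared values whose tokens follow the GENERAL setting, as listed above.
GENERAL_TYPES = {
    "ID", "IDREF", "IDREFS", "NAME", "NAMES", "NMTOKEN", "NMTOKENS",
    "NOTATION", "NUMBER", "NUMBERS", "NUTOKEN", "NUTOKENS",
}
# Declared values whose tokens follow the ENTITY setting.
ENTITY_TYPES = {"ENTITY", "ENTITIES"}


def values_match(declared_type, a, b, general=True, entity=False):
    """Compare two attribute values the way the text above describes.

    general/entity mirror NAMECASE GENERAL and NAMECASE ENTITY
    (True = YES, False = NO); the defaults are GENERAL YES ENTITY NO.
    """
    if declared_type == "CDATA":
        # Untokenized character data: always case sensitive.
        return a == b
    if declared_type in GENERAL_TYPES:
        fold = general
    elif declared_type in ENTITY_TYPES:
        fold = entity
    else:
        raise ValueError("declared value not covered by this sketch")
    if fold:
        # Assumption: ASCII-only values, so str.upper() is a fair stand-in
        # for the parser's lowercase-to-uppercase substitution.
        a, b = a.upper(), b.upper()
    return a == b


# Usage, with the default GENERAL YES ENTITY NO setting:
print(values_match("NMTOKEN", "left", "LEFT"))       # tokenized general name
print(values_match("ENTITY", "logo", "Logo"))        # entity name
print(values_match("CDATA", "Foo.html", "foo.html")) # character data
```

Only the first of the three comparisons succeeds under the default
setting.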

As a somewhat tangential note of interest, minimum literals, like those
in Formal Public Identifiers (FPIs) such as "-//IETF//DTD HTML//EN",
"-//IETF//DTD HTML 2.0 Level 2//EN", and "ISO 8879-1986//ENTITIES Added
Latin 1//EN//HTML", are *case sensitive*; an FPI with the wrong case
will not match its intended target and therefore will not resolve properly.
Also note that these minimum literals are normalized by converting record
ends to spaces, then condensing all space sequences to a single space and
stripping leading and trailing spaces; however, inserting embedded spaces
or record ends where no space belongs (such as around any of the //'s)
will produce a different FPI that will no longer match its intended target.
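
The normalization just described can be sketched in a few lines of
Python. This is an illustration only, not the behavior of any particular
catalog or entity manager: the function names are made up, and it
assumes that "\r" and "\n" in a Python string stand in for record
boundaries.

```python
import re


def normalize_minimum_literal(literal):
    """Normalize a minimum literal as described above."""
    # Assumption: treat "\r" and "\n" as record boundaries and turn
    # each one into a space.
    text = re.sub(r"[\r\n]", " ", literal)
    # Condense every run of spaces to a single space.
    text = re.sub(r" +", " ", text)
    # Strip leading and trailing spaces.
    return text.strip(" ")


def same_fpi(a, b):
    """Two FPIs match only if their normalized forms are identical.

    The comparison is deliberately case sensitive.
    """
    return normalize_minimum_literal(a) == normalize_minimum_literal(b)


HTML_FPI = "-//IETF//DTD HTML//EN"

# Extra spaces and a line break where a space already belongs: still a match.
print(same_fpi(HTML_FPI, "  -//IETF//DTD\n   HTML//EN "))
# Wrong case: no match.
print(same_fpi(HTML_FPI, "-//ietf//dtd html//en"))
# Spaces inserted around a "//" where none belong: no match.
print(same_fpi(HTML_FPI, "-//IETF // DTD HTML//EN"))
```

Only the first comparison succeeds; normalization forgives extra
whitespace where a space already exists, but not changes of case or
whitespace introduced in new places.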

paul

Paul Grosso
VP Research, ArborText, Inc.
and
Chief Technical Officer, SGML Open

Email: paul@arbortext.com