REVISION: Character Data

Murray Maloney <murray@oclc.org>

Date: Thu, 23 Jun 94 08:54:31 EDT
Message-id: <9406230847.aa24756@dali.scocan.sco.COM>
Reply-To: html-ig@oclc.org
Originator: html-ig@oclc.org
Sender: html-ig@oclc.org
Precedence: bulk
From: Murray Maloney <murray@oclc.org>
To: Multiple recipients of list <html-ig@oclc.org>
Subject: REVISION: Character Data
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
X-Comment: HTML Implementation Group


What follows is a revised version of 

	4.3 Character Data

I have

	- rewritten several passages
	- added sub-secs on control chars and special chars
	- provided more extensive intro description
	- provided links to tables of entities and numeric refs
	- misc other stuff

I hope that everyone approves of the changes.

Save as part of "Text.html".
This section was extracted from the discussion of "Structured Text".

Murray

==================== CUT HERE ==============================
<H2><A NAME="Data">Character Data </A> </H2>

<P>
The characters between the tags represent text encoded
according to ISO 8859/1 8-bit single-byte coded graphic character set 
known as Latin Alphabet No. 1, or simply Latin-1. 
There are 256 character positions in the Latin-1 encoding.
Latin-1 includes characters from most Western European languages.
It consists of the space character, 186 characters that form
a subset of the graphic characters in ISO 6937/2 (1983),
and four additional characters that are intended for inclusion in ISO 6937/2.
<P>
The lower 128 character positions include a space,
33 control characters, the 52 upper- and lowercase
letters of the English alphabet (26 of each), 10 digits,
and 32 other printing characters.
This subset, functionally identical to ASCII,
is defined by ISO 646 7-bit coded character set for information interchange,
also known as the International Reference Version.
ISO 646 is identical in most respects to the ANSI standard for
ASCII (American Standard Code for Information Interchange).
The only significant difference between ISO 646 and ASCII is 
the specific names assigned to the control characters which
occupy positions 00-31 and 127.
<P>
The upper 128 positions include a non-breaking space,
a soft hyphen indicator, 93 graphical characters,
8 unassigned characters, and 25 control characters.
<EM>Not all HTML browsers recognize and interpret the
non-breaking space and soft hyphen indicator,
so their use is discouraged.</EM>
<P>
There are 58 character positions which are occupied by control characters.
See the discussion for details on the interpretation of
<A HREF="#ctlchars"> control characters. </A>
<P>
Because certain special characters are subject to interpretation 
and special processing, information providers and 
browser implementors should follow 
<A HREF="#spclchars">these guidelines</A>.
<P>
Certain characters may not be accessible from your
keyboard, or some part of your system (e.g., translation software)
may not be equipped to deal with 8-bit character codes.
HTML and many WWW browsers provide 
<A HREF="#charents"> character entity references </A> and
<A HREF="#numcharrefs"> numerical character references </A>
to facilitate the entry and interpretation of characters
by name and by numerical position.
<P>
Because certain characters will be interpreted as 
<A HREF="#markupchars">markup</A>,
they should be "escaped"; that is, represented by markup,
namely numeric character references or entity references.

<H3><A NAME="spclchars">Special Characters</A></H3>
Certain characters are taken to have special meaning
within the context of an HTML document.
There are two printing characters which may be interpreted
by the browser to have an effect on the format of the text.
<UL>
<H4> Space  </H4>
<UL>
<LI> Interpreted as a word space in all contexts except &lt;PRE&gt;.
<LI> Interpreted as a no-break space within &lt;PRE&gt;. </UL>

<H4> Hyphen </H4>
<UL>
<LI> Interpreted as a hyphen glyph in all contexts.
<LI> Interpreted as a potential word break by a hyphenation engine. </UL>
</UL>

<H3><A NAME="ctlchars">Control Characters</A></H3>
Control characters are non-printable characters 
that are typically used for communication and device control,
as format effectors, and as information separators.
<P>
In SGML applications, the use of control characters is
limited in order to maximize the chance of successful interchange
over heterogeneous networks and operating systems.
In HTML, there are only three control characters which are used.
The remaining 55 control characters are shunned and should not
appear in an HTML document.
The valid control characters and their interpretation are:
<UL>
<H4> Horizontal Tab (HT - 9 dec) </H4>
<UL>
<LI> Interpreted as a word space in all contexts except &lt;PRE&gt;.
<LI> Within &lt;PRE&gt;, the tab should be interpreted 
to shift the horizontal column position to the next position
which is a multiple of 8 on the same line;
that is, <CODE> col := col + 8 - (col mod 8) </CODE>
</UL>
<H4> Line Feed  (LF - 10 dec) </H4>
<UL>
<LI> Interpreted as a word space in all contexts except &lt;PRE&gt;.
<LI> Within &lt;PRE&gt;, the line feed should be interpreted 
as a shift to the start of a new line;
that is, <CODE> col := 0; row := row+1 </CODE>
</UL>
<H4> Carriage Return (CR - 13 dec) </H4>
<UL>
<LI> Interpreted as a word space in all contexts except &lt;PRE&gt;.
<LI> Within &lt;PRE&gt;, the carriage return should be interpreted 
as a shift to the start of the line;
that is, <CODE> col := 0; </CODE>
</UL>
</UL>
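
<P>
The &lt;PRE&gt; rules above can be read as a small cursor model.
What follows is only a sketch under stated assumptions: columns and rows are counted from 0,
tab stops fall on multiples of 8, and the names
<CODE>next_tab_stop</CODE> and <CODE>advance</CODE> are hypothetical helpers,
not part of any browser or of this specification.

```python
from typing import Tuple

TAB_WIDTH = 8  # assumption: tab stops at columns 0, 8, 16, ...


def next_tab_stop(col: int, width: int = TAB_WIDTH) -> int:
    """Return the first tab stop strictly to the right of a 0-based column."""
    return col + width - (col % width)


def advance(col: int, row: int, ch: str) -> Tuple[int, int]:
    """Move a (col, row) cursor past one character inside preformatted text."""
    if ch == "\t":      # Horizontal Tab (9): jump to the next tab stop
        return next_tab_stop(col), row
    if ch == "\n":      # Line Feed (10): start of a new line
        return 0, row + 1
    if ch == "\r":      # Carriage Return (13): start of the same line
        return 0, row
    return col + 1, row  # any printing character occupies one column
```

Under these assumptions a tab at column 3 moves the cursor to column 8,
and a tab at column 8 moves it to column 16.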

<H3><A NAME="numcharrefs">Numeric Character References</A></H3>
Any printing character within
the 8-bit character encoding of ISO 8859/1 (256 character positions)
or the 7-bit character encoding of ISO 646 (128 character positions)
may be represented within the text of an HTML document by a
numeric character reference.
See <A HREF="NumCharRef.html"> Numeric Character References </A>
for a list of the characters, their names and input syntax.
<P>
There are two key reasons for using a numeric character reference.
<UL>
<LI>
the keyboard does not provide a key for the character,
such as on U.S. keyboards which do not provide European characters
<LI>
the character may be interpreted as SGML coding, such as
the ampersand (&#38;), double quotes (&#34;),
the less-than (&#60;) and greater-than (&#62;) characters
</UL>
<P>
Numeric character references are represented in an HTML document
as an ampersand and number sign (&#38;#) followed by a decimal number
in the range 32-126 or 161-255 and a terminating semicolon.
The HTML DTD includes a numeric character reference
for each of the printing characters in Latin-1,
so that one may reference them by number if it is inconvenient
to enter them directly:
<PRE>
	the ampersand (&#38;#38;), double quotes (&#38;#34;),
	less-than (&#38;#60;) and greater-than (&#38;#62;) characters
</PRE>
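
<P>
As an illustration of the syntax, the sketch below builds a numeric character
reference from a single character. It assumes the printing ranges given above
(32-126 and 161-255); the function name <CODE>numeric_char_ref</CODE> is
hypothetical and chosen only for this example.

```python
def numeric_char_ref(ch: str) -> str:
    """Return the decimal numeric character reference for one Latin-1 printing character."""
    code = ord(ch)
    # Only the printing positions named in the text are accepted here.
    if not (32 <= code <= 126 or 161 <= code <= 255):
        raise ValueError("not a Latin-1 printing character: %r" % ch)
    return "&#%d;" % code


# The four markup-sensitive characters from the example above.
refs = [numeric_char_ref(c) for c in '&"<>']
```

Here <CODE>refs</CODE> holds the references for the ampersand, double quote,
less-than, and greater-than characters, in that order.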


<H3><A NAME="charents">Character Entities</A></H3>
Any of the Latin alphabet No. 1 set of printing characters 
may be represented within the text of an HTML document by a
character entity.
See <A HREF="Entities.html"> Character Entity Set(s) </A>
for a list of the characters, names, input syntax, and descriptions.
<P>
There are two key reasons for using a character entity.
<UL>
<LI>
the keyboard does not provide a key for the character,
such as on U.S. keyboards which do not provide European characters
<LI>
the character may be interpreted as SGML coding, such as
the ampersand (&amp;), double quotes (&quot;), 
the less-than (&lt;) and greater-than (&gt;) characters
</UL>
<P>
A character entity is represented in an HTML document
as an SGML entity whose name is defined in the HTML DTD.
The HTML DTD includes a character entity
for each of the SGML markup characters and
for each of the printing characters in the upper half of Latin-1,
so that one may reference them by name if it is inconvenient
to enter them directly:
<PRE>
	the ampersand (&#38;amp;), double quotes (&#38;quot;),
	less-than (&#38;lt;) and greater-than (&#38;gt;) characters

        Kurt G&#38;ouml;del was a famous logician and mathematician.
</PRE>
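
<P>
The substitution shown in the example can be sketched in a few lines.
This is an illustration only: it relies on the <CODE>codepoint2name</CODE>
table from Python's standard <CODE>html.entities</CODE> module, which
covers a later and larger entity set than the one in this DTD, and the
helper name <CODE>char_entity_ref</CODE> is hypothetical.

```python
from html.entities import codepoint2name


def char_entity_ref(ch: str) -> str:
    """Return the named entity reference for ch, or ch itself if no name is known."""
    name = codepoint2name.get(ord(ch))
    if name is None:
        return ch  # plain letters, digits, spaces, and so on pass through
    return "&%s;" % name


# "\u00f6" is o-umlaut, written as an escape so the source stays 7-bit.
sentence = "".join(char_entity_ref(c) for c in "Kurt G\u00f6del")
```

The resulting <CODE>sentence</CODE> contains only 7-bit characters, with
the o-umlaut replaced by its named entity reference.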


<H3><A NAME="markupchars">NOTE: Markup Characters</A></H3>

<P><EM>To ensure that a string of characters has no markup,
it is sufficient to represent all occurrences of &#60;,
&#62;, and &#38; by character or entity references.</EM>
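
<P>
A minimal sketch of that rule follows. It assumes named entity references
are the preferred form, and <CODE>escape_markup</CODE> is a hypothetical name.
The ampersand is replaced first so that the ampersands introduced by the
other two replacements are not escaped a second time.

```python
def escape_markup(text: str) -> str:
    """Replace the three markup-sensitive characters with entity references."""
    return (
        text.replace("&", "&amp;")   # must come first
            .replace("<", "&lt;")
            .replace(">", "&gt;")
    )


escaped = escape_markup("if a < b && b > c")
```

Python's standard library provides <CODE>html.escape</CODE>, which performs
a similar replacement.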

<H3>NOTE: CDATA, RCDATA</H3>

<P><EM>There are SGML features 
(CDATA, RCDATA) to allow most &#60;, &#62;, and &#38; characters 
to be entered without the use of entity or character references.
Because these features tend to be used and implemented inconsistently,
and because they require 8-bit characters to represent non-ASCII
characters, they are not employed in this version of the HTML DTD.
An earlier HTML specification included an XMP element whose
syntax is not expressible in SGML. Inside the XMP,
no markup was recognized except the &#60;/XMP&#62; end tag.
While implementations are encouraged to support this idiom,
its use is obsolete.</EM>

<H2><A NAME="comments">Comments</A></H2>

<P>To include comments in an HTML document that will be ignored
by the parser, surround them with &#60;!-- and --&#62;.
After the comment delimiter, all text up to the next occurrence
of -- is ignored. Hence comments cannot be nested.
Whitespace is allowed between the closing -- and &#62;.
(But not between the opening &#60;! and --.)

<P>For example:

<PRE>&#60;HEAD&#62;
&#60;TITLE&#62;HTML Guide: Recommended Usage&#60;/TITLE&#62;
&#60;!-- Id: Text.html,v 1.6 1994/04/25 17:33:48 connolly Exp --&#62;
&#60;/HEAD&#62;
</PRE>

<H3>Note: Tags in Comments</H3>

<P><EM>Some historical implementations incorrectly consider a
&#62; sign to terminate a comment.</EM>