Re: Named Character Entities for BIDI Texts

Glenn Adams (glenn@stonehand.com)
Wed, 26 Apr 95 10:43:19 EDT

Larry sent me the following message, which asks why the BIDI special
characters are needed. Since some of you may wonder the same, I thought
I'd respond to the WG list as well.

From: Larry Masinter <masinter@parc.xerox.com>
Date: Wed, 26 Apr 1995 00:57:02 PDT

Is it really appropriate to use right-to-left special characters for
directionality in a SGML context? As opposed to SGML tags? Somehow it
reminds me of font-shift characters: something's being done at the
wrong level.

These special characters are needed for special purposes; they aren't
used under most circumstances. That is, one doesn't use them to mark
a particular run of text as using a particular direction if the characters
of that text already have a resolvable direction.

The special bidi characters are needed for the following:

1. RTL MARK, LTR MARK - used to disambiguate directionality of
directionally neutral characters, e.g., if you have a double quote
sitting between an Arabic and a Latin letter, then which direction
does the quote resolve to? These characters are like zero width
spaces which have a directional property (but no word/line break
property).

2. ZWJ, ZWNJ - used to force or block joining behavior in contexts in
which joining would occur but should not, or would not occur but should.
For example, ARABIC LETTER HEH is used to abbreviate "Hijri" (the Islamic
calendrical system); however, the isolated form of HEH looks like the
digit five as employed in Arabic script (actually based on Indic digits).
In order to prevent one from reading HEH as a final digit five in a
year, the initial form of HEH is used. However, there is no following
context (i.e., a joining letter) to which the HEH can join. Therefore,
the ZWJ is used to provide that context. In Farsi texts, there are
cases where a letter that normally would join a subsequent letter
in a cursive connection does not. Here the ZWNJ is used.

3. RTL EMBEDDING, LTR EMBEDDING - used to handle nested directional
runs such as:

Given the following Latin/Arabic letters in backing store with the
specified embeddings:

LRE L0 L1 RLE A0 A1 LRE L2 L3 PDF A2 A3 PDF L4 L5 PDF

One gets the following rendering (with [] showing the directional
transitions):

[ L0 L1 [ A3 A2 [ L2 L3 ] A1 A0 ] L4 L5 ]

On the other hand, without these characters, e.g., with

L0 L1 A0 A1 L2 L3 A2 A3 L4 L5

and a base level of LTR one gets the following rendering:

[ L0 L1 [ A1 A0 ] L2 L3 [ A3 A2 ] L4 L5 ]

Notice that here A1,A0 is on the left and A3,A2 on the right, the
reverse of the case above where the embedding levels are used. Without
the embedding characters one has at most two levels: a base directional
level and a single counterflow directional level.

A common need for the embedding characters is to handle text that
has been pasted from one bidi context into another, including the
possibility of multiply embedded pastings.

4. LTR OVERRIDE, RTL OVERRIDE - these are needed to deal with unusual
pieces of text in which directionality cannot be resolved from context
in an unambiguous fashion. For example, in part numbers, formulas, telephone
numbers, and other similar pieces of text, it is difficult or impossible
to derive the directionality of numbers, punctuation, and other neutrals
from their context.
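
To make the two renderings under item 3 concrete, here is a minimal
Python sketch. It is not the full Unicode bidi algorithm: it assumes
only strong letters (tokens starting with "L" are LTR, tokens starting
with "A" are RTL) plus LRE, RLE and PDF, and it ignores overrides,
neutrals and numbers. The function name render and the token
convention are my own, chosen for illustration.

```python
def render(tokens, base_level=0):
    """Return tokens in visual order (simplified sketch, strong letters only)."""
    stack = []          # embedding levels saved at each LRE/RLE
    level = base_level  # current embedding level: even = LTR, odd = RTL
    items = []          # (token, resolved level), controls removed
    for tok in tokens:
        if tok == "LRE":
            stack.append(level)
            level += 2 if level % 2 == 0 else 1   # next higher even level
        elif tok == "RLE":
            stack.append(level)
            level += 1 if level % 2 == 0 else 2   # next higher odd level
        elif tok == "PDF":
            if stack:
                level = stack.pop()
        else:
            is_rtl = tok.startswith("A")          # assumption: "A*" = Arabic letter
            # A letter running against the embedding direction goes up one level.
            bump = 1 if is_rtl != (level % 2 == 1) else 0
            items.append((tok, level + bump))

    out = [tok for tok, _ in items]
    levels = [lv for _, lv in items]
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return out                                # nothing runs right-to-left

    # From the highest level down to the lowest odd level, reverse every
    # maximal run of items at that level or higher.
    for lv in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(out):
            if levels[i] >= lv:
                j = i
                while j < len(out) and levels[j] >= lv:
                    j += 1
                out[i:j] = out[i:j][::-1]
                levels[i:j] = levels[i:j][::-1]
                i = j
            else:
                i += 1
    return out


with_embeddings = "LRE L0 L1 RLE A0 A1 LRE L2 L3 PDF A2 A3 PDF L4 L5 PDF".split()
without_embeddings = "L0 L1 A0 A1 L2 L3 A2 A3 L4 L5".split()
print(" ".join(render(with_embeddings)))     # L0 L1 A3 A2 L2 L3 A1 A0 L4 L5
print(" ".join(render(without_embeddings)))  # L0 L1 A1 A0 L2 L3 A3 A2 L4 L5
```

The first call needs three distinct levels to place L2 L3 inside the
Arabic run; the second never gets beyond the base level and one
counterflow level.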

------------------

Of the above special characters, the embedding and override functions
could certainly be accomplished by adding a new element type, e.g.

<!ENTITY % text "#PCDATA | SUB | SUP | B | BIDI | %notmath">
<!ELEMENT BIDI - - (%text)+>
<!ATTLIST BIDI
%attrs;
force (ltr|rtl) #IMPLIED
dir (ltr|rtl) #IMPLIED
>

Thus, you would have:

<BIDI FORCE=RTL>...</> = RLO ... PDF
<BIDI DIR=RTL>...</> = RLE ... PDF
<BIDI DIR=RTL FORCE=LTR>...</> = <BIDI DIR=RTL><BIDI FORCE=LTR>...</></>
= RLE LRO ... PDF PDF
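
The equivalences above can be sketched in Python as a mapping from the
proposed DIR and FORCE attributes to the control characters they would
stand for. The <BIDI> element is only a proposal, and the function names
bidi_controls and wrap are hypothetical; the code points are those of
the five stateful controls in Unicode.

```python
# U+202A..U+202E in code point order: LRE, RLE, PDF, LRO, RLO
LRE, RLE, PDF, LRO, RLO = (chr(n) for n in range(0x202A, 0x202F))


def bidi_controls(direction=None, force=None):
    """Return (opening, closing) controls for a proposed <BIDI> element.

    direction stands for the DIR attribute, force for the FORCE attribute;
    each may be "ltr", "rtl", or None (attribute not specified).
    """
    opening = ""
    if direction is not None:
        opening += {"ltr": LRE, "rtl": RLE}[direction.lower()]
    if force is not None:
        opening += {"ltr": LRO, "rtl": RLO}[force.lower()]
    # Every embedding or override that was opened needs its own PDF.
    return opening, PDF * len(opening)


def wrap(text, direction=None, force=None):
    """Surround text with the controls equivalent to the given attributes."""
    opening, closing = bidi_controls(direction, force)
    return opening + text + closing


# <BIDI DIR=RTL FORCE=LTR>abc</>  =  RLE LRO abc PDF PDF
print([hex(ord(c)) for c in wrap("abc", direction="rtl", force="ltr")])
```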

Personally I would prefer using this markup over using the stateful
bidi controls; however, doing this would require (1) adding a new
element type; (2) translating the stateful bidi controls in existing
text to markup.

If 10646 is really to become the document character set, we have to
allow for the appearance of these coded characters. One way to handle
this would be to define SHORTREFs as follows:

<!ENTITY lretag "<BIDI DIR=LTR>" >
<!ENTITY rletag "<BIDI DIR=RTL>" >
<!ENTITY lrotag "<BIDI FORCE=LTR>" >
<!ENTITY rlotag "<BIDI FORCE=RTL>" >
<!ENTITY pdftag "</BIDI>" >
<!SHORTREF bidi "&#LRE;" lretag
"&#RLE;" rletag
"&#LRO;" lrotag
"&#RLO;" rlotag
"&#PDF;" pdftag
>

In this case LRE, RLE, LRO, RLO, and PDF would have to be declared
as function names (mapped to the appropriate character numbers) in
the SGML declaration's concrete syntax:

FUNCTION LRE 8234 -- LEFT-TO-RIGHT EMBEDDING --
RLE 8235 -- RIGHT-TO-LEFT EMBEDDING --
PDF 8236 -- POP DIRECTIONAL FORMATTING --
LRO 8237 -- LEFT-TO-RIGHT OVERRIDE --
RLO 8238 -- RIGHT-TO-LEFT OVERRIDE --

With the above shortrefs and a new <BIDI> element we could deal
with existing text which contains these bidi controls and do so
in the framework of marked up text.
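
The effect of the shortref map on existing text can be imitated outside
an SGML parser. The following Python sketch simply substitutes the
proposed tags for the five controls, using the character numbers given
in the FUNCTION declarations above. The name controls_to_markup is
hypothetical, and unlike a parser working from the DTD the function
does not check that each PDF matches an open embedding or override.

```python
# Character numbers as in the FUNCTION declarations (decimal 8234..8238).
CONTROL_TO_TAG = {
    8234: "<BIDI DIR=LTR>",    # LRE, LEFT-TO-RIGHT EMBEDDING
    8235: "<BIDI DIR=RTL>",    # RLE, RIGHT-TO-LEFT EMBEDDING
    8236: "</BIDI>",           # PDF, POP DIRECTIONAL FORMATTING
    8237: "<BIDI FORCE=LTR>",  # LRO, LEFT-TO-RIGHT OVERRIDE
    8238: "<BIDI FORCE=RTL>",  # RLO, RIGHT-TO-LEFT OVERRIDE
}


def controls_to_markup(text):
    """Replace stateful bidi controls in text with the proposed BIDI tags."""
    # str.translate accepts a dict from code points to replacement strings.
    return text.translate(CONTROL_TO_TAG)


sample = "part no. " + chr(8237) + "A-12/7" + chr(8236) + " follows"
print(controls_to_markup(sample))
# part no. <BIDI FORCE=LTR>A-12/7</BIDI> follows
```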

These controls were necessary in Unicode & ISO/IEC 10646 because
those standards have no mark-up; i.e., they are based on plain text
encoding only. These controls are fundamentally different from font
shift type controls (or other style information) since they affect the
ability to render a text in a semantically legible fashion. That
is, without these special bidi characters, cases arise which would
prevent *any* rendering whatsoever that reflected the basic meaning
of the text. It is for this reason that these special characters
were added to Unicode (and, thence, to ISO/IEC 10646). If it were
possible to do reliable bidi without them, they definitely would
not have been included in Unicode (at least not the stateful characters:
LRE, RLE, LRO, RLO, and PDF).

While my preference would be to add the above new element type and
shortrefs, I recognize that this is potentially more work than simply
defining a few additional general entities. It is for this reason
that I proposed the latter. We could add both the collection of
general entity references *and* the above element type and shortref.
This would serve the broadest needs in terms of both compatibility
and flexibility.

Glenn