Merck Research Laboratories
Box 2000, RY50SW-100
Rahway, NJ 07065
Department of Computer Science
University of Central Florida
Orlando, FL 32816
Automatically acquiring knowledge from encyclopedic texts, specifically the biographies of famous people found in the World Book Encyclopedia, begins with an electronic version of the text of a biography and ends with knowledge structures representing the knowledge that has been acquired. This acquisition process is performed without human assistance of any kind and thus involves not only issues in the area of knowledge acquisition, but also issues in natural language understanding and knowledge representation. We describe one problem from each of the latter two areas: interpreting deverbal nominalizations and the representation of a priori knowledge. The results of two comprehensive experiments are presented, which show that these problems can be solved on the way to near-human performance in answering a large set of questions.
Automatically acquiring knowledge from encyclopedic texts, specifically the biographies of famous people found in the World Book Encyclopedia, begins with an electronic version of the text of a biography and ends with knowledge structures representing the knowledge that has been acquired. This acquisition process is performed without human assistance of any kind and thus involves not only issues in the area of knowledge acquisition, but also issues in natural language understanding and knowledge representation. The knowledge, once acquired, can then be used to satisfy other computational tasks.
We began acquiring knowledge from encyclopedic texts concerning animals in 1993 and 1994 [Gomez et al., 1994; Gomez, 1995a], using SNOWY, a program which acquires knowledge from expository texts [Gomez & Segami, 1991]. At that time, we investigated many of the issues that had to be faced to automatically acquire knowledge about the dietary habits and habitats of animals, using the World Book Encyclopedia CD-ROM as the source of our answers. The results of this investigation were quite promising: the system generated a single, correct parse for 71% of the sentences selected as relevant, and generated a correct and complete interpretation for 55% of the parsed sentences. Using the knowledge acquired during this process allowed the system to answer many interesting questions about animals, such as, Which birds eat nectar? and Do monkeys have enemies?
Towards the end of 1994, our attention turned to what we felt was an even more challenging task: acquiring knowledge about the famous people described in the World Book. Initial examination of these biographies showed that they contained a much larger set of relationships between domain concepts, and that the syntactic construction of deverbal nominalization was much more common in these texts than in the animal ones. The question we sought to answer then became, ``If these two problems can be handled in a general manner, how much knowledge can we acquire from biographical articles?"
The remainder of this paper discusses these issues in detail and presents the results of two tests designed to answer this question. Section 2 provides background information about the parser and semantic interpreter used by the system. Section 3 explains the knowledge we sought to acquire and describes how the two issues of nominalizations and domain knowledge were handled. Section 4 details the results of the two tests. Lastly, section 5 presents related work and our conclusions.
Acquiring knowledge automatically from encyclopedic texts involves parsing and interpreting sentences as a first step. The success of subsequent processes will be limited by the success of the parser and interpreter. Therefore, it is appropriate to mention these two components to acquaint the reader with the methods that are used and how successfully they perform. Emphasis is given to the treatment of the interpreter, because the knowledge structures representing the knowledge that has been acquired are based on the output of the semantic interpreter (see [Gomez, 1996] for a discussion of this issue).
The input to the semantic interpreter is a partial parse. The design of the parsing algorithm has been motivated by the view that the output of a parser should not be a tree, but a structure in which constituents are related to each other not by immediate dominance relations, but by simple dominance relations [Marcus, et al., 1983; Marcus, 1987]. The reason for this is that immediate dominance relations between two constituents cannot be determined on the basis of syntax. As a result of this, the parser does not resolve structural ambiguity and does not attach prepositional phrases and other modifiers. However, the parser does resolve long-distance dependency resulting from relative clauses and questions. The output of the parser consists of the following syntactic relations: PP (prepositional phrase), subject, object, object2, and predicate. Object is built for the first postverbal complement of the verb, and object2 is built for the second postverbal complement of two-object verbs. The parser also produces an analysis of verb adjuncts like time NPs (noun phrases), and distance NPs, e.g., Peter read every hour, Geese fly long distances. Sentential complements are constructed by putting a pointer, a gensym, in the slot object, or, object2. Thus, the following structures are produced for the sentence ``Peter wants to read a book:'' (Note that NPs and verb phrases are also analyzed.)
g01 (subj (Peter) verb (wants) obj (g02))
g02 (subj (Peter) verb (read) obj (a book))
and the sentence ``John told Mary to go to the library'' is parsed into:
g01 (subj (John) verb (told) obj (Mary) obj2 (g02))
g02 (subj (Mary) verb (go) pp (to (the library)))
The parser also analyzes conjunctions, including coordinate conjunctions, comparatives and a good number of appositions. As of this writing, it has a lexicon of 75,000 words, containing a detailed subcategorization for most English verbs. In a test performed on April 14, 1997, the parser produced a correct parse for 75% of 300 sentences taken randomly from the World Book Encyclopedia. Because the parser does not resolve structural ambiguity, it produced just one parse for 95% of the sentences correctly parsed. The average time in parsing a sentence was one second on a Sparc 5 machine, running Franz Lisp. The parser has been constructed gradually during the last twelve years. For a basic description see [Gomez, 1995b].
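As an illustration only, the gensym-linked clause structures shown earlier for ``Peter wants to read a book'' can be pictured as a flat table of clauses in which a slot either holds a constituent or points to another clause. The sketch below is in Python rather than the system's Lisp, and the function names (`gensym`, `parse_wants_to_read`, `resolve`) and the dictionary layout are assumptions made for this example, not the parser's actual data structures.

```python
from itertools import count

_ids = count(1)

def gensym():
    """Return a fresh clause label such as 'g01' (label format is illustrative)."""
    return f"g{next(_ids):02d}"

def parse_wants_to_read():
    """Hand-built structures for 'Peter wants to read a book' (no real parsing)."""
    main, comp = gensym(), gensym()
    return {
        # The obj slot of the main clause holds a pointer to the complement clause.
        main: {"subj": "Peter", "verb": "wants", "obj": comp},
        comp: {"subj": "Peter", "verb": "read", "obj": "a book"},
    }

def resolve(clauses, label, slot):
    """Follow a slot; if it holds a clause label, return the clause it points to."""
    value = clauses[label][slot]
    return clauses.get(value, value)

clauses = parse_wants_to_read()
```

Because the clauses are related only by pointers, nothing in this representation commits to where a modifier attaches, which is the property the partial parse is meant to preserve.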
The semantic interpreter is responsible for constructing the logical form from the output produced by the parser for a given sentence. Interpretation involves: determining the meaning of verbs, called verbal concepts; determining the attachment and meaning of prepositional phrases; interpreting complex noun phrases and determining their relationships to the verbal concept; determining the meaning of subordinate clauses and the referents of explanatory and restrictive relative clauses; and making inferences. As constituents are identified by the parser, the interpreter is invoked to interpret the constituent and to integrate that interpretation into the aggregate interpretation of the clause. Central to this approach is the determination of the clause's verbal concept, which is the foundation of its representation. Without a verbal concept, the interpretations of individual constituents cannot be combined in any meaningful way. For example, it is the verbal concept of the verb ``conquer" which establishes a relationship between ``Alexander" and ``Persian Empire" in the sentence Alexander conquered the Persian Empire. At the same time, it is the constituents of the sentence which disambiguate the verb. Therefore, as each constituent is given to the interpreter, the interpreter checks to see if that constituent can, with the support of other constituents already interpreted, select a verbal concept. Once a verbal concept has been selected, each newly arriving constituent is placed within its framework. Determining the verbal concept as early as possible is of course helpful, but in many cases it is prudent to postpone determination until more of the sentence has been examined. This ``ambiguity procrastination" [Rich et al., 1987] keeps the interpreter from jumping to a verbal concept prematurely.
For instance, rather than immediately deciding that the meaning of ``took" in the sentence Maxwell took the test to Pita is take-examination when the object is interpreted, it is necessary to wait until the prepositional phrase (PP) ``to Pita" is parsed. This phrase reveals that the true meaning of ``took" is to transport.
VM rules and verbal concepts play a fundamental role within the interpretation process. There is one set of VM rules defined for each verb. VM rules are classified as subj-rules, verb-rules, obj1-rules, obj2-rules, pred-rules, prep-rules, and end-of-clause-rules. Rules are activated when a verb, a syntactic relation, or a prepositional phrase has been parsed, or when the end of the clause has been reached. In most cases, the antecedents of VM rules contain selectional restrictions that determine whether the interpretation of the syntactic constituent is a subclass of some concept in the system's long-term memory (LTM) ontology. If during an examination of LTM, the selectional restriction is passed, the consequent(s) of the VM rule establish the verbal concept for the verb and the semantic role of the constituent. If no rules fire, the parser inserts the syntactic relation or prepositional phrase in the structure being built and continues parsing.
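To make the interplay of selectional restrictions and delayed commitment concrete, the following Python sketch replays the ``Maxwell took the test to Pita" example. Everything in it is hypothetical: the toy ontology `ISA`, the two rules in `TAKE_RULES`, the role name `destination`, and the priority ordering used at the end of the clause are assumptions of this sketch and are not the system's actual VM rules.

```python
# Toy is-a ontology standing in for LTM (all entries are hypothetical).
ISA = {"test": "examination", "examination": "thing",
       "Pita": "human", "human": "agent", "agent": "thing"}

def is_a(concept, ancestor):
    """True if `concept` is subsumed by `ancestor` in the toy ontology."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ISA.get(concept)
    return False

# Hypothetical rules for 'take':
# (syntactic relation, preposition, selectional restriction, verbal concept, role)
TAKE_RULES = [
    ("prep", "to", "agent", "transport", "destination"),
    ("obj", None, "examination", "take-examination", "theme"),
]

def interpret(constituents):
    """Collect candidate verbal concepts, committing only at the end of the clause."""
    candidates = []
    for relation, prep, head in constituents:
        for rel, rule_prep, restriction, concept, role in TAKE_RULES:
            if rel == relation and rule_prep == prep and is_a(head, restriction):
                candidates.append((concept, role, head))
    # End of clause: rules are listed in priority order (an assumption of this
    # sketch), so a PP-triggered concept overrides the object-based one.
    for _, _, _, concept, _ in TAKE_RULES:
        if any(c == concept for c, _, _ in candidates):
            return concept
    return None
```

With only the subject and object available, `interpret` returns `take-examination`; once the constituent `("prep", "to", "Pita")` is added, it returns `transport`, mirroring the postponed decision described above.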
Verbal concepts are represented as frames, whose slots map syntactic relations to semantic ones. The description of defend is shown below to illustrate the different components of verbal concepts.
(defend
  (is-a (action-r))
  (subj (agent (actor)))
  (obj (thing (theme)))
  (prep (against (thing (thing-performing-attack (strong))))
        (from (thing (thing-performing-attack (strong))))))
The first entry, ``(is-a (action-r))," places defend within the hierarchy of actions as a direct subconcept of action-r (see [Gomez, et al., 1997] for a discussion of verbal concept hierarchies). The remaining slots manage the mapping of the syntactic relations to semantic roles. The subj entry represents that if the subject of defend passes the restriction of being subsumed by the concept agent (agents are either humans or organizations), then the subject fills the thematic role of actor. The next entry specifies that the direct object of defend must pass the restriction of thing, which means that in principle, anything can be defended. Finally, the last entry describes how to handle PPs that begin with ``against" or ``from". The object of a PP of the form against <object>, where <object> is a thing, is made to fill the thing-performing-attack role. This handles cases such as: Peter defended Mary against the bullies, where the object of the preposition is an agent; Peter defended Mary against the criticism, where the object of the PP is an idea; and The walls defended the city against the elements, where the object of the PP is a phenomenon. ``From" PPs are handled identically. The value strong after the thematic role indicates that the verbal concept claims that preposition strongly (see [Gomez, et al., 1997] for a complete discussion of prepositional phrase interpretation).
This information is used by the interpreter to attach PPs. Other relation specifications are inherited from the superconcepts of the verbal concept; hence, specifications found in action-r can be used for relations not mentioned in defend. For example, there is nothing in the specification of defend for handling the preposition ``in". When a PP is encountered beginning with ``in", the algorithms first look to the verbal concept for the current clause. If no rules are found, the parent concept(s) are examined. This process continues until some rules are found or the ancestors of the verbal concept have all been examined. The upwards traversal of the verbal concept's superconcepts is terminated, however, after the first set of rules for the given syntactic relation is found. Otherwise, rules which are too general could be applied inappropriately.
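The upward search described above, which stops at the first ancestor that supplies rules for the relation in question, might be sketched as follows. This is a Python illustration under stated assumptions: the table `VERBAL_CONCEPTS`, the function `find_prep_rules`, and in particular the ``in" rule attached to action-r are invented for the example; only the defend entries follow the frame given in the text.

```python
VERBAL_CONCEPTS = {
    # The 'in' rule on action-r is hypothetical, added only to show inheritance.
    "action-r": {"is-a": [], "prep": {"in": [("location", "at-loc")]}},
    "defend": {
        "is-a": ["action-r"],
        "prep": {
            "against": [("thing", "thing-performing-attack")],
            "from": [("thing", "thing-performing-attack")],
        },
    },
}

def find_prep_rules(concept, preposition):
    """Walk up the is-a links level by level; stop at the first concept with rules."""
    frontier, seen = [concept], set()
    while frontier:
        next_frontier = []
        for name in frontier:
            if name in seen:
                continue
            seen.add(name)
            rules = VERBAL_CONCEPTS[name]["prep"].get(preposition)
            if rules:
                # Stop here so that more general ancestors are never consulted.
                return name, rules
            next_frontier.extend(VERBAL_CONCEPTS[name]["is-a"])
        frontier = next_frontier
    return None, []
```

Stopping at the nearest concept that has rules is what keeps overly general rules higher in the hierarchy from being applied inappropriately.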
The interpreter uses inference rules, called addition rules, to add relations to the output of the interpreter that are implied by the sentence's verbal concept(s). For example, from the sentence, Colin Powell graduated from the City College of New York, the interpreter adds a new relation to the interpretation that is suggested by the sentence's verbal concept, graduate-institution.
(ACTOR (COLIN-POWELL (Q (CONSTANT)))) (PR (ATTEND-INSTITUTION)) (THEME (CITY-COLLEGE-OF-NEW-YORK (Q (CONSTANT))))
This relation represents the inference that if Colin Powell graduated from the City College of New York, then he most likely attended that college. Addition rules provide a mechanism for capturing verbal entailments during interpretation rather than either missing them completely or attempting to ``prove" them at a later time. Our strategy has been to generate inferences pragmatically in that only the most commonly agreed upon inferences, that we are interested in, are made during interpretation. Hence, we would not attempt to posit the potentially infinite set of inferences that one might conceive of: Powell took tests; he read books; he sat at a desk; and so on. These inferences are perhaps valid, but uninteresting, and in fact would not warrant explicit description in encyclopedic texts. Our ``interest" in an inference depends on whether or not it suggests a relation that we are actively trying to acquire. In the next section we describe the relations that define the knowledge we sought.
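A minimal Python sketch of an addition rule, assuming interpretations are simple role-to-filler mappings, is given below. The table `ADDITION_RULES` and the function `apply_addition_rules` are illustrative names; the only rule included is the graduate-institution to attend-institution entailment named in the text.

```python
ADDITION_RULES = {
    # verbal concept -> relations it entails (only the one named in the text)
    "graduate-institution": ["attend-institution"],
}

def apply_addition_rules(interpretation):
    """Return the interpretation plus any relations entailed by its verbal concept."""
    results = [interpretation]
    for entailed in ADDITION_RULES.get(interpretation["pr"], []):
        # Same actor and theme, new predicate.
        results.append(dict(interpretation, pr=entailed))
    return results

powell = {"actor": "COLIN-POWELL", "pr": "graduate-institution",
          "theme": "CITY-COLLEGE-OF-NEW-YORK"}
relations = apply_addition_rules(powell)
```

Keeping the rule table small reflects the pragmatic strategy described above: only inferences that suggest a relation of interest are listed, so uninteresting entailments are never generated.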
The first issue we faced was to select which relations the system was to acquire. Certainly, we wanted to acquire ones which were related to the important actions, beliefs and descriptions that make up historical figures' lives, not mundane relations like ingest. However, how does one decide what is important? It was our intent not to inject our own bias into the selection of the questions, thereby addressing any criticism that the question list was hand crafted. We found another source of guidance for selecting our set of questions, the encyclopedia itself. The selection of the content of the World Book was influenced by the Classroom Research Project, which provides continuous testing of the encyclopedia in more than 400 classrooms throughout the U.S. and Canada. Students use the World Book and fill out cards to show what they looked up. Over 100,000 cards are analyzed annually to determine the actual patterns of classroom use. Therefore, the content of the encyclopedia is made to reflect those aspects that students are most interested in learning. Given this practice, the frequency with which a particular topic is discussed within the articles should be related to the level of interest shown by the students.
We analyzed the 5040 biographical articles of the World Book and constructed a list of the words they contain sorted by frequency. From the 980,000+ words in this corpus, we extracted those verbs which could be potentially seen in more than just 1% of the articles, i.e., verbs with more than 50 occurrences, ignoring very common verbs such as ``is" and ``have". From this list we constructed a set of relations suggested by a subset of those verbs. A wide range of relations was selected, though in some cases the relations overlap or are complementary. For example, to be elected and to win an election both imply the same underlying relation, and to attend college and to graduate from college are complementary. But there was no concerted effort to select verbs that all fell within one conceptual hierarchy of relations as was done in the analysis of the animal texts. On the contrary, a broader approach was taken to get a feeling for the breadth of the range of human activities and relations and to see what difficulties would surface that might not otherwise appear if the focus was on a small set of relations.
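The frequency filter described above can be approximated with a few lines of Python. The sketch assumes the corpus is already available as (word, tag) pairs with a `VERB` tag, which is an assumption of this example and says nothing about how the original word list was built; it also counts surface forms without grouping inflections.

```python
from collections import Counter

STOP_VERBS = {"is", "have"}  # the two very common verbs named in the text

def frequent_verbs(tagged_tokens, min_count=50):
    """Count verb tokens and keep those seen more than `min_count` times."""
    counts = Counter(
        word.lower()
        for word, tag in tagged_tokens
        if tag == "VERB" and word.lower() not in STOP_VERBS
    )
    return {verb: n for verb, n in counts.items() if n > min_count}
```

The default threshold of 50 corresponds to roughly 1% of the 5040 articles, matching the cutoff stated above.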
Having selected the relations the system was to acquire, we chose to implement a presentation component that would allow a user to enter English questions and have the system automatically answer them with a pseudo-English answer. This would provide a more user-friendly means of verifying that acquisition had occurred. Below are several examples from a list of over 90 questions that the system should be able to answer, assuming that the answer was in the encyclopedia and that the system had performed its task. Phrasing of the questions was allowed to vary in order to target different thematic roles.
The ``X" could be replaced by the name of any person having an article in the World Book. Acquiring historical knowledge to answer these questions required changing the interpreter to handle deverbal nominalizations, and required defining new verbal concepts and VM rules to capture the domain relations and concepts described in the biographies. The next two sections discuss why this was important.
The semantic interpretation of nominalizations is necessary for automatically acquiring knowledge from encyclopedic texts. Each nominalization represents a clausal structure; therefore, for the task of acquiring knowledge, a nominalization is just as important as any other clause. Moreover, ignoring nominalizations can cause problems with the interpretation of the clauses in which they appear because nominalizations are contenders for prepositional phrases, and if these modifiers are not attached to them, PPs may be mistakenly attached to other constituents in the sentence.
These two points would be academic were it not for the fact that nominalizations are so common in the encyclopedia articles we have been studying. In a random selection of 25 paragraphs from biographical articles of the World Book Encyclopedia, 23 had two or more nominalizations. To illustrate the context in which these nominalizations reside, consider the following short passage from an article about Francisco Franco from the World Book Encyclopedia.
In 1935, he became army chief of staff. The following year, the leftists won the election and sent Franco to a post in the Canary Islands. Military leaders plotted to overthrow the leftist government in 1936. Franco delayed taking part in the plot, but he was promised command of the most important part of the army. The revolt began in July 1936 and it started a total civil war. Two and a half months later, the rebel generals named Franco commander in chief and dictator. Franco's forces, called Nationalists, received strong support from Italy and Germany. On April 1, 1939, after 32 months of bitter fighting, the Nationalists gained complete victory. Franco then became dictator without opposition.
Some typical uses of nominalizations are present here. The ``plot", mentioned in the fourth sentence, refers to the event mentioned in the sentence before, and is used to describe Franco's lack of involvement in the plan. Anaphoric usages of nominalizations are commonly employed to further specify the thematic roles of the original predicate or as a means for relating the original predicate to new information. Another use is illustrated in the seventh sentence. In, ``Franco's forces [...] received strong support from Italy and Germany," the nominalization ``support" greatly affects the meaning of the clause. Light verbs such as ``receive" are often used in conjunction with nominalizations, and it is the underlying verb of the nominalization that usually specifies the predicate of the clause. Other nominalizations in this paragraph are election, revolt, fighting, and opposition.
If we do not interpret the nominalizations in the passage above, we will not be able to answer questions like Which nations supported Franco's forces? Did Franco plot to overthrow the government in 1936? Who was elected in 1936? And if we cannot answer these questions, then our attempt to acquire knowledge from this text has been severely compromised.
Failure to interpret a nominalization often means missing the opportunity to acquire knowledge which could not be acquired in any other way. Of the nominalizations tested, trade most clearly illustrates the necessity of nominalization interpretation for the purpose of knowledge acquisition. There are 200 occurrences of trade used as a nominalization and only 47 occurrences of trade, traded, trades, and trading used as a verb. Therefore, if one is interested in acquiring knowledge about ``who traded what with whom", the nominalization form must be handled or 81% of the trade relations will not even be considered.
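The 81% figure follows directly from the two counts given above, taking the 200 nominal and 47 verbal occurrences as the full set of trade relations in the corpus:

```latex
\frac{200}{200 + 47} = \frac{200}{247} \approx 0.81
```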
The interpreter attempts to determine the verbal concept of the nominalization and to fill its thematic roles. Determination of the verbal concept requires disambiguation of the meaning of the nominalization's root verb. This ambiguity may be resolved by examining the noun phrase in which the nominalization occurs, or as is true in many cases, disambiguation can only be accomplished by examining postnominal prepositional phrases. Once the verbal concept has been identified, surrounding nouns are then interpreted as verbal concept arguments. Three separate algorithms were used: the nominalization noun phrase algorithm, the prepositional attachment and meaning determination algorithm, and the end-of-clause algorithm. The details of these algorithms are presented in [Hull & Gomez, 1996].
The interpretation of a sentence depends on the existence of a priori verbal knowledge encoded as verbal concepts and VM rules, and also on an a priori ontology for content words. Extending the set of these structures to adequately cover the discourse domain is important. Supporting the verbal concepts and VM rules are a priori concepts in LTM which form the selectional restrictions of the verbal concept specifications and the antecedents of VM and preposition rules. For example, suppose that it has been determined that the actor role of a military battle needs to be a military organization. The concept military-organization, if it does not exist in the ontology, should now be defined within it. Defining the necessary concepts is usually undertaken in conjunction with the definition of the verbal concepts that reference them.
A complete discussion of the individual verbal concepts and VM rules developed for this task is beyond the scope of this paper. What we would like to convey, however, is our approach to the basic problem of defining verbal and concept knowledge. The process begins with the identification of the verbal concepts of interest. If these verbal concepts can be organized hierarchically, then selectional restrictions and inferences can be inherited, making the creation of new verbal concepts simpler. In the case of our biography work, a broad range of verbal concepts was selected; therefore, a single hierarchy did not emerge. The thematic roles of the verbal concept, including those which are filled by prepositional phrases, are then developed.
Verbs that suggest the verbal concepts are represented by VM rules. Determining which verbs ``suggest" a verbal concept can be time consuming because the list often contains verbs that are not considered direct synonyms. Any necessary concepts are then added to the system's ontology. Finally, inferences are added to the verbal concepts. In the past this has been a somewhat ad hoc process in that we did not attempt to create a comprehensive hierarchy of verbal concepts, but rather incrementally added new verbal concepts as they were needed. Recently, Gomez [Gomez, 1997] has begun to leverage WordNet verbal knowledge in hopes that it will become the foundation of just such a comprehensive hierarchy.
One of the relations the system acquired was the verbal concept attend-institution. Consider what occurred when the question ``Did Bill Clinton attend college?" was asked. The system began only with the knowledge that Clinton is a human name. This was represented in the knowledge representation language, KL-SNOWY, as:
(CLINTON (INSTANCE-OF (HUMAN-NAME)))
Upon being asked the question, the system analyzes the article on Bill Clinton and its final knowledge structures are:
(CLINTON (INSTANCE-OF (HUMAN)) (NAME-OF (BILL-CLINTON)))
(BILL-CLINTON
  (HAS-NAME (CLINTON))
  (JOIN (PUBLIC-SCHOOL ($MORE (@A11))))
  (ATTEND-INSTITUTION (ROMAN-CATHOLIC-SCHOOL ($MORE (@A12)))
                      (PUBLIC-SCHOOL ($MORE (@A13)))
                      (GEORGETOWN-UNIVERSITY ($MORE (@A42) (@A43)))
                      (COLLEGE ($MORE (@A48)))
                      (YALE-LAW-SCHOOL ($MORE (@A82))))
  (POSSESS (@X26 ($MORE (@A28))))
  (GRADUATE-R%BY ($UNKNOWN ($MORE (@A47)))
                 (SOMEBODY ($MORE (@A68))))
  (PTRANS (BILL-CLINTON ($MORE (@A81)))
          ($NULL ($MORE (@A91))))
  (PTRANS%BY (BILL-CLINTON ($MORE (@A81))))
  (RETURN-TRIP ($NULL ($MORE (@A90)))))
The concept CLINTON has changed because the system has learned that CLINTON is the name of an instance of a particular human, Bill Clinton. All of the relations involving Bill Clinton are stored under the concept BILL-CLINTON. The first relation, (HAS-NAME (CLINTON)), is an inverse relation which points back to the CLINTON concept. Following this relation are others which represent knowledge about Bill Clinton joining public school, and attending Roman Catholic School, Georgetown University, Yale, etc. KL-SNOWY represents n-ary relations through the use of action structures, such as @A11 above, which contain the other arguments of the relation. While a complete explanation of KL-SNOWY's syntax and expressive power could not be accommodated here (see [Gomez & Segami, 1991] for a discussion of KL-SNOWY), the important thing to see is that more has been acquired than simply the answer to our original yes-no question.
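To show how structures of this kind could support a yes-no question, here is a small Python sketch that stores a subset of the relations above and looks them up through the name link. It is a deliberate simplification: the dictionary `LTM` and the functions `resolve_name` and `ask` are hypothetical, the $MORE action structures are omitted, and no claim is made that KL-SNOWY retrieves answers this way.

```python
LTM = {
    "CLINTON": {"INSTANCE-OF": ["HUMAN"], "NAME-OF": ["BILL-CLINTON"]},
    "BILL-CLINTON": {
        "HAS-NAME": ["CLINTON"],
        "ATTEND-INSTITUTION": [
            "ROMAN-CATHOLIC-SCHOOL", "PUBLIC-SCHOOL",
            "GEORGETOWN-UNIVERSITY", "COLLEGE", "YALE-LAW-SCHOOL",
        ],
    },
}

def resolve_name(name):
    """Map a name concept to the individual it names, if LTM records one."""
    named = LTM.get(name, {}).get("NAME-OF", [])
    return named[0] if named else name

def ask(name, relation, value=None):
    """Return stored fillers of `relation`, or a yes/no answer when `value` is given."""
    fillers = LTM.get(resolve_name(name), {}).get(relation, [])
    return fillers if value is None else value in fillers
```

For example, `ask("CLINTON", "ATTEND-INSTITUTION", "COLLEGE")` follows the NAME-OF link to BILL-CLINTON and answers from the stored relation, while calling `ask` without a value returns every institution recorded, which reflects the point that more is acquired than the answer to the original question.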
During its processing, the system acquired 70 new concepts and relations about Clinton which can then be used to answer the initial question. Shown below are the actual answers provided by the system.
Note that the second question is answered without the system having to consult the encyclopedia a second time because the necessary knowledge was already integrated into LTM. Other examples of questions posed to the system and its generated answers are also shown.
These examples were intended to give a feeling for what the system can do. A comprehensive discussion of its capabilities is provided in the next section.
The algorithms and domain knowledge mentioned within this paper were created and tested during the second half of 1996 and the spring of 1997. Three separate tests were run to determine the effectiveness of this approach for automatically acquiring knowledge. The test suite included 1) a clause interpretation test which measured how well the system interpreted clauses containing verbs related to the questions we sought to answer, 2) a nominalization test which measured how successful the algorithms were in determining the underlying verbal concept of the nominalization and determining the attachment of prepositional phrases to the nominalization and their meaning, and 3) a question answering test which measured how likely the system was to produce correct and complete answers from over 70 randomly chosen articles. The testing procedure and results of the second and third tests are explained below. The results of the first test, not presented here for the sake of brevity, can be found in [Hull, 1997].
The nominalization algorithms described in [Hull & Gomez, 1996] were tested to determine how successfully they disambiguated the nominalization, recognized the underlying verbal concept of the nominalization, and filled its thematic roles. The algorithms assume the existence of rules for disambiguating the root verb of each of the nominalizations, as well as the mapping rules for those syntactic constructions which are specific to the nominalization. The verb disambiguation rules had already been written as part of our ongoing research, and therefore, the effort needed to handle the nominalizations of these verbs was quite small. Moreover, a list of proper nouns representing proper names was used for recognizing people and locations.
The results of the testing are shown in Table 1. Ten nominalizations were selected randomly from a list of nominalizations with at least 20 occurrences in 5000 biography articles from the World Book Encyclopedia. The column n shows how many occurrences of the nominalization were found in those articles. The algorithms were applied to each occurrence, and the results of the interpreter were examined to see if the nominalization was correctly disambiguated, if the genitive and the rest of NP was correctly interpreted, and how successfully the algorithms interpreted prepositional phrases modifying the nominalization.
The results in Table 1 illustrate the strengths and the one limitation of the algorithms. The correct sense of each nominalization was selected more than 70% of the time, with the worst disambiguation score, 72%, occurring when testing ``control," the most ambiguous nominalization with 11 WordNet senses. Failures to disambiguate were most often caused by situations where the verb rules could not be directly applied. For example, in the sentence Court was noted for her endurance and control, nothing triggers any of the verb rules. Further, because ``control" has both verbal and non-verbal senses, one can't assume that this is an instance of either one. Other disambiguation errors resulted from rules that didn't fire or selected the wrong verbal concept, or that missed a non-verbal sense. On the whole, however, these algorithms provide an effective means of nominalization disambiguation.
In comparison, Table 2 shows the results of applying a simple algorithm for nominalization disambiguation. This algorithm selects the most frequent sense of the noun under consideration from version 1.5 of WordNet. The most frequent sense was effective for four of the ten nominalizations tested, but fared poorly on the other six. Clearly, the poor showing of this strategy shows that choosing the most ``frequent" sense is successful only when WordNet's idea of most frequent agrees with the use of that sense in the encyclopedia article. Moreover, establishing which sense is used does not provide a means for determining the thematic roles of the nominalization.
The results of determining the thematic roles of deverbal nominalizations are given by the next three columns of Table 1. The thematic roles of genitives were found 93% of the time, showing how regular genitives are. The only statistically relevant problem involved two possessives used together, as in ``his party's nomination" or ``their country's trade." This problem could be easily handled in a general manner.
Interpreting the other elements of the noun phrase shows a limitation of the algorithms, which shouldn't be surprising considering the difficulty of NP interpretation. The most significant problem was the interpretation of adjectives which do not fill thematic roles but portray a manner of the action. Examples include ``sole control," ``tight control," ``profitable trade," ``mass murder," and ``powerful defense." Related to this problem are other adjectives which are not manners of the action but could not be interpreted as thematic roles, e.g., ``foreign trade," ``extraordinary breath control," and ``important capture."
PPs were correctly attached and their meaning determined over 90% of the time. This shows that the verb's mechanism for handling PPs can be readily used by the nominalization interpretation algorithms. Most failures were due to ambiguous PP heads and cases where the nominalization took prepositions different from the verb, which were unanticipated.
The most comprehensive test of the system was the question answering test. This test not only shows the utility of this work but is also the most rigorous, because every aspect of the system must work for it to perform well.
Having previously defined the questions the system was to answer, the first step of this test was to determine about whom each question was going to be asked, i.e., the topic for each question. Topics could not be selected randomly because not every article contains an answer to every question. Therefore, articles were selected which had answers to the particular question under study.
Next, two human subjects with no linguistics or computer background were given the same CD-ROM version of the World Book Encyclopedia that the program used, and were asked to answer each of the questions. They were allowed to spend as much time as necessary to answer the questions completely. Because answering a question requires reading each article, the humans spent approximately 12-15 hours performing this task for all ninety-one questions. The result of this effort was a list of 168 consensus answers, which became the standard against which the program was compared. The humans' answers for the questions are presented in [Hull, 1997].
The system was then asked the same ninety-one questions. The total time it took the system to answer the same questions was 1099.9 seconds, or just over 18 minutes! The questions and the results are shown in a series of tables listed in [Hull, 1997], as are the program traces for the first twenty questions.
The output of the program was compared to the human answers and was judged to fall into one of five categories:
Recall and precision scores, common metrics used in measuring the success of information retrieval and information extraction systems, were calculated from the number of answers that fell into each category and the total number of possible answers (POS) and actual answers produced by the program (ACT).
These statistics were calculated for each question and are presented in [Hull, 1997].
The results of this testing show that a significant amount of knowledge can be acquired from encyclopedic texts. Human subject 1 (HS1) found 259 answers to the ninety-one questions. Human subject 2 (HS2) found 196 answers. The answers produced by both human subjects, the consensus answers, numbered 168. It was against this set of answers that the system's answers were compared. Thus, the system was attempting to match the answers that one would expect almost any person to find. SNOWY-BIOS, the extension of SNOWY for handling biographical articles, produced 137 answers, of which 126 were either correct (108) or partially correct (18). Only eleven answers were incorrect. The precision of the system was calculated to be 85%. Therefore, when the system presents an answer, it is very likely to be correct.
The recall of the system was calculated to be 54% when measured against the consensus answers: of the 168 answers in the consensus set, the system found 94 (86 correct and 8 partially correct). The system also produced 32 additional answers (22 correct and 10 partially correct) not contained within the consensus set. Therefore, a comparison of the number of correct answers produced by the system to the number of consensus answers shows that SNOWY-BIOS produced 70% of the total number of answers produced by the humans. These figures are quite remarkable, considering what is involved in generating this score. The system must be able to make sense of the query, which means the query must be successfully parsed and interpreted, in order to select the appropriate article from the encyclopedia. From this article, the skimmer must select the sentences which contain the answers. Each of these sentences must be successfully parsed and interpreted, with both explicit and implicit knowledge being formed into KL-SNOWY representation structures. These structures must then be integrated properly into LTM. Finally, the correct structures must be retrieved by the question answering system to answer the original query. This rigid measure of success is extremely challenging, and we know of no other published results describing a system of this nature.
As if the task were not hard enough, some of the human answers were clearly beyond the current state of the art. The humans' answers to general questions such as What did Sir Borden achieve? and What did Lenin believe in? showed a tremendous amount of common sense reasoning. For example, one answer produced by HS1 to What did Borden achieve? was "he became a teacher at age 14." Handling achievements of this sort in a general manner would require knowledge of the typical and exceptional ages of people holding various social roles. The problem lies not so much in encoding this knowledge now as in anticipating its importance beforehand.
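The percentages quoted above can be reproduced from the stated counts under one assumption that is ours and not stated in the text: a partially correct answer earns half credit, as in MUC-style scoring. The following Python sketch is illustrative only. The class `Tally` and the variable names are hypothetical, and the exact formulas used for the reported scores are not reproduced in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Counts of correct and partially correct answers (hypothetical helper)."""
    correct: int
    partial: int

    def credit(self, partial_weight: float = 0.5) -> float:
        # ASSUMPTION: a partially correct answer counts as half an answer.
        return self.correct + partial_weight * self.partial


POS = 168  # possible answers: the consensus set of the two human subjects
ACT = 137  # actual answers produced by the system

all_produced = Tally(correct=108, partial=18)  # over all 137 system answers
in_consensus = Tally(correct=86, partial=8)    # the 94 that match the consensus set

precision = all_produced.credit() / ACT  # 117 / 137
recall = in_consensus.credit() / POS     # 90 / 168
coverage = all_produced.credit() / POS   # 117 / 168, the "70%" comparison

print(f"precision = {precision:.0%}")
print(f"recall    = {recall:.0%}")
print(f"coverage  = {coverage:.0%}")
```

Under this half-credit assumption the three ratios round to the 85%, 54%, and 70% figures given in the text, which suggests, but does not prove, that this is how the scores were derived.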
The two scores of recall and precision are often combined into one score, called an F-measure. This one value attempts to normalize the tradeoff between recall and precision, i.e., one can usually increase recall at the expense of precision and vice versa. The formula for the F-measure is

$$F = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

where $P$ is precision and $R$ is recall. The parameter $\beta$ is the relative importance of recall vs. precision. Three values of $\beta$ are commonly used: $\beta = 1.0$, or P&R, which treats recall and precision equally; $\beta = 0.5$, or 2P&R, which emphasizes precision twice as much as recall; and $\beta = 2.0$, or P&2R, which emphasizes recall twice as much as precision.
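As a minimal sketch of how the three weightings behave, the following Python function assumes the F-measure takes the form used in the MUC evaluations, F = (β² + 1)PR / (β²P + R). The function name `f_measure` is hypothetical, and the inputs are the rounded precision (0.85) and recall (0.54) reported above, so the printed values are our recomputation and need not match the scores tabulated by the authors.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted combination of precision and recall (assumed MUC-style form)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # both scores are zero, so the combined score is zero
    return (b2 + 1.0) * precision * recall / denominator


P, R = 0.85, 0.54  # rounded scores reported in the text

for label, beta in (("P&R", 1.0), ("2P&R", 0.5), ("P&2R", 2.0)):
    print(f"{label:5s} beta={beta}: F = {100 * f_measure(P, R, beta):.2f}")
```

Because precision is much higher than recall here, the precision-weighted setting (β = 0.5) gives the largest value and the recall-weighted setting (β = 2.0) the smallest.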
The F-measure scores clearly show that the advantage of this approach lies in its very high precision. The highest F-measure scores posted during the 4th Message Understanding Conference (MUC-4) were 57.05 (P&R, General Electric), 54.92 (2P&R, Univ. of Massachusetts), and 60.07 (P&2R, General Electric)[Sundheim, 1992]. These scores were determined for the task of extracting information about South and Central American terrorist activities from real newswire texts. We have presented the MUC-4 results, not for direct comparison because their task and ours are very different, but to establish numbers which reflect the current state of the art.
The MUC-5 competition used a new measure for comparing machine and human performance, that of error. Error is defined as
The error rates for the best systems in the MUC-5 competition were in the upper 60s (lower is better), while our system achieved a score of 45.
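For readers who wish to experiment with this measure, the sketch below assumes the MUC-5 "error per response fill" definition as we understand it: (incorrect + partial/2 + missing + spurious) divided by the total of all five counts. The function name and the mapping of this section's counts onto those categories are our assumptions. In particular, treating the 74 consensus answers the system did not find as missing, and counting no answer as spurious, is only one possible reading of the figures.

```python
def error_per_response(correct: int, partial: int, incorrect: int,
                       missing: int, spurious: int = 0) -> float:
    """Fraction of response fills in error (assumed MUC-5 style definition)."""
    total = correct + partial + incorrect + missing + spurious
    if total == 0:
        return 0.0  # nothing was scored
    # ASSUMPTION: a partially correct answer counts as half an error.
    return (incorrect + 0.5 * partial + missing + spurious) / total


# Hypothetical mapping of the counts reported in this section:
# 108 correct, 18 partially correct, 11 incorrect, and 168 - 94 = 74 missed.
err = error_per_response(correct=108, partial=18, incorrect=11, missing=168 - 94)
print(f"error = {100 * err:.1f}")
```

Under this reading the value falls close to the reported score of 45, but the agreement should be treated as suggestive and not as confirmation of how that score was computed.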
The main opportunity for improving the system's F-measure score lies in increasing its recall. As of spring 1997, the parser was producing one correct parse for 70-75% of the sentences fed to it. Therefore, 0.75 is the best recall score we could realistically hope to achieve without an effort to improve the parser and the pre-processor. By examining the sources of missed answers, we believe that raising the recall score towards 0.75 is certainly possible.
Answers were missed for several reasons. Ten answers were missed because the program was not able to recognize the titles of publications. For example, both human subjects found four answers in the next two sentences from the Conrad Richter article (prompted by the question What did Richter write?):
He is best known for The Awakening Land, a trilogy about a pioneer family in Ohio. It consists of The Trees (1940), The Fields (1946), and The Town (1950).
Because the system did not recognize "The Awakening Land" as written communication, it didn't infer that Richter wrote it. And even if it had, it would have been hard pressed to realize that "The Trees," "The Fields," and "The Town" were the three parts of the trilogy.
At least four answers were missed because the system didn't anticipate some ways that an answer could be implied, and consequently the sentences with the answers were not even selected. The four missed answers were from the question Where was John Quincy Adams sent? Both human subjects found answers in these sentences:
In hindsight, three out of four of these answers could have been produced if an inference rule had been written for the verbal concept appoint. Several answers were missed due to failures of the interpreter when confronted with conjoined constituents. Pre-processing and parsing errors led to several more. Each of these problems appeared to be either lexical in nature, or due to some failure (bug) of the code.
Comparison of the system's performance to that of the human subjects revealed one additional strength of the program. On longer articles, the human performance noticeably degraded, due either to boredom or to a lack of stamina. It was here that the tireless system was able to find answers that the humans had overlooked.
We have been unable to find any systems described in the literature which automatically acquire knowledge from encyclopedic texts. The closest we have found is MURAX[Kupiec, 1993], a program which attempts to answer Trivial Pursuit questions from an on-line encyclopedia using shallow natural language analyses and heuristic scoring of "answer hypotheses". This approach does not acquire knowledge, i.e., knowledge representation structures, but returns to the user a ranked list of noun phrases likely to answer the question.
Several knowledge-based approaches to interpretation of nominalizations can be found in the current literature. PUNDIT is a system for processing natural language messages, which was used for understanding failure messages generated on Navy ships[Dahl et al., 1987]. Nominalizations in PUNDIT are handled syntactically like noun phrases but semantically as clauses, with predicate/argument structure. In fact, PUNDIT uses the same decomposition as the associated verb. Special nominalization mapping rules are used to handle the diverse syntactic realization of constituents of nominalizations. Some components of our approach are similar; nominalizations inherit selectional restrictions and syntactic mappings from their associated verbal concepts and can have their own specialized mappings when appropriate. PUNDIT avoids handling the ambiguity of the nominalization, including the ambiguity between the verbal and non-verbal senses and the polysemy of the nominalized verb. KERNEL[Palmer et al., 1993], a successor of PUNDIT, treats nominalizations in much the same way. Voorhees[Voorhees, 1993] and Li et al.[Li et al., 1995] both use WordNet as a source of disambiguation information, but neither addresses interpretation of nominalizations.
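The inheritance scheme mentioned above, in which a nominalization reuses the mappings of its verbal concept unless it has a specialized mapping of its own, can be illustrated with a small Python sketch. Everything in it is hypothetical: the class names, the slot labels, the role names, and the sample entries for "control" are invented for illustration and are not taken from SNOWY or PUNDIT.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VerbalConcept:
    name: str
    # Maps a syntactic slot (e.g. a preposition) to a thematic role.
    mappings: Dict[str, str]


@dataclass
class Nominalization:
    noun: str
    verb: VerbalConcept
    # Specialized mappings that override or extend those of the verb.
    own_mappings: Dict[str, str] = field(default_factory=dict)

    def role_for(self, slot: str) -> Optional[str]:
        # Prefer the nominalization's own mapping, else inherit from the verb.
        if slot in self.own_mappings:
            return self.own_mappings[slot]
        return self.verb.mappings.get(slot)


# Hypothetical entries: the noun takes "of" and "over" where the verb takes an object.
control_verb = VerbalConcept("control", {"subject": "actor", "object": "theme", "with": "instrument"})
control_noun = Nominalization("control", control_verb,
                              {"genitive": "actor", "of": "theme", "over": "theme"})

print(control_noun.role_for("over"))     # specialized mapping of the noun
print(control_noun.role_for("with"))     # inherited from the verbal concept
print(control_noun.role_for("against"))  # unknown slot, no role found
```

A lookup that falls through to `None`, as in the last call, corresponds to the kind of unanticipated preposition that accounted for some of the PP failures reported earlier.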
Given the results of the tests described above, it is clear that knowledge can be acquired automatically from encyclopedic texts regarding historical people. Two problems present in this domain, deverbal nominalizations and diverse relationships, were overcome in order to answer many questions on par with human performance. We hope to extend this work further to encompass other domains within the World Book and other on-line encyclopedias.