Beyond Full-text Search: AI-Based Technology to Support the Knowledge Cycle

David M. Steier, Scott B. Huffman, Douglas I. Kalish

Price Waterhouse World Technology Centre

68 Willow Road

Menlo Park, CA

Abstract

From the mounds of raw information available electronically today, what professionals really need are targeted, timely nuggets of knowledge that can guide the solution to business problems. Today's common information tools -Web full-text search engines and the like - do not fully support this conversion of raw information into knowledge. In examining the common knowledge management problems faced by Price Waterhouse professionals, we have found that converting information to knowledge requires not only finding raw information, but also filtering through it for relevance, formatting it appropriately for the knowledge task at hand, and forwarding it to the right people. A fifth stage, feedback from the users, can allow the effectiveness of each stage to increase with time. In this paper, we describe each stage of this knowledge cycle and discuss the potential role that AI-based technology can play in its automation. We illustrate the possibilities through case studies of deployed knowledge management tools we have built at Price Waterhouse. These tools demonstrate that for targeted business tasks, AI-based technology can potentially facilitate much of the knowledge cycle, providing users with useful business knowledge that provides competitive advantage.

Introduction

In recent years there has been an explosion in the availability of electronic information. The World-Wide Web, newswire feeds, SEC filings and other corporate reports, government documents, litigation records, and much more are all available electronically and inexpensively. However, this treasure-trove of raw information has proven difficult to exploit. It is often difficult to find the information relevant for a particular task or decision. Even if relevant information can be found, it is often in the wrong form, requiring significant collation, reorganization, etc., to be useful. Information from different electronic sources must be combined, and this can require time-consuming conversion and normalization to make figures comparable and terminology consistent. Finally, although it is easier than ever to share information electronically (through email, electronic bulletin boards and databases, etc.), in a large organization it can be difficult to get new information into the hands of those who could use it best.

In our experience at Price Waterhouse, AI-based technology can play a key role in dealing with these difficulties in managing knowledge. In this paper, we introduce the concept of the knowledge cycle - the path from raw information to useful knowledge - and use it to highlight the key technological needs in knowledge management. Information tools that are typically used today, such as full-text search engines on the Web, are useful to a point, but support only the early stages of the knowledge cycle. To demonstrate what can be done for the other stages, we present case studies of knowledge management tools that we have developed and deployed for specific business tasks within Price Waterhouse. Our goal, however, is not so much to describe the specific techniques, algorithms, etc., used by these tools - they have been described elsewhere - but rather to illustrate how well-targeted AI-based technology can significantly impact knowledge management problems in large organizations. The paper concludes with several principles to inform the design of future applications that draw useful, targeted business knowledge of various kinds from large volumes of raw information.

Figure 1: The Knowledge Cycle Value Chain

The Knowledge Cycle

Anyone who has had a full-text search tool give them thousands of "hits" in response to a query will agree that transforming raw information into knowledge involves much more than searching for a few words or phrases. Rather, the transformation can be broken into a cycle of four general stages:

The Find/Filter/Format/Forward stages represent a general "value chain" in converting any information into knowledge, as shown in Figure 1. For a given information to knowledge transformation, the effectiveness of each stage and of the cycle as a whole can be evaluated by performance measures including:

As an organization acquires experience in converting information to knowledge, it can function more effectively as future knowledge needs arise. Because the set of information sources and knowledge needs is diverse and constantly changes, it is impossible to anticipate all of the processing that will be required in the knowledge cycle for a given organization. A fifth stage, feedback, may provide the ability to adapt the first four stages to new circumstances. Feedback evaluates the performance of the previous stages in terms of performance measures such as

Figure 2 Continuous Improvement in the Knowledge Cycle

those mentioned above. Together, the five processes act on the corporate memory as shown in Figure 2.

To manage knowledge effectively, organizations must identify their highest value knowledge-based tasks, and for those tasks, they must identify and address the major bottlenecks in the knowledge cycle. Some of these bottlenecks will be organizational, and can be addressed by restructuring, retraining, corporate policy, etc. Other bottlenecks will be addressable by technology. Below, we discuss how technology can reduce bottlenecks in each stage of the knowledge cycle. We ground the discussion by presenting case studies of two applications we have developed and deployed at Price Waterhouse that automate the knowledge cycle for specific business knowledge needs.

Technology Enablers for the Knowledge Cycle

What technology will impact the knowledge cycle in a large organization like Price Waterhouse? We will begin with some very general answers, and then examine the case studies. General technology issues for each stage include:

Because of the diversity of knowledge cycle tasks, these general issues are hard to grasp outside of the context of specific applications. Next, we will describe two sets of these applications as case studies to ground our discussion of technology to support the knowledge cycle.

Case study #1: EDGAR data and financial benchmarking

Price Waterhouse, like many financial and consulting organizations, makes heavy use of the various types of information contained in the corporate SEC Filings of U.S. public companies. In the past, Price Waterhouse has spent large amounts obtaining filings in paper form from third-party data providers, and thousands of hours of staff time searching through these filings, rekeying, analyzing and formatting financial statement information and other portions of the text into reports, spreadsheets and charts. The information is used for a number of purposes, primary among them the analysis of company financials, and the benchmarking of one company's financials against other comparable companies within their industry. In recent years the SEC has made corporate filings available electronically through their EDGAR program. Because of PW's heavy use of this information and the large cost of performing financial benchmarking manually, this was a good target for the application of technology to the electronic filings.

For SEC filings, the knowledge cycle takes the following form:

We have built specialized technology for each of the Find/Filter/Format/Forward stages for this knowledge task, focusing in particular on the technology needed to find and interpret financial tables. We observed that keyword-based processing (for instance looking for phrases such as "total revenues") was not powerful enough alone for substantial automated analysis, but was much more effective when used in combination with expectations of the structure of SEC forms and of financial statements. In particular we found that incorporating knowledge of expected arithmetic relationships between line items, both within and across tables, allows for very precise interpretation of financial statements. We call this technique constraint-driven table parsing.

The technology that parses EDGAR filings has been deployed as the basis of several applications within PW. The EDGAR filings database is a Lotus Notes database containing an index to the over 20,000 filing entities so that people can find companies and request filings, portions of filings, or benchmarking information. This database has thousands of users within Price Waterhouse and processes over 200 requests a day. Several other tools provide benchmarking capabilities. EdgarScan (accessible on the Web as http://edgarscan.tc.pw.com) offers the user the ability to access filing sections, analyzed financial statements and their footnotes, financial ratios, and some rudimentary charting abilities. A Windows-based application called Benchmarking Assistant provides our practice with the ability to perform more sophisticated benchmarking based on data from EDGAR and CompuStat®. Another specialized benchmarking capability allows members of PW's Tax practice to request a detailed comparison of companies' tax reconciliation tables, found in the income tax footnotes of their annual Form 10-K filings. The results are returned in the form of a spreadsheet, showing the components of the tax reconciliation, such as state taxes, normalized to a percentage for easy comparison, and a hyperlinked file showing the tax footnotes as they appear in the 10-K. With all these applications, automating search, extraction, normalization, and formatting of SEC documents has given PW a substantial gain in efficiency, and consequent reduction in "time-to-market," for new knowledge drawn from financial statements.

Case study #2: Management changes extracted from newswires

In addition to corporate financials, it is important for PW professionals to track other business events as they occur. One type of event that is particularly important for PW is executive management changes at large companies. These are reported in press reports and newspaper articles, available electronically via newswire feeds.

Due to the huge volume of newswire articles each day (one service PW uses provides over 5000 articles daily), it would be extremely difficult to track management changes manually. Conventional technology, such as full text search, can reduce the number of articles but cannot pinpoint those that definitely contain management changes, and does not extract relevant information such as the company and person involved. Third-party "clipping services" can find the relevant articles, but often not in a timely manner, and not with relevant information extracted and searchable in an electronic form. Therefore, we decided to build specialized technology for tracking business events like management changes.

For this task, the knowledge cycle takes the following form:

The ODIE extraction system [Huffman, 1996] has been the key to automating the management changes knowledge cycle. ODIE exploits the fact that the language used in business news articles is stylized, and uses a relatively small number of syntactic patterns to express most instances of particular types of business events. The system performs a shallow, efficient linguistic analysis of newswire texts, to find specifically those syntactic patterns that indicate a business event such as a management change. Extraction patterns are represented as paths through a non-deterministic finite state machine; embedded finite state machines are used to recognize syntactic relationships. The overall technique has some similarities to SRI's Fastus [Hobbs et al., 1992] and UMass's CIRCUS [Lehnert et al., 1993] extraction systems. ODIE has been applied to filter both European and US newswires. In addition to management changes, we have experimented with filtering for other types of events, such as corporate acquisitions.

An important bottleneck in applying extraction systems like ODIE to the extraction of new events is determining the set of extraction patterns that indicate the event's presence in the newswire texts. As a step towards overcoming this difficulty, we have produced a feedback system for ODIE. The feedback system, called LIEP, allows a user to input examples of texts paired with the events that should be extracted from them [Huffman, 1996]. LIEP analyzes each input text and uses a combination of analytical and inductive machine learning techniques to induce patterns that will extract the indicated events. Thus, ODIE can be "trained" to extract new kinds of events without further programming effort. We used this feedback method, for instance, to train ODIE to extract simple corporate acquisitions. Related systems for learning text extraction patterns include AutoSlog [Riloff, 1993], AutoSlog-TS [Riloff, 1996], PALKA [Kim & Moldovan, 1995], and CRYSTAL [Soderland et al., 1995].

Discussion

To date we have focused primarily on automated assistance for filtering and formatting, and less for the other portions of the knowledge cycle; this balance will shift in future work. We are particularly optimistic, based on our experience with the management changes task, about using Feedback to shorten the technology development life cycle. We plan to explore the potential of Feedback in developing applications that track events such as new legislation, new product announcements or industry developments, etc.

Generalizing over these applications, the following principles emerge that provide guidance for future development of knowledge management tools:

These principles, together with the knowledge cycle introduced in this paper, provides a framework for using technology to remove the bottlenecks in efficient knowledge creation and use. Our case studies have demonstrated the importance of considering technological support for all stages of the knowledge cycle. Efficiencies at each stage multiply out to reduce the amount of user effort by several orders of magnitude. For instance, the management changes application reduces the effort from reading five thousand articles received on a daily newswire to only looking at a few management change events with the right information extracted and categorized. The reduction in effort translates to an earlier time to market with new knowledge. Similarly, people within Price Waterhouse can get the financial data extracted from EDGAR filings weeks or even months before it is available from on-line services. Granted, the extensive task analysis and support required for knowledge cycle automation can be arduous and is only worth undertaking for very high-value applications. For wisely chosen applications, however, organizations that invest in appropriate AI technologies for knowledge management - parsing, information extraction, intelligent search, user profiling & document forwarding, and the like - will receive major returns.

References

  • [Bloedorn et al., 1996]. Bloedorn, Eric, Mani, Inderjeet, and MacMillan, T. Richard, "Machine learning of user profiles: Representational issues," in Proceeding of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), AAAI Press, 1996.
  • [Harman, 1993]. Harman, Donna, "Overview of the first Text Retrieval Conference," in Proceedings of the sixteenth annual ACM conference on research and development in Information Retrieval, pp. 36-47. Association for Computing Machinery, 1993.
  • [Hobbs et al., 1992]. Hobbs, J. R., Appelt, D. E., Bear, J. S., Israel, D. J., and Mabry Tyson, W. "FASTUS: A system for extraction information from natural-language text." Tech. Report No. 519, SRI International, 1992.
  • [Hobbs, 1993]. Hobbs, J. R. "The generic information extraction system," in Proceedings of the Fifth Message Understanding Conference (MUC-5). Morgan Kaufmann, San Mateo, CA, 1993.
  • [Huffman, 1996]. Huffman, Scott B. "Learning information extraction patterns from examples", in Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, ed. S. Wermter, E. Riloff, and G. Scheler, pp. 246-60. Springer-Verlag, Berlin, 1996.
  • [Huffman and Steier, 1995]. Huffman, Scott B. and Steier, David, "Heuristic joins to integrate structured heterogeneous data," in Working Notes of the AAAI Spring Symposium on Information Gathering in Heterogeneous Distributed Environments. American Association for Artificial Intelligence, 1995.
  • [Kim and Moldovan, 1995]. Kim, Jun-Tae and Moldovan, Dan I. "Acquisition of linguistic patterns for knowledge-based information extraction." IEEE Transactions on Knowledge and Data Engineering, 7(5):713-24, 1995.
  • [Krulwich and Burkey, 1996]. Krulwich, Bruce, and Burkey, Chad, "The ContactFinder agent: Answering bulletin board questions with referrals" in Proceeding of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), AAAI Press, 1996.
  • [Larkin and Simon, 1987]. Larkin, Jill H. and Simon, Herbert A., "Why a diagram is (sometimes) worth ten thousand words." Cognitive Science, 11(65-99), 1987.
  • [Riloff, 1993]. Riloff, Ellen. "Automatically constructing a dictionary for information extraction tasks" in Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), pp. 811-16, 1993.
  • [Riloff, 1996]. Riloff, Ellen. "Automatically generating extraction patterns from untagged text" in Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), AAAI Press, pp. 1044-49, 1996.
  • [Soderland et al., 1995]. Soderland, S., Fisher, D., Aseltine, J., and Lehnert, W. "CRYSTAL: Inducing a conceptual dictionary" in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pp. 1314-9. Morgan Kaufmann, 1995.