Managing Inductive Knowledge

David Jensen

Experimental Knowledge Systems Lab

Computer Science Department

University of Massachusetts

Amherst, MA 01003-4610

jensen@cs.umass.edu

Tools for inducing knowledge from databases, often grouped under the term *knowledge discovery*, are becoming increasingly important to organizations in business, government, and science. However, relatively little attention has been paid to the long-term management of induced knowledge. Induced knowledge presents unique challenges, including managing statistical significance and inductive bias. These two challenges have important implications for valid and efficient knowledge management.

Typically, these systems are concerned with

This paper argues that induced knowledge has at least two unique characteristics, and that these characteristics impose special requirements on knowledge management systems. The first characteristic concerns statistical significance; the second concerns inductive bias.

If knowledge management systems contain inductively derived knowledge, but fail to account for these unique challenges, they will fall prey to several pathologies. These include faulty estimates of validity, missed opportunities to discover useful relationships, and redundant search efforts.

The remaining three sections support these claims. The first two sections introduce statistical significance and inductive bias, provide examples, and present implications. Readers who already understand these concepts may wish to skip the front portions of these sections, but they are provided for completeness. The third section briefly discusses system design issues in the context of these characteristics.

A reference distribution indicates the frequency of a statistic's values that would be expected under the null hypothesis -- that is, if no real relationship were present in the data.

As shown schematically in Figure 1, 5.9% of the reference distribution for G is equal to or greater than 3.55, indicating that p = Pr(G ≥ 3.55 | H0) = 0.059, where H0 denotes the null hypothesis. The probability p can be very small, but it is always non-zero.
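The reference distribution for the G statistic is chi-square. Assuming one degree of freedom (consistent with the 5.9% tail quoted for G = 3.55), the tail probability can be computed with the standard library alone, because the chi-square survival function with 1 df reduces to erfc(sqrt(x/2)). A minimal sketch:

```python
import math

def g_test_p_value(g: float) -> float:
    """Tail probability Pr(G >= g) under a chi-square reference
    distribution with one degree of freedom, whose survival function
    has the closed form erfc(sqrt(g / 2))."""
    return math.erfc(math.sqrt(g / 2.0))

p = g_test_p_value(3.55)   # about 0.0595 -- the 5.9% tail area in Figure 1
```

For other degrees of freedom a statistics library's chi-square survival function would replace the closed form.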

The probability p is distinct from what could be called the

Statistical significance is also distinct from the probability that a particular model is ``correct.'' It is a mistake to think that, merely because a model is statistically significant, it is necessarily correct. Indeed, the actual relationship could have a different functional form than the induced model, less (or more) inferential uncertainty, different parameter values, additional (or fewer) variables, latent variables, and many other differences.

Instead of a guarantee, statistical significance is only a minimum indicator of validity. If an observed relationship can be explained entirely as chance variation (e.g., p is very large), then there is little need to investigate further. If p is very small, then additional questions about the form and content of the relationship are worth investigating.

The discussion above suggests a design requirement for knowledge management systems: an estimate of p should be calculated and stored along with knowledge that has been derived inductively. This estimate can be used, along with other information, to judge the validity of an induced model. Different uses may imply different desired levels of statistical significance. For example, medical treatments that are expensive or dangerous might be required to meet higher standards of statistical significance than treatments that are cheap and relatively benign.

These factors raise serious issues for knowledge management. The complexities arise because

The dependence on data is reasonably obvious. The probability

Equation 2 is one of a class of Bonferroni equations, commonly used to adjust statistical tests for multiple comparisons and more recently applied to induction algorithms. The adjustment is necessary because the reference distribution for G is constructed under the assumption of a single comparison of a model to a data sample. Making multiple comparisons renders this reference distribution inaccurate.
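One common member of the Bonferroni class (assumed here; the specific form of Equation 2 may differ) adjusts a per-comparison probability p for k independent comparisons as p' = 1 - (1 - p)^k, the probability that at least one of the k comparisons exceeds its critical value by chance alone. A minimal sketch:

```python
def bonferroni_adjust(p: float, k: int) -> float:
    """Probability that at least one of k independent comparisons,
    each with per-comparison tail probability p, appears significant
    by chance alone."""
    return 1.0 - (1.0 - p) ** k

# A result significant at the 10% level looks far weaker once we
# account for 20 independent comparisons against the same data:
print(f"{bonferroni_adjust(0.10, 1):.2f}")   # prints 0.10
print(f"{bonferroni_adjust(0.10, 20):.2f}")  # prints 0.88
```

The k = 20 case mirrors the twenty-analyst scenario discussed later: individually "significant" results become unremarkable once all the comparisons are counted.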

A Bonferroni equation assumes that the comparisons are independent, an assumption that the comparisons made by induction algorithms often violate.

In addition to a Bonferroni equation, there are several other techniques that can be used to compensate for multiple comparisons even when those comparisons are non-independent. These include randomization tests, a method of empirically constructing reference distributions based on analyzing random data, and cross-validation, a method of systematically providing fresh data for evaluating the results of extensive search.
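The randomization-test idea can be sketched concretely (the statistic and data here are illustrative assumptions, not the paper's): shuffle the class labels to destroy any real association, recompute the statistic many times, and treat the resulting empirical distribution as the reference distribution.

```python
import random

def randomization_p_value(score, xs, ys, n_shuffles=1000, seed=0):
    """Empirical p-value: the fraction of label-shuffled data sets whose
    statistic is at least as extreme as the one actually observed."""
    rng = random.Random(seed)
    observed = score(xs, ys)
    ys = list(ys)                  # shuffle a copy, not the caller's list
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(ys)            # destroy any real x-y association
        if score(xs, ys) >= observed:
            extreme += 1
    return extreme / n_shuffles

# Toy statistic: absolute gap between the class means.
def mean_gap(xs, ys):
    a = [x for x, y in zip(xs, ys) if y == 1]
    b = [x for x, y in zip(xs, ys) if y == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))

xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
p_empirical = randomization_p_value(mean_gap, xs, ys)  # small: the gap is real
```

Because the reference distribution is built from the data themselves, this approach needs no independence assumption about the comparisons.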

- *Unintentional data reuse:* A model M is derived based on a sample of data D. It is stored without any references to data. Later, M is tested on data to verify its accuracy. Without records of how M was derived, it would appear that M has been independently verified. Unfortunately, M was ``verified'' on D, the same sample used to derive it. Potential mistakes of this kind can only be avoided if a link is maintained to the original data used to derive a model.
- *Uncoordinated, distributed search:* Twenty analysts work independently on data sample D, each evaluating the accuracy of a different model. One analyst's model is statistically significant at the 10% level. Without the information that other analysts are conducting similar analyses, it would appear that a significant relationship has been identified. Considering the actions of all the analysts (e.g., by using equation 2), the result is not statistically significant. This indicates the importance of maintaining records of the uses of different data samples.
- *Ignoring sample size:* Two models M1 and M2 are induced and stored with estimates of their inferential uncertainty -- the percentage of incorrect predictions. Later, they are compared and found to be equally useful. Unfortunately, model M1's estimate was based on a data sample with 1000 instances; model M2's estimate was based on a data sample with only 10 instances. While the two models have identical inferential uncertainty, the first estimate is far more reliable. Judgments of this kind can only be made if a knowledge management system retains some information about statistical significance or the data sample used to derive a model.
- *Incremental induction:* A model is developed on a small data sample and, while suggestive of an interesting relationship, it does not exceed a prespecified critical value. Another small sample of data becomes available later, but it is also too small to confer statistical significance to the model. However, the relationship would be significant if considered in the context of both data samples together. This indicates the importance of maintaining both tentative models and links to the original data.
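One standard technique for pooling evidence across independent samples, offered as an illustration rather than as the paper's prescribed mechanism, is Fisher's method: under the null hypothesis, X = -2(ln p1 + ln p2) follows a chi-square distribution with 4 degrees of freedom, whose survival function has the closed form exp(-x/2)(1 + x/2).

```python
import math

def fisher_combine(p1: float, p2: float) -> float:
    """Combined p-value for two independent tests via Fisher's method."""
    x = -2.0 * (math.log(p1) + math.log(p2))
    # chi-square survival function with 4 degrees of freedom
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Neither sample is significant alone at the 5% level...
p1, p2 = 0.08, 0.09
combined = fisher_combine(p1, p2)
print(f"{combined:.3f}")  # prints 0.043 -- significant once pooled
```

This is exactly the incremental-induction situation: two individually inconclusive samples that are jointly significant, which is only detectable if the tentative model and its data links were retained.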

Machine learning researchers label these factors *inductive bias*.

One of the simplest factors that inductive bias can express is the intensity of search. If we know that an algorithm has examined only a few potential models, we may wish to devote additional resources to searching a larger space. In contrast, if an algorithm examines a large search space, and can make guarantees about finding accurate models within that space, then we can eliminate that space from future analyses that use the same data, and concentrate on other potential search spaces.

Particular inductive biases can be appropriate or inappropriate for particular domains. Most obviously, if some important relationships cannot be represented within the language an algorithm uses to express models, then no amount of searching will find those relationships. In addition, some forms of procedural bias are effective within some domains, but not in others.

- *Misspecified Search:* Other sources of knowledge in a particular domain (e.g., domain experts) indicate that useful knowledge will be of a specified form. An analyst might apply a particular algorithm with the expectation that it examines models of a particular form, when it actually does not. Information about the algorithm's bias would help determine what space of models it will search.
- *Redundant Search:* A data sample is analyzed with induction algorithm A1. Later, an attempt is made to extend the previous analysis by searching with algorithm A2. Unfortunately, the two algorithms have almost precisely the same inductive bias, making the second search redundant. Clear specifications of each algorithm's inductive bias could be used to prevent such redundancy.
- *Oversearching:* Recent results comparing induction algorithms employing heuristic search with algorithms employing exhaustive search have shown that, paradoxically, algorithms using heuristic search produce models that are more accurate on new data. This phenomenon can be explained as an effect of multiple comparisons. Accounting for those effects relies on being able to accurately characterize the search spaces of different algorithms.

- *The size and identity of data samples* used to induce particular models. That is, data management and knowledge management need to be linked.
- *The number and types of models* examined by induction algorithms. That is, induction algorithms and knowledge management need to be linked.

In many situations, however, long-term management of induced knowledge is desirable. We wish to build on previously induced relationships and make use of data and computational resources in the most efficient way possible. How can knowledge management systems provide the information needed to do this, without requiring knowledge management to be deeply integrated with other systems?

One way is to divide functions into components for knowledge management, data management, induction, and performance:

- The *knowledge management* component stores, organizes, and facilitates maintenance of represented knowledge. Each model contains a record, interpretable by the data management component, of the data sample used to induce it, and a record, interpretable by the induction component, of the bias used to induce it.
- The *data management* component stores data used by the induction component, and provides records of the samples used to induce particular models.
- The *induction* component creates new models and provides records of the inductive bias used to induce them.
- The *performance* component makes inferences based on models in the knowledge management component.
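The component responsibilities above suggest the shape of a per-model record. A minimal sketch, with field names that are illustrative assumptions rather than anything the paper specifies:

```python
from dataclasses import dataclass, field

@dataclass
class DataSampleRecord:
    """Interpretable by the data management component."""
    sample_id: str
    n_instances: int

@dataclass
class BiasRecord:
    """Interpretable by the induction component."""
    algorithm: str
    models_examined: int   # raw intensity of the search
    search_limits: dict = field(default_factory=dict)  # e.g. max tree depth

@dataclass
class ModelRecord:
    """Stored by the knowledge management component."""
    model_id: str
    p_value: float           # statistical significance estimate
    data: DataSampleRecord   # link back to the inducing sample
    bias: BiasRecord         # link back to the inducing search

m = ModelRecord(
    model_id="M1",
    p_value=0.059,
    data=DataSampleRecord(sample_id="D", n_instances=1000),
    bias=BiasRecord(algorithm="decision-tree", models_examined=500,
                    search_limits={"max_depth": 5}),
)
```

Keeping the data and bias records as separate, component-interpretable structures is what allows later analyses to detect data reuse, redundant search, and insufficient sample sizes.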

Records of inductive bias are somewhat more problematic. Part of the inductive bias concerns representation language -- a constant for any individual induction algorithm. However, a compact record of the path of a heuristic search is not so simple to achieve. At a minimum, induction algorithms could record the raw number of models examined during search and rough limits of the search (e.g., the depth of a decision tree or the number of separate rules in an induced ruleset). Some interactive approaches to induction (e.g., visualization) have an inductive bias that is almost impossible to characterize. Even in these cases, however, records could be kept about the number and types of relationships explored.

Clearly, this discussion only sketches how a knowledge management system might be designed to accommodate inductive knowledge. However, it identifies some key characteristics of such a system -- links to both the induction and data management systems. Given the implications of statistical significance and inductive bias, these characteristics would seem essential to a system that effectively manages inductive knowledge.

© David Jensen