Very few methodologies exist for ontology acquisition [for example, Uschold, 1995; Borst&al, 1996]. In fact, in recent years ontological frameworks have concentrated mainly on formal languages for expressing ontologies (ontology representation). We argue that the main reservation in discussing the methodology issue is linked to the difficulty of solving problems such as modelling stopover and knowledge relevance.

2.1 Bottlenecks in ontology acquisition: relevance, stopover and task

How do we state what is conceptually relevant? And when do we stop detailing the explicitation of a conceptualization, i.e., when do we stop refining an ontology? These problems have appeared in various forms in the past. The classic distinction between terminological and assertional knowledge in KL-ONE languages tried to superimpose a formal criterion on an ontological problem; but what is terminological, and how do we draw the border? (An interesting study on the difficulty of representing domain (medical) knowledge in such an environment is [Haimowitz, Patil, Szolovits, 1988].)

Analogous forms of the stopover problem are the linguistic debate between dictionary-type and encyclopaedic-type definitions [Eco, 1984], and the logical riddle of analytic versus synthetic categories [Ajdukiewicz, 1958].

[Marconi, 1994] gives an account of the problem in terms of a plausibility metric, reflecting the fact that relevant/non-relevant is a matter of degree, as explicitly admitted in the linguistic and logical works quoted above. Other accounts have been given by evoking contextual solutions (for instance, the contextual triggering of CYC [Lenat, 1990], and contextual logics [McCarthy, Buvac, 1994; Bouquet, Giunchiglia, 1995]).

The interest of such approaches notwithstanding, one still lacks an ontological criterion. This is quite understandable, because relevance and stopover depend on the task, and tasks can be modelled only for specialized, very limited, and conventional protocols of planning and acting. In other words, relevance and stopover need a distinction between local and global conceptualization (par. 1.1).

Our solution is thus to integrate knowledge sources that have been developed by experts for given tasks, with the consequent contextual, local stopovers. Moreover, local stopover methods share an implied judgment of relevance for the concepts selected in the repositories.

Obviously, not all domains allow such a solution. We tried with medicine and the results seem encouraging.

2.2. Basicality

We said that stopover, and hence relevance, depend on the task. Some cognitive science results show that task-relevance translates, within cognition, into familiarity, affordability, or basicality. Basicality is a notion originally used for everyday natural language, following the finding that basic concepts are neither top-level nor terminal [Rosch, 1976; Lakoff, 1990]. For example, in the semantic field concerning chairs, sitting, etc., chair would be basic, chaise-longue an overdetermination, and furniture a subdetermination.

Several strategies can be followed for evaluating the basicality of concepts; for example, a concept is more basic when it is correlated to a gestalt (par. 1.1), whether a perceptual, cultural, or linguistic gestalt; thus, a concept is more basic insofar as it has:

1) a rich perceptual and sensori-motor description (rich associations in its domain);

2) a unique, highly disposable mental image (providing fast identification);

3) much knowledge/cultural functions organized around it;

4) many linguistic gestalts: terms, synonyms, and syntagmatic combinations; and insofar as it is:

5) lexically frequent/productive/neutral;

6) early named/acquired/understood by beginners.
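The six criteria can be sketched as a simple scoring scheme. The code below is a minimal illustration, not the paper's method: criterion names, the 0..1 ratings, and the averaging are all hypothetical, standing in for the psychological and linguistic investigation described in par. 2.2.

```python
# A minimal sketch (all names, ratings, and the averaging are hypothetical):
# scoring a concept's basicality against the six criteria listed above,
# each rated on a 0..1 scale by empirical investigation.

CRITERIA = [
    "perceptual_description",   # 1) rich perceptual/sensori-motor description
    "mental_image",             # 2) unique, highly disposable mental image
    "cultural_functions",       # 3) knowledge/cultural functions around it
    "linguistic_gestalts",      # 4) terms, synonyms, syntagmatic combinations
    "lexical_frequency",        # 5) lexically frequent/productive/neutral
    "early_acquisition",        # 6) early named/acquired/understood by beginners
]

def basicality(scores: dict) -> float:
    """Average the available criterion scores; missing criteria are skipped."""
    rated = [scores[c] for c in CRITERIA if c in scores]
    return sum(rated) / len(rated) if rated else 0.0

# 'chair' should rank above its subdetermination 'furniture':
chair = {"mental_image": 0.9, "lexical_frequency": 0.9, "early_acquisition": 0.9}
furniture = {"mental_image": 0.3, "lexical_frequency": 0.6, "early_acquisition": 0.5}
print(basicality(chair) > basicality(furniture))  # True
```

Averaging is of course only one aggregation choice; the point is that partial evidence (e.g., only test 4) can still yield a usable ranking.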

Even though basic concepts are primarily those of ordinary common sense, special domains also exploit basicality for creating prototypical effects and highly ranked, 'central' terms.

The six tests on the basicality of concepts require psychological, anthropological, and linguistic investigation. Nevertheless, scientific domains usually have a rich production of standardized linguistic repositories; thus test 4) is the most accessible, and we can suppose that most of the other tests will be positive when 4) is positive. That is why we have undertaken the analysis of taxonomic terminological repositories in medicine.

However, scientific basicality is not an invariant: not only does it depend on the particular discipline, it depends on granularity as well. Granularity is extremely relevant precisely because of scientific method: structuring procedural tasks, creating layers, modelling layer frameworks, building reduced, formal worlds in which the scientist must learn a "specialized common sense". The strategies just mentioned will return in the discussion of the ontological commitment of MLC (par. 4).

2.3. Ontological integration

Integration of large knowledge bases is a major issue [Gruber, 1993; Musen, 1992; Neches&al, 1991; Fankhauser&al, 1991; Sujanski&Altman, 1994; EPISTOL, 1994], and it embraces taxonomic knowledge integration as well. Although much work has been devoted to the integration of data formats, and even to the integration of representation formalisms, a more challenging integration issue comes from the heterogeneity of the intended meaning of concepts; for example, when we consider a definition of viral hepatitis:

we have to understand what inflammation may mean in different taxonomic sources, or even within a single one:

All these inflammations are equally acceptable. However, one cannot figure out, from the organization of a single source, all the valid usages of that phrase in the world out there. Taxonomic sources and contexts select certain aspects of the global conceptualization, like pointing a finger at a site in the mind of an ideal, intersubjective expert.

Put differently, we need to move from the local heterogeneity of intended meanings to multi-local, polyhedric intended meanings. This move requires the comprehension of background knowledge.

We will outline our approach to the integration problem as based on the main assumption that any Domain Knowledge (DK) implies different kinds of background knowledge:

These kinds of background knowledge are related to the theory of meaning introduced in par. 1.1.

On the one hand, local contexts and domain concepts should be found within semantic fields. However, an ontological engineer may only have access to knowledge organization products (lexica, protocols). Therefore, the concepts for the semantic fields of a domain must be grasped from such sources.

On the other hand, operators for the semantic fields can be acquired from the description of general and domain theories, which are more or less formally available in the literature.

On these premises, we defined ONIONS, a methodology for the integration of heterogeneous taxonomic knowledge by means of source comparison and abstraction through general and domain theories.

Our experimental domain is medicine. Our research is variously related to others in that domain --GALEN [Rector, Gangemi, etc., 1994; GalenProject], CEN/TC251/pt003 [CEN, 1992], GAMES [Falasconi, Stefanelli, 1994], PROTEGE [Gennari, Tu, etc., 1994], and CANON [Evans, Cimino, etc., 1994].

Taxonomic sources in medicine --such as classifications, nomenclatures, semantic networks-- are dependent on ontologies, usually implicit, but coherent with specific tasks (epidemiology, indexing, retrieval, acquisition, expert systems) [RossiMori, Gangemi, etc., 1993], as suggested in par. 2.1.

ONIONS is synthetically represented in Fig. 4.

Fig. 4: A (simplified) data flow diagram of the procedure of ontological integration: from heterogeneous terminological sources to an integrated model.

ONIONS creates a common framework to generalize and integrate the definitions used to organize a set of terminological sources. In other words, it allows one to coherently work out a domain ontology for each source, which can then be compared with the others and mapped to an integrated model. Our work has two main goals:

a) generalizing a framework to integrate terminological knowledge from various medical sources by analysing their domain ontologies, and

b) defining a new and open domain ontology which merges a general ontology with various domain ones.

The first aim of this ontological integration was to develop a core model of medical concepts within GALEN [Galen Project]. Afterwards, we enlarged it and used it to support a conceptual convergence among different terminologies.

Current efforts are mainly addressed to extending the ontological knowledge base to map larger parts of the sources, and to applying ONIONS to microdomain integration, where a fruitful collaboration has been established with physicians working on standardization in the microdomains of medical procedures and vital signs.

2.4. Some details on the methodology

2.4.1 Phase I): Extracting source terms At the beginning (Fig. 5), the ontological engineer selects the most relevant sets of terms from terminological sources (source terms): code definitions or keywords from classifications, nomenclatures, coding systems, and thesauri. This phase has hooks to corpora formation techniques and to the definition and acquisition of textual types (not examined in this paper).

Given a corpus, the order of the terms contained in each single source is inferred. The sources used so far for the medical ontology are mostly taxonomies (in one case a semantic network, in another a description logic model, sometimes flat standardized lists; see par. 2.4.2).

The order is exploited to identify the top-level concepts in the source, and then the top-level concepts are used to choose a depth limit in the hierarchy (for example, in the medical ontology application, we chose to truncate the body part lexical field hierarchy, in the vessel branching, at kinds of vessel, not including instances of arteries, veins, etc.). Since our main scope was to integrate general medical taxonomic knowledge, the detailed taxa for anatomy are excluded. This seems sound to the extent that a specialized microdomain integration could be done in a further phase (for example, an extension of the current medical ontology ON8.5 to angiology).
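The depth-limited truncation just described can be sketched as a bounded traversal. The taxonomy fragment below is a toy illustration (the data structure and the concrete depth value are assumptions, not the ON8.5 encoding):

```python
# A minimal sketch (hypothetical data structure): truncating a source taxonomy
# at a chosen depth limit below a top-level concept, discarding the more
# detailed taxa -- as done for the vessel branching in the body-part hierarchy.

def truncate(taxonomy: dict, root: str, depth_limit: int) -> list:
    """Collect concepts from `root` down to `depth_limit` levels."""
    kept, frontier = [], [(root, 0)]
    while frontier:
        concept, depth = frontier.pop()
        kept.append(concept)
        if depth < depth_limit:
            frontier.extend((c, depth + 1) for c in taxonomy.get(concept, []))
    return kept

# Toy fragment: keep kinds of vessel, drop individual arteries.
taxonomy = {
    "body part": ["vessel"],
    "vessel": ["artery", "vein"],
    "artery": ["aorta", "carotid artery"],   # excluded at depth_limit=2
}
print(sorted(truncate(taxonomy, "body part", 2)))
# ['artery', 'body part', 'vein', 'vessel']
```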

2.4.2 The medical sources in ON8.5 Our research in medicine (ON8.5: see par. 3.) has taken into account five sources (terminological repositories): the UMLS semantic network [Humphries, 1992] (all ~170 semantic types and relations, and the "templates" defined on them), SNOMED-III [Coté, 1994] (~600 most general concepts) and GMN [Gabrieli, 1989] (~700 most general concepts) nomenclatures, ICD10 [WHO, 1994] classification (~250 most general concepts), and the CORE model developed by the GALEN project [GalenProject] (version 5g, non-ontologically oriented, all ~2000 items).

UMLS has a hierarchical structure, includes relations, and provides free-text definitions and combinations of "types" and "relations". It has a browser but does not allow the creation of new assertions. It uses the MeSH thesaurus [NLM, yearly] and other nomenclatures as its 'bottom level'.

SNOMED and GMN have some general axes (partially hierarchical, with mixed inclusion, part, and associative distinctions), do not apply relations, and are homogeneous between their top and bottom parts. ICD is hierarchical (with inclusion distinctions), has no relations, and is homogeneous between top and bottom parts. The CORE model v.5g is hierarchical, applies relations, is homogeneous between top and bottom parts, and has a terminological engine which allows one to compose concepts and relations, with some degrees of validity, into canonical forms, and which provides tools to debug and browse large models.
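The structural properties just listed can be restated schematically; such a feature table is useful when deciding how each source must be parsed. The encoding below is our own schematic restatement of the properties stated above, not part of ONIONS:

```python
# Structural features of the five sources, as described in the text:
# (hierarchical, applies relations, has a composition/terminological engine).

SOURCES = {
    "UMLS":          (True, True,  False),
    "SNOMED-III":    (True, False, False),  # partially hierarchical axes
    "GMN":           (True, False, False),  # partially hierarchical axes
    "ICD10":         (True, False, False),
    "GALEN CORE 5g": (True, True,  True),   # terminological engine
}

def with_relations() -> list:
    """Sources whose formalism applies relations, not just hierarchy."""
    return [s for s, (_, rel, _) in SOURCES.items() if rel]

print(with_relations())  # ['UMLS', 'GALEN CORE 5g']
```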

2.4.3 Phase II): Local definition of terms Once we have a relevant set of concepts for each source, we focus on the criteria of classification, as local definitions of concepts (Fig. 6), in order to create a lexical field. We have to work out an answer to the "definitional" question: what is the difference within a group of homogeneous concepts from the same source, typically between two child concepts of the same parent concept?

From a definitional viewpoint, concepts to be defined are "definienda", and defining concepts are "definientes". The problem is that sources very often have informal or poor definitions, and sometimes definitions are lacking altogether.

When definitions are lacking (when they exist only in absentia), we ask the definitional question and create a sound explicit definition, exploiting all the hints that a terminological repository can provide (axioms, frames, hierarchy, grouping, informal definitions, boolean combinations, meta-linguistic modifiers, etc.), as well as additional definitional sources (dictionaries, glossaries, encyclopaedias) and, finally, experts.

Additional sources are the T* for the lexical fields of the in praesentia sources (par. 1.1.2). For instance, ON8.5 exploited [Dorland's, 1994; Stedman, 1995; etc.] and expert physicians from partner institutions. We have also proposed a scale of explicitness [Steve&al, 1996] which is based on the availability of these hints.
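A local definition built this way can be pictured as a record pairing a definiendum with its definientes, annotated with the hints that supported it; an explicitness score then falls out of hint availability. The record layout and the crude fraction-of-hints score below are hypothetical illustrations, not the scale of [Steve&al, 1996]:

```python
# A minimal sketch (hypothetical record layout and score): a local definition
# pairs a definiendum with its definientes and records which hints supported
# it, so a rough explicitness measure can be derived from hint availability.

from dataclasses import dataclass, field

HINTS = ["axioms", "frames", "hierarchy", "grouping",
         "informal definition", "dictionary", "expert"]

@dataclass
class LocalDefinition:
    definiendum: str
    definientes: list
    hints: list = field(default_factory=list)

    def explicitness(self) -> float:
        """Crude score: fraction of the hint kinds that were available."""
        return len(set(self.hints) & set(HINTS)) / len(HINTS)

d = LocalDefinition("viral hepatitis",
                    ["hepatitis", "viral infection"],
                    hints=["hierarchy", "informal definition", "expert"])
print(round(d.explicitness(), 2))  # 0.43
```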

2.4.4 Phase III): Multi-local definition of terms: triggering theories related to distinctions made in local definitions For each source, local definitions imply ontologies; in par. 2.1 and par. 2.3 we called these local ontologies. Our purpose is to enrich local ontologies by triggering general (global) ontologies, such as [Sowa, 1995; Bateman&al, 1990; Hartmann, 1966; Varzi, Casati, 1994; Simons, 1987, and many others]. Such an enrichment is not an arbitrary choice: it is made in order to connect heterogeneous local definitions by means of an ontological theory, and thus it requires a minimal commitment: the consequence of the enrichment should essentially be the raising of definitions from local to multi-local status (Fig. 4).

This is much like filling the lexical gaps among different languages: where English has wood, Italian has legno (as matter), bosco (as an aggregate of trees), and foresta (as a wide, heterogeneous aggregate of trees). Not that English lacks the Italian ontology: it simply does not let it emerge in the lexicon of words; in fact, English is capable of paraphrasing it. In the terms of par. 1.1, the lexical fields are different, while the semantic fields would be equivalent if all in absentia concepts were taken into account: an in absentia paraphrase would be a gap-filler in the English lexical field for the Italian one.

The definition of lexical gap in ONIONS should sound something like:

<<when we compare two allied lexical fields (referrable to the same semantic field) from heterogeneous lexica, the absence in one lexical field of a concept present in the other is a lexical gap>>.

Filling gaps is a finite, decidable task; thus the stopover problem does not return in the form: "how much of a general theory should I include in the integrated model?". For an example of the need for extensive gap-filling within the ON8.5 ontology, see par. 2.5.
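The definition of lexical gap above reduces, in the simplest reading, to a set difference over two allied lexical fields. The sketch below deliberately assumes that concepts are plain identifiers and that alignment to shared concept ids across lexica has already been done, which is of course the hard part:

```python
# A minimal sketch: lexical gaps as a set difference over two allied lexical
# fields. Concept identifiers are assumed already aligned across lexica.

def lexical_gaps(field_a: set, field_b: set) -> set:
    """Concepts present in field_b but absent from field_a:
    the gaps to be filled in field_a."""
    return field_b - field_a

# The wood/legno example, after (assumed) alignment to shared concept ids:
english_field = {"wood-as-matter"}
italian_field = {"wood-as-matter", "tree-aggregate", "wide-tree-aggregate"}

print(sorted(lexical_gaps(english_field, italian_field)))
# ['tree-aggregate', 'wide-tree-aggregate']
```

Since the fields are finite sets, gap detection terminates, which is exactly why filling gaps is a finite, decidable task.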

Triggering general theories amounts to analysing the contextual framework of local definitions. An ontological engineer has to investigate a large set of heuristics with several degrees of formality: experts' knowledge, general theories, etc. (par. 1.1 and par. 2.3). These heuristics interact with the ongoing building of a top level (see the next paragraph), in order to locate appropriate theories, and appropriate chunks of them, which may allow one to fill the gaps among the local lexical fields.

An analogous procedure is followed for deciding on a top level to be superimposed on the integrated formal ontology model we want to obtain. This is subjective work, depending on the "taste" of the ontological engineer; it should therefore be proposed as a hypothetical and easily modifiable taxonomy. Our current top level for ON8.5 is conceived as a frame of the states of affairs envisageable in the medical domain (see par. 3.).

2.4.5 Phase IV): Defining an integrated formal model Building, refining, and updating a formal model is a very complex matter, hard to collapse into one phase or even a few sequential phases. Nevertheless, we remark that the main issues in this phase concern a satisfactory (i.e., semantic) account of the ontological commitment of the representation primitives, and the choice of an expressive logical language explicitly committed to those primitives. Once such issues have been settled, the subsequent implementation has only to sensibly preserve the commitment developed to this point.

As a matter of fact, we built an ontology (ON8.5) that is quite generic as far as detailed medical knowledge is concerned, with the only task of representing a general semantic field of medical knowledge through axioms while being constrained as little as possible by the adopted formal language.

The analysis allowed us to explicitly formalize important semantic fields which have heterogeneous conceptualizations in the sources described in par. 2.4.2, for example:

At present the model is formalized in order-sorted logic, and we have partially tested its translatability into ONTOLINGUA [Gruber, 1993] and into SNEPS. In par. 3 the taxonomic aspect of the top level is outlined, and in par. 4 the commitment for its MLC is presented.

The extension and revision of the existing model, in order either to include a new subdomain or to integrate a new source, are not to be understood as formal refinements but as ontological enrichments, and they require further processes not presented in the diagram in Fig. 4. On this subject, recall also the issue of basicality (par. 2.2), which is domain dependent: the extension to a sub-domain would involve at least such a revision, and potentially a revision of every concept. This bears on the modularity of knowledge, which cannot hold in general. Formal contexts could be useful tools for realizing it and avoiding such revisions.

2.5. An example of theory triggering: need for extensive gap-filling in the Process semantic field

A practical (brief) example of how local ontologies are gap-filled through general theories can be given for the semantic field of processes.

Several local domain ontologies in medicine introduce a lexical field concerning processes, for example UMLS distinguishes among:

while SNOMED distributes part of the UMLS Event semantic field among its fields: Function, Morphology, and Procedure. And so on with the other knowledge sources.

Our integration required the use of some theory of processes: one quite influential theory is derived from Aristotle [for instance, it is adopted in Quirk&al, 1985; cf. also Mourelatos, 1978].

The Aristotelian tradition distinguishes among: State, a description of a situation at a given time point, with no natural endpoint and no performing agent; Activity, a description of a situation within some indefinite time interval, with no natural endpoint but with a performing agent; and Telic Event, an activity with a natural endpoint. In other words, the theory applies the three criteria of Punctuality, Conclusiveness, and Agentivity.

Notice that an allied theory is tentatively formalized by [Dowty, 1977] in terms of a possible worlds formal semantics for the primitives {DO, BECOME, CAUSE}.

On the other hand, a further criterion seemed relevant for distinguishing between a State and a Stationary Phase within the Process semantic field: Dynamicity [Roeper, 1987].

The result of triggering these general theories is the following subsumption hierarchy, with the corresponding partial definitions expressed in an order-sorted logic [Oberschelp, 1989; Cohn, 1989] and quantification applied to n layers of sub-sorts (cf. the exclusion of individuals from our theory of meaning, par. 1.2). Partial definitions of sub-sorts imply the partial definition inherited from the parent sort:

As the table shows, the definitions are minimal as far as theoretical completeness is concerned. In fact, as proposed in par. 2.1 and par. 2.4.4, no pervasive ontology of the Process semantic field can be pursued within ONIONS, which only accounts for the integration problem in a multi-local perspective. On the other hand, one can observe that, once a congruous number of validated sources has been integrated for a domain, a fine degree of completeness may have been reached (for that domain).
