Measuring the evolution of a knowledge library:
Assessment study

Tokyo Research Laboratory, IBM Japan Ltd.
1623-14 Shimo-tsuruma, Yamato-shi
Kanagawa-ken 242-8502, Japan

This paper presents an empirical study to assess a process for the real-life deployment of a library of problem-solving methods dedicated to production-scheduling problems. First, the context of the study is elucidated on the basis of the goal-based framework for software measurement. The organization of the library is briefly explained, and the results of measuring the library's evolution during its deployment are then given. Finally, the context of this study is further refined not only to allow its continuation with clear goals, but also to make the preliminary quantitative results obtained from a real-life setting available for use in other studies of the reuse of problem-solving knowledge.


One of the primary issues in developing knowledge systems is how to improve their extensibility and productivity by using software and knowledge reuse technologies. Libraries of problem-solving methods have been proposed to facilitate the reuse of problem-solving knowledge. The ultimate goal of knowledge reuse is to demonstrate that qualified knowledge libraries facilitate the development of knowledge systems in real life, not only for the prototype and first release, but also for the further enhancement and maintenance of the systems. However, libraries of problem-solving knowledge evolve according to the clarification of implicit ontological commitments. This kind of volatility is partly due to the interaction problem, namely, the dependence of knowledge on its use [Bylander and Chandrasekaran, 1987], because reusable libraries cannot exist independently of the assumptions about how they will be (re)used in a real-life context [Clancey, 1993, Brown and Duguid, 1994].

Since the benefits of knowledge reuse cannot be properly evaluated by developing a single, prototype knowledge system, it is important to take account of questions related to sample population, reality of environments, and time range of studies. A series of Sisyphus efforts have moved from a single, prototype system for room assignment [Linster, 1993] toward a more realistic example of elevator design [Schreiber and Birmingham, 1996], and are now ready for quantitative measurement of the knowledge-engineering performance [Shadbolt, 1995]. At the same time, many groups are pursuing their own approaches to a variety of sample problems [Marques, Dallemagne, Klinker, McDermott, and Tung, 1992, Runkel, Birmingham, and Balkany, 1994, Eriksson, Shahar, Tu, Puerta, and Musen, 1995]. Although long-term studies that follow the evolution of knowledge systems in real life are extremely rare, experiences over a ten-year period have been reported in the case of XCON [Barker and O'Connor, 1989].

Every effort deals with some of the issues mentioned above, but it is almost impossible to conduct a long-term empirical study for a number of real-life tasks. Thus, the real challenge for the whole research community is to make the empirical results cumulative [Simon, 1995]. The difficulties are twofold. One is that all empirical results are open to interpretation, and the other is that even if the interpretations of results are generally agreed on, the results are invariably tied to an experimental setup [Hanks, Pollack and Cohen, 1993]. Therefore, empirical results must be reported with a careful characterization of the context of the study. Without knowing the context, readers cannot determine whether the results are comparable and, what is worse, they may reach an inadequate understanding of the results. Ways of presenting results should allow readers to more or less reconstruct the empirical study.

We have worked on a component-oriented methodology for developing knowledge systems, and on elicitation of reusable problem-solving knowledge from existing scheduling expert systems [Hori, Nakamura, Satoh, Maruyama, Hama, Honda, Takenaka, and Sekine, 1995]. On the basis of our experiences in scheduling problems, we have designed and implemented a component library for production scheduling systems, SCOOP (Scheduling COmponents fOr Production control systems) [Hori and Yoshida, 1996]. After developing two prototype systems, we made SCOOP available in July 1995 for the development of real-life production-scheduling systems. Scheduling systems developed with SCOOP are now running at IBM's hard-disk manufacturing plants in Thailand and Hungary. In developing and deploying SCOOP, we collected empirical data that may characterize the library and its reuse process.

The results presented in this article are inextricably associated with a particular problem type and the situation at hand. Therefore, we are not attempting to claim that these results can be generalized to all kinds of knowledge library. We know that our sample sizes are too small to have statistical validity. The empirical study here is regarded as an assessment study to establish baselines, ranges, and other aspects of the behaviors of a system or its environment [Cohen, 1995, p. 7]. Two complementary aspects of empirical studies are distinguished in [Cohen, 1995]: an exploratory phase for summarization and modeling, and an experimental (confirmatory) phase for testing hypotheses and predictions. At present, our work remains within the exploratory phase. From a practical point of view, however, exploratory studies in industrial settings and experimental research in laboratory settings must be linked, so that hypotheses suggested by the former can be tested and confirmed by the latter.

The objective of this article is to clarify the context of our empirical study, so that insights obtained from this study in an industrial setting can be transferred to other exploratory or experimental studies. Although our study deals with only a single project for the deployment of SCOOP, a salient feature is that it focuses on the evolution of the library in real life. In the next section, a goal-based framework for software measurement [Basili and Rombach, 1988] is introduced, and the context of our study is constructed on the basis of the framework. Section 3 describes the organization and application of SCOOP, and empirical data collected during the deployment are given in Section 4. Finally, the context of this study is further refined, not only to allow this study to be continued with clear goals, but also to make our preliminary results available for use in other studies of the reuse of problem-solving knowledge.


Measurement is the process of assigning symbols (usually numbers) to attributes of entities in the real world. There are three classes of entities in software measurement [Fenton, 1994]:

Processes are any software-related activities that take place over time.
Products are any artifacts, deliverables, or documents that result from processes.
Resources are items required by processes.

Furthermore, internal and external attributes are distinguished within each class of entity [Fenton, 1994]:
An internal attribute is one that can be measured by examining the product, process, or resource itself, and can be separated from its behavior.
An external attribute is one that can be measured only with respect to other entities in its environment.

For example, the size of program code is an internal attribute of a product, which can be measured in isolation. In contrast, productivity is an external attribute of a resource (i.e., personnel, either as individuals or groups), because it depends on the personnel's skills and the quality of products to be delivered. Maintainability is an external attribute of a product, which also depends on the skills of the maintainers and the tools available to them.
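The classification above can be written down as a small data model. The following is only an illustrative sketch in Python; the type names (EntityClass, AttributeKind, Attribute) and the helper function are assumptions introduced here, and only the three example attributes come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class EntityClass(Enum):
    PROCESS = "process"
    PRODUCT = "product"
    RESOURCE = "resource"


class AttributeKind(Enum):
    INTERNAL = "internal"  # measurable by examining the entity itself
    EXTERNAL = "external"  # measurable only relative to the environment


@dataclass(frozen=True)
class Attribute:
    name: str
    entity: EntityClass
    kind: AttributeKind


# The three examples mentioned in the text.
EXAMPLES = [
    Attribute("code size", EntityClass.PRODUCT, AttributeKind.INTERNAL),
    Attribute("productivity", EntityClass.RESOURCE, AttributeKind.EXTERNAL),
    Attribute("maintainability", EntityClass.PRODUCT, AttributeKind.EXTERNAL),
]


def measurable_in_isolation(attr: Attribute) -> bool:
    """Only internal attributes can be measured without their environment."""
    return attr.kind is AttributeKind.INTERNAL
```

Filtering EXAMPLES with measurable_in_isolation leaves only code size, which matches the contrast drawn in the paragraph above.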

It must be noted here that we cannot measure the external attributes directly without reference to internal attributes. It is important to keep this distinction in mind when conducting an empirical study with measurements. Internal attributes may suggest what we are likely to find as external attributes. However, the relationship between internal attribute values and the resulting external attribute values has rarely been established, because it is sometimes difficult to perform the controlled experiments necessary to confirm the relationship [Fenton and Pfleeger, 1997, p. 80]. In addition, it has also been pointed out that current research on software measurement should try to find good measures, so that indirect measures can be derived from those original measures and predictive theories can then be constructed once the cause-and-effect relationships have been identified [Henderson-Sellers, 1996, p. 66].

Our goal is to create a qualified library of problem-solving knowledge. However, software is created by development, not by production as in the manufacture of consumer products, because software artifacts may not necessarily be reproduced in the same form. This means that we do not have many data collection points from which to construct reasonably accurate statistical quality models [Basili, 1996]. This situation parallels the difficulty in coping with the interaction problem mentioned earlier. The dependence of knowledge on its use is manifested when we consider issues related to the external attributes. Therefore, it is useful to think about internal attributes of knowledge, whether as product, process, or resource, before we investigate the interaction problem.

A typical problem in the application of software measurement is an absence of any coherent goals. Metrics are used for measuring some aspects of quality factors such as efficiency, usability, and maintainability. Interpretations of a quality factor usually depend on the perspectives of the observers, who may include software engineers, domain experts, and end users. Furthermore, trade-offs may exist between such perspectives.

In the literature of software engineering, a methodology called the goal/question/metrics (GQM) paradigm [Basili and Rombach, 1988] was proposed to guide the integration of measurement into the software development process, and is now widely used for software measurement. This methodology is characterized by two measurement principles. First, measurement must be top-down, clarifying what measures should be collected. Second, the collected data must be interpreted in the context of a goal and objective. The GQM approach provides a framework involving three steps:

  1. Set up analysis goals of interest.
  2. Refine each goal into questions that must be answered to determine whether the goal is being met.
  3. Decide what must be measured in order to answer the questions adequately.

In terms of the goal definition templates provided in [Basili and Rombach, 1988], a major goal of our study can be defined as follows. Italicized words below are selected from terms in the templates.
Goal definition

The above goal definition is still blurred as a result of the use of vague words such as ``characterize'' and ``understand.'' However, it can be clarified during further iteration of the measurement process. A certain amount of backtracking is an inevitable feature of top-down approaches.

The goal definition is then refined into one or more questions, and finally metrics are derived so that the questions can be answered to meet the top-level goal. However, it is generally accepted that progressing from goal to question is the most difficult part of the GQM approach, which provides little guidance for identifying useful questions [Shepperd, 1995, p. 146].

Figure 1: GQM hierarchy for the deployment of SCOOP

When we started to design SCOOP at the beginning of 1994, we had some questions about the organization of reusable libraries of problem-solving methods. These questions resulted not only from our experiences in applying a component-oriented approach to scheduling problems [Hori, Nakamura, and Hama, 1994], but also from informal interviews with software engineers who have worked on a commercial software package for production-scheduling systems.

The first point is that knowledge systems cannot live by problem-solving methods alone in real life. Real-life knowledge systems consist of high-level modules or subsystems not only for problem-solving methods, but also for an application domain model and user interfaces. For instance, the required characteristics of scheduling systems are diverse, and are not necessarily limited to the features of scheduling algorithms [Liebowitz and Potter, 1993]. Practical systems in real environments must meet certain requirements as regards interactivity with end users, in addition to requirements as regards interoperability with peripheral information systems [Mitev, 1994].

This observation leads to a high-level module structure of knowledge systems, which consist of three subsystems: a domain model, problem-solving methods, and graphical user interfaces. In particular, these subsystems should be designed so that mutual dependencies between them are eliminated. Otherwise, mutually dependent subsystems require monolithic treatment that hinders reuse of the library.

The second point is that successful knowledge systems evolve continually in their context of use. Thus it is also necessary for a library of problem-solving methods to be enhanced, when a novel function is realized in a knowledge system and can be generalized for inclusion in the library. A pitfall then is that over-commitment to an individual problem may spoil the reusability of the library. This might not be so serious initially, even if some project-specific functions are added to the library. But the overhead will become apparent as the whole library is upgraded consistently, with the addition of bells and whistles that may not be crucial in the intended scope of reuse.

The above observations raise two questions as follows under the top-level goal. The first question is to what extent the high-level module structure is altered throughout the project cycles. Once this question is posed, it is necessary to measure the code size of each high-level module. In this study, we focus particularly on the project-independent portion of a knowledge system, which will be reused as a library in different projects. Therefore, the size should be measured by determining whether a portion of the module depends on a particular project at hand.

The second question is whether productivity is significantly affected by dependence of a high-level module in each project cycle. This question refers to productivity, which is an external attribute that cannot be measured directly. Measures related to this question will be the increase in code size during a project cycle and the person-hours spent on the process. As in the case of the measures derived from the first question, the code size and person-hours here should also be measured by taking account of the dependence on the project at hand.

A summary of the GQM hierarchy in our study is given in Figure 1, where two of the measures (M1 and M2) are internal product attributes, while the other two (M3 and M4) are external process attributes. Although this hierarchy was constructed retrospectively, the GQM approach provides a framework for structuring and documenting measurement work.
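A goal/question/metric hierarchy of this kind can be recorded as a small tree. The sketch below is written in Python purely for illustration. The class names are assumptions, the goal and question wording is paraphrased from the text, and the split into four measures is one reading of the two questions rather than a transcription of Figure 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    text: str
    metrics: List[str] = field(default_factory=list)


@dataclass
class Goal:
    text: str
    questions: List[Question] = field(default_factory=list)

    def metrics(self) -> List[str]:
        """Flatten the hierarchy into the list of measures to collect."""
        return [m for q in self.questions for m in q.metrics]


# Wording paraphrased from the text; not a transcription of Figure 1.
scoop_goal = Goal(
    text="Characterize and understand the deployment of SCOOP",
    questions=[
        Question(
            "To what extent is the high-level module structure altered "
            "throughout the project cycles?",
            ["code size of each subsystem, project-independent portion",
             "code size of each subsystem, project-specific portion"],
        ),
        Question(
            "Is productivity significantly affected by the project "
            "dependence of a high-level module in each project cycle?",
            ["increase in code size during a project cycle",
             "person-hours spent on coding"],
        ),
    ],
)
```

Keeping the measures attached to their questions makes the top-down principle explicit: a measure with no parent question has no stated reason to be collected.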


SCOOP is implemented in an object-oriented language (C++) and currently includes about 40 classes. SCOOP consists of three subsystems: a schedule model, a scheduling engine, and a graphical user interface (Figure 2). The schedule model provides core concepts for the manufacturing domain such as product, lot, and resource. The graphical user interface subsystem consists of facilities for visualizing and modifying the internal status of a schedule model. The scheduling engine provides a hierarchy of problem-solving methods. The scheduling engine subsystem is explained later in this section, while further details of the other subsystems are found in [Hori and Yoshida, 1997].

Figure 2: Overview of SCOOP

3.1 Scheduling methods

The scheduling engine subsystem comprises two scheduling methods: an assignment method and a dispatching method. As shown in Figure 3, every lot (ScLot) is indirectly associated with operations (ScUnit) via a production sequence (ScJob). This whole-part relation between a lot and units articulates the roles of the two scheduling methods. That is to say, the dispatching method, which deals with lot priorities, determines the global manufacturing status. In contrast, the assignment method is concerned with the status on a more detailed level, since this method deals with the starting and finishing times of each operation in relation to the manufacturing equipment (ScRsc).

Figure 3: Scheduling engine subsystem

The assignment method (EgAssign) assigns units (ScUnit) to resources (ScRsc), fixing the starting and finishing time of each unit. EgAssign provides primitive inferences common to two assignment methods: a forward and a backward assignment method. The former tries to assign each unit to an appropriate resource as early as possible, up to the earliest starting time, while the latter does the same as late as possible, down to the due date.

The forward and backward assignment methods are fully operationalized in EgForwardAssign and EgBackwardAssign respectively, which are defined as subclasses of EgAssign. Scheduling methods can be customized by overriding the operationalized classes for assignment. For example, the loads of alternative resources can be balanced by redefining an inference for resource selection in a subclass of EgForwardAssign and/or EgBackwardAssign.
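The customization pattern described above can be sketched as follows. SCOOP itself is written in C++, and its actual interfaces are not given here, so this Python sketch is only an analogy: the class names ScRsc, ScUnit, EgAssign, EgForwardAssign, and EgBackwardAssign are taken from the text, while every method name, the integer time model, and the BalancedForwardAssign subclass are assumptions. The sketch also ignores window violations (a forward assignment may finish after the due date).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ScRsc:
    """Resource with a list of booked (start, finish) intervals."""
    name: str
    booked: List[Tuple[int, int]] = field(default_factory=list)

    def is_free(self, start: int, finish: int) -> bool:
        return all(finish <= s or start >= f for s, f in self.booked)

    def load(self) -> int:
        return sum(f - s for s, f in self.booked)


@dataclass
class ScUnit:
    """Operation with a duration and an allowed time window."""
    name: str
    duration: int
    earliest_start: int
    due: int
    start: Optional[int] = None
    finish: Optional[int] = None


class EgAssign:
    """Primitive inferences shared by both assignment directions."""

    def select_resource(self, unit: ScUnit, candidates: List[ScRsc]) -> ScRsc:
        return candidates[0]  # default: take the first alternative

    def find_start(self, unit: ScUnit, rsc: ScRsc) -> int:
        raise NotImplementedError

    def assign(self, unit: ScUnit, candidates: List[ScRsc]) -> ScRsc:
        rsc = self.select_resource(unit, candidates)
        unit.start = self.find_start(unit, rsc)
        unit.finish = unit.start + unit.duration
        rsc.booked.append((unit.start, unit.finish))
        return rsc


class EgForwardAssign(EgAssign):
    def find_start(self, unit: ScUnit, rsc: ScRsc) -> int:
        t = unit.earliest_start  # as early as possible
        while not rsc.is_free(t, t + unit.duration):
            t += 1
        return t


class EgBackwardAssign(EgAssign):
    def find_start(self, unit: ScUnit, rsc: ScRsc) -> int:
        t = unit.due - unit.duration  # as late as possible
        while not rsc.is_free(t, t + unit.duration):
            t -= 1
        return t


class BalancedForwardAssign(EgForwardAssign):
    """Hypothetical customization: pick the least-loaded alternative."""

    def select_resource(self, unit: ScUnit, candidates: List[ScRsc]) -> ScRsc:
        return min(candidates, key=lambda r: r.load())
```

Only the resource-selection inference is overridden in the subclass; the time-fixing logic is inherited unchanged, which is the kind of localized redefinition the paragraph describes.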

The dispatching method (EgDispatch) determines the priority of each lot (ScLot) without regard to actual assignments between units and resources. The most important inference for dispatching is to compare two lots and determine the one with higher priority. Since the comparison employs attributes defined in the schedule model or its project-specific extension, the comparing inference is defined in the schedule model as a member function of ScLot or its project-specific subclass, and is not given within EgDispatch or any of its subclasses in the scheduling engine subsystem.
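The placement of the comparing inference can be illustrated with a short sketch. Again this is Python rather than the library's C++, and only the names ScLot and EgDispatch come from the text. The method names, the due-date rule, the HddLot subclass, and its urgent attribute are hypothetical.

```python
from dataclasses import dataclass
from functools import cmp_to_key
from typing import List


@dataclass
class ScLot:
    name: str
    due: int

    def precedes(self, other: "ScLot") -> bool:
        """Comparing inference (assumed default: earlier due date wins)."""
        return self.due < other.due


@dataclass
class HddLot(ScLot):
    """Hypothetical project-specific subclass with an extra attribute."""
    urgent: bool = False

    def precedes(self, other: "ScLot") -> bool:
        other_urgent = getattr(other, "urgent", False)
        if self.urgent != other_urgent:
            return self.urgent
        return super().precedes(other)


class EgDispatch:
    """Orders lots using only the comparison supplied by the model."""

    def prioritize(self, lots: List[ScLot]) -> List[ScLot]:
        def compare(a: ScLot, b: ScLot) -> int:
            if a.precedes(b):
                return -1
            if b.precedes(a):
                return 1
            return 0

        return sorted(lots, key=cmp_to_key(compare))
```

Because EgDispatch never inspects lot attributes itself, a project can change the priority rule by subclassing the lot in the schedule model, leaving the scheduling engine untouched.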

3.2 Application development in real life

After the development of two prototype systems with SCOOP, a real-life scheduling system was developed for use in a production line for low-end hard disk drives (2.5 and 3.5-inch HDDs). The plant floor consists of three areas, for assembly, testing, and packing. The difficulty in the production control stems from the uncertainty in the test processes, which is primarily caused by the fluctuating yield, the high-volume outputs, and the variety of product types. The plant is thus continually faced with the tasks of reducing the manufacturing cycle time and achieving punctual delivery to customers. In the spring of 1994, the operation system department started reengineering the manufacturing processes by developing three software systems: a scheduling system for the test area, a shipment control system for the packing area, and a system for tracking work in process (WIP) inventory.

Figure 4: Evolution of a real-life scheduling system

The scheduling system is required to be flexible in controlling production in response to new demands and variable manufacturing conditions, and extensible so that it can be installed on the plant floor incrementally. However, the plant is already in operation, producing high-volume daily output. Therefore, it is critical to introduce the new production operations incrementally, so that floor workers are not forced to make drastic changes in their daily operations. SCOOP's extensibility has allowed it to be deployed incrementally in the plant in Thailand since August 1995, and in another manufacturing plant in Hungary since July 1996. The scheduling system has evolved in the manufacturing context, incorporating additional production data from other information systems in the plant. Furthermore, collaboration is being tightened not only with the manufacturing department, which is responsible for resource utilization, but also with the production department, which is responsible for meeting customers' deadlines.

The system was extended from a simple configuration without any scheduling method to a configuration with two types of method. From the perspective of the high-level system structure, the configuration of the scheduling system has grown according to the three project cycles, as shown in Figure 4.

The configuration shown in Figure 4(1) realizes a monitoring system that visualizes the actual manufacturing status received from the WIP-tracking system. The role of this progress monitor is to check for and regulate incorrect operations by the floor workers, who manually transport vehicles that carry products from one process to the next. A forecasting system is then realized by integrating a forward assignment method (Eng-1 in Figure 4(2)), in order to forecast the completion time of products in the test area. The forecasting system is invoked automatically every twenty minutes, and the results obtained are reported online to the workers in the packing area.

A release control system is realized by extending the forecasting system with a dispatching method (Eng-2 in Figure 4(3)). The dispatching method dynamically prioritizes the manufacturing lots to be released next to the test area, so that the requests from the production department can be answered on the fly. In contrast to conventional scheduling systems, which create a static schedule for the next day or week, this scheduling system responds iteratively, within a shorter cycle time, to up-to-date manufacturing progress. This avoids relying on inaccurate assumptions made in advance, and allows punctual delivery by monitoring the real-time manufacturing performance and packing status.


This section describes the empirical results obtained during the development of the scheduling system for HDD manufacturing. In order to collect empirical data, a complete set of source code is saved either when a major enhancement is made without any application development or when an application system is released. In addition, a software developer fills in a form listing the coding hours and the type of changes made, which can be creation, modification, or deletion of code. The models for measuring the size and effort are as follows:

The size of a program is calculated as the number of source lines of code (LOC), which is defined to include all code, header lines, comments, and blank lines.
The effort is recorded as the number of person-hours spent working at computer terminals. It thus includes hours spent coding and debugging, but does not include hours spent on requirement analysis and meetings about high-level system design.
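The size model above is simple enough to state as code. The following Python sketch counts every physical line, including comments and blank lines, as the definition requires. The function name is an assumption; the text does not say what tool was actually used for counting.

```python
def count_loc(source: str) -> int:
    """Count physical source lines: code, headers, comments, and blanks."""
    return len(source.splitlines())


# A three-line C++ fragment: one comment, one blank line, one statement.
sample = "// lot priority\n\nint priority = 0;\n"
loc = count_loc(sample)
```

Counting physical lines rather than statements keeps the measure easy to reproduce, at the cost of making it sensitive to formatting and commenting style.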

The scheduling system was developed by a single software developer up to the end of the third cycle. At the beginning of the first cycle, the developer had 9 years of programming experience in Lisp and 3 years of experience in C++. SCOOP was also designed and developed by the same person. Thus, the case presented in this paper is unusual in the sense that the library is reused by its own developer. However, the advantage here is that the collected data are placed in the same context, allowing proper comparison of data collected from different project cycles.

4.1 Code size growth

Figure 5 shows the code size of a subsystem as a percentage of the code sizes of all subsystems. In particular, Figure 5(a) shows the proportion in the project-independent portion of the system, which corresponds to the library provided as SCOOP. Figure 5(b), on the other hand, shows the proportion in the project-specific portion of the system, which will not be reused in other projects.
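The proportions plotted in Figure 5 can be derived from per-module size records. The Python sketch below shows one way to do this. The record layout and function name are assumptions, and the LOC values are made-up placeholders chosen for easy arithmetic, not data from this study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (subsystem, is_project_specific, lines of code)
Record = Tuple[str, bool, int]


def subsystem_shares(records: Iterable[Record],
                     project_specific: bool) -> Dict[str, float]:
    """Percentage of code per subsystem within one portion of the system."""
    totals: Dict[str, int] = defaultdict(int)
    for subsystem, specific, loc in records:
        if specific == project_specific:
            totals[subsystem] += loc
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: 100.0 * loc / grand_total for name, loc in totals.items()}


# Placeholder numbers for illustration only (not measured values).
records = [
    ("schedule model", False, 600),
    ("scheduling engine", False, 300),
    ("user interface", False, 100),
    ("schedule model", True, 50),
    ("user interface", True, 50),
]
library_shares = subsystem_shares(records, project_specific=False)
project_shares = subsystem_shares(records, project_specific=True)
```

Computing the two portions separately mirrors the split between Figure 5(a) and Figure 5(b): each percentage is relative to its own portion, not to the whole system.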

Figure 5: Proportion of code in each subsystem

Figure 5(a) tells us that the relative proportions of the three subsystems were rather stable, except for the growth of the scheduling engine subsystem in the third project cycle. This growth resulted from the addition of a dispatching method, or EgDispatch, and its two subclasses (Figure 3). On the other hand, in the project-specific portion, the subsystem ratio and the total size changed substantially through all the project cycles. This can be further investigated by taking account of the actual code size growth and person-hours spent on the coding (Table 1).

Table 1: Increases in LOC and person-hours between project cycles

After the end of the first project cycle, the project-specific proportion of C++ source code constituted 29% of the entire scheduling system. The proportion further increased to 32% during the second cycle, and reached 48% at the end of the third cycle. The tough question here is to what extent the reused/reusable part of the evolving system should be kept, taking into account not only the maintainability and extensibility of the system at hand, but also the reusability of the library for the development of prospective systems. In the present case, it is possible to further investigate the code size growth in each subsystem.

The code size of the scheduling engine subsystem grew by 37% (599 LOC) after the second cycle (Table 1(b)). This extension, however, has not required any change in the design of the other subsystems. In the scheduling engine subsystem of the final system, the total amount of project-specific code is 300 lines, of which 92 LOC are for the forecasting function and 208 LOC are for the release control function. The assignment method (Eng-1 in Figure 4(2)) was extended to take account of the test failure ratio, by incorporating yield values obtained from the failure analysis system. The dispatching method (Eng-2 in Figure 4(3)) was extended to deal with the project-specific attributes, which are used to prioritize available lots to be released in the test area. Since the code for the project-specific scheduling methods constitutes only 12% of the code for all the scheduling methods in the final system, it is possible to say that the development workload is reduced by reusing methods in the scheduling engine subsystem.

The requirements of the graphical user interface are diverse, especially when customization for end users is needed. Although the user interface subsystem was developed by using an existing library of conventional visual components such as push buttons, list boxes, and pull-down menus, it is almost impossible to provide a comprehensive set of components tailored to a variety of end-users' requirements. The project-independent portion of the user interface subsystem has grown by 25% (685 LOC) in code size through all the project cycles (Table 1(c)). This subsystem is not stabilized yet, and is likely to be extended by further deployment of SCOOP.

In contrast, the code size of the schedule model has grown by only 3% (87 LOC) through all the project cycles (Table 1(a)). Therefore, it is possible to say that the schedule model is rather stable as compared with the scheduling engine and the user interface subsystems. The schedule model, however, includes a large extension in the project-specific portion, which has grown radically due to the enhancement of the system on the plant floor. The main part of the project-specific extension is for incorporating external data from other information systems such as the shipment control system and failure analysis system (Figure 4). The other extension is to elaborate the concepts in the schedule model, so that project-specific attributes and their access procedures can be provided to deal with external data.

4.2 Effort for the development and deployment

Since we have collected data on the person-hours spent on coding, as well as the code size growth, it is possible to investigate issues related to productivity. Productivity is generally defined as the rate of output per unit of input. Intuitively, the notion of productivity includes the contrast between the inputs and outputs of a process. However, measuring the productivity of a software engineer or a software engineering process is more elusive. We must carefully define what constitutes a set of inputs, how a process affects the inputs, and what a suitable unit of output is [Shepperd, 1995, Fenton and Pfleeger, 1997].

The most common productivity measure is defined as size divided by effort. In the case at hand, we measured the size in lines of code, and the effort in person-hours spent on coding. Therefore, it is possible to calculate the productivity of the software developer by dividing the LOC by the person-hours. Since the number of data points in Table 2 is rather small, it is hard to generalize the results at present. But it can be observed from the table that the LOC/hours ratios in the project-independent portion of the subsystems for the schedule model and the graphical user interface are relatively low. One possible reason for this result is that the modification of the project-independent, or reusable, portion of knowledge systems includes frequent insertion and deletion of small fragments of code, and that such an exploratory process rarely results in a substantial growth in code size. Another reason is that such modifications require more time to test before release, not only for the system at hand but also for other prototype systems used as test cases.

As for the scheduling engine subsystem, a dispatching method is added to the subsystem during the third project cycle, and the fundamental problem-solving steps are completely different from those for the assignment method. Therefore, a possible interpretation is that the productivity was relatively low (13.0) in comparison with other modifications for the scheduling engine subsystem.

The person-hours spent on realizing the project-specific portion of the scheduling methods accounted for only about 10%, or 20 hours (4 hours for the forecasting function and 16 hours for the release control function) of the total coding hours (209 hours) for the project-specific portion throughout the three project cycles. The scheduling methods in the library thus reduced the workload involved in developing the scheduling system. However, another 20 hours were spent on modifying the project-independent portion of the scheduling engine subsystem in parallel with the SCOOP-based development of an application system.
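The size-over-effort measure can be applied to the project-specific scheduling code, for which the text quotes both sizes (92 and 208 LOC) and coding hours (4 and 16 hours). The Python sketch below does so under one loud assumption: the final size of each function is treated as the LOC produced, which ignores any lines that were deleted or rewritten along the way. The function name is also an assumption, and the results are not claimed to be entries of Table 2.

```python
def productivity(loc: int, person_hours: float) -> float:
    """Size divided by effort, in LOC per person-hour."""
    if person_hours <= 0:
        raise ValueError("person_hours must be positive")
    return loc / person_hours


# Project-specific scheduling code, using figures quoted in the text.
forecasting = productivity(92, 4)        # forecasting function
release_control = productivity(208, 16)  # release control function
```

The ratio is only as meaningful as its inputs: the hours cover coding and debugging at the terminal, not requirement analysis or design meetings.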

It must be noted that the scheduling system reported here was developed by a single software developer, who also designed and realized SCOOP. Needless to say, it will take more time to maintain SCOOP if the application programmer is different from the library designer, because it is usually very difficult for the application programmer to refine the library due to the unavailability of the relevant documents and specifications.

Apart from the scheduling engine subsystem, the rest of the coding hours were spent on specializing the schedule model (136 hours) and developing the graphical user interface (53 hours). Specialization of the schedule model took up about 65% of the total coding hours for the project-specific portion (209 hours). The specialization consisted mainly of incorporating information from external data servers. This exactly reflects the observation by van Heijst et al. that the world is filled with knowledge bases and databases which are not developed with that purpose in mind [van Heijst, Schreiber, and Wielinga, 1997, p. 287].

Table 2: Ratios of LOC/hours through all the project cycles

Moderate stability is an essential feature of reusable knowledge, but empirical studies are needed to draw up practical guidelines for designing reusable libraries. In our experience, 30% (89/298) of the total coding hours were spent on modifying the library. Although an in-depth discussion is beyond the scope of this article, it is probable that modification of a reusable library will occur in parallel with development of a knowledge system. Furthermore, such modification could happen repeatedly in the maintenance of the system.
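The effort figures quoted in this section are mutually consistent, as the following arithmetic check shows. It is written in Python for convenience. All numbers are taken from the text, and the variable names are assumptions.

```python
# Coding hours for the project-specific portion, by subsystem.
project_specific_hours = {
    "scheduling engine": 20,   # 4 h forecasting + 16 h release control
    "schedule model": 136,
    "graphical user interface": 53,
}
library_hours = 89             # hours spent modifying the library

specific_total = sum(project_specific_hours.values())
total_hours = specific_total + library_hours


def percent(part: float, whole: float) -> int:
    """Share of `part` in `whole`, rounded to the nearest percent."""
    return round(100 * part / whole)


engine_share = percent(project_specific_hours["scheduling engine"], specific_total)
model_share = percent(project_specific_hours["schedule model"], specific_total)
library_share = percent(library_hours, total_hours)
```

The project-specific hours sum to 209 and, together with the 89 library hours, give the total of 298. The shares round to about 10%, 65%, and 30% respectively, matching the percentages stated in the text.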

Finally, we must be very careful in dealing with the productivity measure used here, because it takes account only of lines of code divided by coding hours, without regard to the quality or creativity of the development process. It is as if we were to measure the productivity of poets by computing the number of words or lines they produced and dividing by the number of hours they spent writing [Fenton and Pfleeger, 1997, pp. 409-410].


Research efforts in knowledge engineering have been more or less directed by the knowledge-level hypothesis, which advocates describing the behavior of a rational agent independently of any particular implementational formalism [Newell, 1982]. As a result, a variety of knowledge-level models have been proposed in the research community. This demonstrates that the constituents of problem-solving knowledge can be common to a variety of knowledge systems at the conceptual level, regardless of their implementational details. However, it does not necessarily mean that such knowledge can actually be reused with operational extension in real life. An assumption shared by these modeling approaches is the notion of structure-preserving design [Schreiber, 1993], in which the information content and structure of the knowledge model are preserved in the final artifact. In terms of software measurement, this is related to structure measures rather than code measures. Structure measures view a product as a component of a larger system and focus on the interconnection between software components, while code measures are not concerned with interconnection.

An investigation of search-space tractability from a graph-theoretic perspective has pointed out that a qualitative conceptual model may fail to retain desirable properties such as testability and maintainability [Menzies, Cohen, and Waugh, 1998]. Besides the search-space structure, however, there are other kinds of conceptual structure that can be preserved in an implementational structure without serious impediments. Structuredness is an issue to be further elaborated at different granularities of constructs, such as the levels of subsystems, individual concepts, and primitive operations. We are going to further investigate the structuredness of SCOOP by using coupling and cohesion measures [Hori, 1998], which are defined on the basis of a model of modular systems [Briand, 1996] and the measurement viewpoints advocated in [Chidamber and Kemerer, 1994]. These measures are to be put under the first question in Figure 1.
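To give a concrete impression of what a coupling measure computes, the following Python fragment sketches a simplified count in the spirit of the coupling-between-objects viewpoint of [Chidamber and Kemerer, 1994]. It is not the measure defined in [Hori, 1998]. The function name, the symmetric treatment of coupling, and the dependency data are all assumptions made for illustration, and the classes A to D are hypothetical rather than taken from SCOOP.

```python
from collections import defaultdict


def coupling_counts(uses):
    """Count, for each class, the distinct other classes it is coupled to.

    `uses` maps a class name to the set of classes whose methods or
    attributes it uses. Coupling is treated as symmetric (an assumption):
    A is coupled to B if A uses B or B uses A.
    """
    coupled = defaultdict(set)
    for cls, targets in uses.items():
        coupled[cls]  # make sure every class appears, even if it is uncoupled
        for other in targets:
            if other != cls:  # a class using itself is not counted as coupling
                coupled[cls].add(other)
                coupled[other].add(cls)
    return {cls: len(others) for cls, others in coupled.items()}


# Hypothetical dependency data for illustration; not taken from SCOOP.
example = {"A": {"B", "C"}, "B": {"C"}, "C": set(), "D": set()}
counts = coupling_counts(example)
print(counts)
```

Because such a count depends only on which components refer to which, it can be taken from design information before any code is written, which is the property of structure measures emphasized above.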

The design process can be subdivided into two design steps: architectural or high-level design, which involves identifying software components and their interconnections, and algorithmic design, which involves identifying the control flow and data structures within architectural components [Rombach, 1990]. One limitation of design measures is that they cannot capture the possible dynamic behaviors of a system. Their potential benefit, however, is that they can be measured independently of any particular implementation, and in particular before coding starts. In order to characterize a library of problem-solving knowledge, it is necessary to consider an additional question in the GQM hierarchy; namely, whether architectural design information has more influence on the applicability of a knowledge library than algorithmic design information.


Barker and O'Connor, 1989
Barker, V. E. and O'Connor, D. E. (1989). Expert systems for configuration at Digital: XCON and beyond. Communications of the ACM, 32(3), 298-318.
Basili and Rombach, 1988
Basili, V. R. and Rombach, H. D. (1988). The TAME project: Towards improvement-oriented software environments. IEEE Trans. on Software Engineering, 14(6), 758-773.
Basili, 1996
Basili, V. R. (1996). The role of experimentation in software engineering: Past, current, and future. Proceedings of the 18th International Conference on Software Engineering, pp. 442-449, Berlin, Germany.
Briand, 1996
Briand, L. C., Morasca, S., and Basili, V. R. (1996). Property-based software engineering measurement. IEEE Trans. on Software Engineering, 22(1), 68-86.
Brown and Duguid, 1994
Brown, J. S. and Duguid, P. (1994). Borderline issues: Social and material aspects of design. Human-Computer Interaction, 9, 3-36.
Bylander and Chandrasekaran, 1987
Bylander, T. and Chandrasekaran, B. (1987). Generic tasks for knowledge-based reasoning: the "right" level of abstraction for knowledge acquisition. International Journal of Man-Machine Studies, 26, 231-243.
Chidamber and Kemerer, 1994
Chidamber, S. R., and Kemerer, C. F. (1994). A metrics suite for object oriented design. IEEE Trans. on Software Engineering, 20(6), 476-493.
Clancey, 1993
Clancey, W. J. (1993). The knowledge level reinterpreted: Modeling socio-technical systems. International Journal of Intelligent Systems, 8, 33-49.
Cohen, 1995
Cohen, P. R. (1995). Empirical Methods for Artificial Intelligence. MIT Press.
Eriksson, Shahar, Tu, Puerta, and Musen, 1995
Eriksson, H., Shahar, Y., Tu, S. W., Puerta, A. R. and Musen, M. A. (1995). Task modeling with reusable problem-solving methods. Artificial Intelligence, 79(2), 293-326.
Fenton, 1994
Fenton, N. (1994). Software measurement: A necessary scientific basis. IEEE Trans. on Software Engineering, 20(3), 199-206.
Fenton and Pfleeger, 1997
Fenton, N. E. and Pfleeger, S. L. (1997). Software Metrics: A Rigorous & Practical Approach, Second Edition. International Thomson Computer Press.
Hanks, Pollack and Cohen, 1993
Hanks, S., Pollack, M. and Cohen, P. (1993). Benchmarks, test beds, controlled experimentation and the design of agent architectures. AI Magazine, 14(4), 17-42.
Henderson-Sellers, 1996
Henderson-Sellers, B. (1996). Object-Oriented Metrics: Measures of Complexity. Prentice Hall, Englewood Cliffs, NJ.
Hori, Nakamura, and Hama, 1994
Hori, M., Nakamura, Y. and Hama, T. (1994). Configuring problem-solving methods: a CAKE perspective. Knowledge Acquisition, 6(4), 461-488.
Hori, Nakamura, Satoh, Maruyama, Hama, Honda, Takenaka, and Sekine, 1995
Hori, M., Nakamura, Y., Satoh, H., Maruyama, K., Hama, T., Honda, S., Takenaka, T. and Sekine, F. (1995). Knowledge-level analysis for eliciting composable scheduling knowledge. Artificial Intelligence in Engineering, 9(4), 253-264.
Hori and Yoshida, 1996
Hori, M. and Yoshida, T. (1996). SCOOP: An object-oriented framework for production scheduling systems. Proceedings of the IFIP International Conference on Advances in Production Management Systems, pp. 389-394, Kyoto, Japan.
Hori and Yoshida, 1997
Hori, M. and Yoshida, T. (1997). Domain-oriented library of scheduling methods: Design principle and real-life application. IJCAI-97 Workshop Proceedings of Problem-solving Methods for Knowledge-based Systems, pp. 59-74, Nagoya, Japan.
Hori, 1998
Hori, M. (1998). Ontology assessment by coupling and cohesion measures. Japanese Society for Artificial Intelligence SIG-KBS-9703, in Japanese (to appear).
Liebowitz and Potter, 1993
Liebowitz, J. and Potter, W. (1993). Scheduling objectives, requirements, resources, constraints, and processes: implications for a generic expert scheduling system architecture and toolkit. Expert Systems With Applications, 9(3), 423-432.
Linster, 1993
Linster, M. (1993). A review of Sisyphus 91 & 92: Models of problem-solving knowledge. In N. Aussenac, G. Boy, B. Gaines, M. Linster, J.-G. Ganascia, and Y. Kodratoff (Eds.), Knowledge Acquisition for Knowledge-Based Systems, pp. 159-182, Lecture Notes in Artificial Intelligence 723, Berlin: Springer-Verlag.
Marques, Dallemagne, Klinker, McDermott, and Tung, 1992
Marques, D., Dallemagne, G., Klinker, G., McDermott, J., and Tung, D. (1992). Easy programming: Empowering people to build their own applications. IEEE Expert, 7(3), 16-29.
Menzies, Cohen, and Waugh, 1998
Menzies, T., Cohen, R. F., and Waugh, S. (1998). Evaluating conceptual modeling languages. Proceedings of the 11th Workshop on Knowledge Acquisition, Modeling and Management (KAW '98), Banff, Canada (to appear).
Mitev, 1994
Mitev, N. N. (1994). The business failure of knowledge-based systems: linking knowledge-based systems and information systems methodologies for strategic planning. Journal of Information Technology, 9, 173-184.
Newell, 1982
Newell, A. (1982). The knowledge level. Artificial Intelligence, 18, 87-127.
Runkel, Birmingham, and Balkany, 1994
Runkel, J. T., Birmingham, W. P., and Balkany, A. (1994). Separation of knowledge: A key to reusability. Proceedings of the 8th Banff Knowledge Acquisition for Knowledge-based Systems, pp. 36.1-30.21, Banff, Canada.
Rombach, 1990
Rombach, H. D. (1990). Design measurement: Some lessons learned. IEEE Software, 7(2), 17-25.
Shadbolt, 1995
Shadbolt, N. (1995). Sisyphus III proposal.
Shepperd, 1995
Shepperd, M. (1995). Foundations of Software Measurement. Prentice Hall International, UK.
Schreiber, 1993
Schreiber, A. T. (1993). Operationalizing models of expertise. In Schreiber, A. T., Wielinga, B. J. and Breuker, J. A. (Eds.), KADS: A Principled Approach to Knowledge-Based Systems Development, pp. 119-149, Academic Press.
Schreiber and Birmingham, 1996
Schreiber, A. T. and Birmingham, W. P. (1996). The Sisyphus-VT initiative. International Journal of Human-Computer Studies, 44, 275-280.
Simon, 1995
Simon, H. (1995). Artificial intelligence: An empirical science. Artificial Intelligence, 77, 95-127.
van Heijst, Schreiber, and Wielinga, 1997
van Heijst, G., Schreiber, A. T., and Wielinga, B. J. (1997). Using explicit ontologies in KBS development. International Journal of Human-Computer Studies, 46(2/3), 181-292.
Wirfs-Brock, Wilkerson, and Wiener, 1990
Wirfs-Brock, R., Wilkerson, B., and Wiener, L. (1990). Designing Object-Oriented Software, Prentice Hall, Englewood Cliffs, NJ.

This is a term used in the literature of object-oriented design. A subsystem means a set of classes (and possibly other subsystems) that collaborate to fulfill a set of responsibilities [Wirfs-Brock, Wilkerson, and Wiener, 1990].

Figure 3 shows only the part of the schedule model that is strongly related to the scheduling methods.

Precisely speaking, different configurations of the same system are installed in the two factories.

The code sizes of such visual components are excluded from the statistics in this paper.