Evaluation Issues for Visual Programming Languages

Tim Menzies

Artificial Intelligence Department
School of Computer Science and Engineering
The University of NSW
tim@menzies.com
http://www.cse.unsw.edu.au/~timm

February 20, 1998

Abstract:

Visual programming systems have several advantages: (1) they are very motivating for beginners; (2) a spatial representation simplifies certain limited kinds of inference; (3) the use of ill-structured diagrams may assist in brainstorming. However, these three benefits may not be widely applicable. Many software engineering and knowledge engineering problems are not inherently spatial. Also, most VP tools do not support ill-structured diagrams. Lastly, diagrams are not necessarily superior explanation tools. In many cases, a study claiming some benefit for visual systems can be matched by a counter-study with the opposite result. Clearly, some variable is not being controlled for within these opposing studies. It is possible that the task of a system is more important than its presentation (visual or textual).

Introduction

Many knowledge acquisition systems use some sort of visual presentation. Kremer argues convincingly that such visual languages have numerous advantages for knowledge acquisition (KA) [Kremer, 1998]. Other researchers claim numerous benefits for visual frameworks. For example:

When we use visual expressions as a means of communication, there is no need to learn computer-specific concepts beforehand, resulting in a friendly computing environment which enables immediate access to computers even for computer non-specialists who pursue application [Hirakawa & Ichikawa, 1994].

The claim that pictures help to explain complicated knowledge seems intuitively obvious. But is it correct? Other widely held, intuitively obvious beliefs have been found to be incorrect, sometimes spectacularly so:

Galen's descriptions of human physiology, written around A.D. 200, were studied as virtual gospel for more than a millennium. However, no one thought to check Galen's descriptions until one upstart surgeon had the gall (pun intended) to pick up a scalpel and perform dissections for himself. Vesalius's De Humani Corporis Fabrica, published in 1543, showed that many of Galen's descriptions were inaccurate.
Henri Fayol told us in 1916 that managers plan, organise, co-ordinate and control. This model of management lasted nearly six decades until it was shown to be inadequate by the empirical observations of Henry Mintzberg [Mintzberg, 1975]. Mintzberg reported, for example:
- 56 U.S. foremen who averaged 583 activities in an eight-hour shift (one every 48 seconds).
- 160 British middle and top managers who worked for half an hour or more without interruption only once every two days.
These empirical observations of actual managerial behaviour do not fit Fayol's model of managers as systematic planners.

This article takes a critical look at the available evidence on the efficacy of visual programming (VP) systems. After an introduction to VP, we review theoretical studies and small-scale experimental studies that suggest an inherent utility in visual expressions. However, when we explore the available experimental evidence, we find numerous contradictory results. This exploration extends my previous arguments in this area [Menzies, 1996].

A (Brief) Introduction to Visual Programming

As a rough rule of thumb, a visual programming system is a computer system whose execution can be specified without scripting, except for entering unstructured strings such as "Monash University Banking Society" or simple expressions such as "X above 7". Visual representations have been used for many years (e.g. Venn diagrams) and even centuries (e.g. maps). Executable visual representations, however, arose only with the advent of the computer. With falling hardware costs, it has become feasible to build and interactively manipulate intricate visual expressions on the screen.

More precisely, a non-visual language is a one-dimensional stream of characters while a VP system uses at least two dimensions to represent its constructs [Brown & Kimura, 1994]. We distinguish between a pure VP system and a visually supported system:

A pure VP system must satisfy two criteria. Rule 1: the system must execute. That is, it is more than just a drawing tool for software or screen designs. Rule 2: the specification of the program must be modifiable within the system's visual environment. To satisfy this second criterion, the specification of the executing program must be configurable, and the modification must be more than (e.g.) merely setting numeric threshold parameters.
There also exists a class of VP systems that are not pure but are visually supported systems. Most commercial VP systems, such as Microsoft's VISUAL BASIC, Borland's DELPHI, and IBM's VISUAL AGE, are visually supported rather than pure VP systems.
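Rules 1 and 2 amount to a small decision procedure. The Python sketch below encodes it under our own assumptions. The boolean fields of the hypothetical `VPSystem` record are an illustrative simplification of "executes" and "modifiable within the visual environment". Labelling a non-executing tool as a "drawing tool" rather than a VP system is our reading of Rule 1, not a published test.

```python
from dataclasses import dataclass


@dataclass
class VPSystem:
    """Hypothetical summary of a visual system's capabilities (illustrative fields only)."""
    name: str
    executes: bool                 # Rule 1: more than a drawing tool
    visually_editable_spec: bool   # Rule 2: spec can be changed inside the visual environment
    edits_beyond_parameters: bool  # Rule 2: changes go beyond tweaking numeric thresholds


def classify(system: VPSystem) -> str:
    """Label a system using Rules 1 and 2 (our encoding, not a standard test)."""
    if not system.executes:
        # Fails Rule 1: treated as a drawing tool rather than a VP system at all.
        return "drawing tool"
    if system.visually_editable_spec and system.edits_beyond_parameters:
        return "pure VP"
    # Executes, but fails Rule 2: the visual environment cannot change the
    # program beyond (at most) parameter settings.
    return "visually supported"
```

Under this encoding, an animator whose animations can only be changed by textual authoring would come out as "visually supported", as would a tool whose visual interface only adjusts numeric thresholds.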

Arguments for the Advantages of VP

Many authors argue that VP systems are a better method for users to interact with a program. Green et al. [Green et al., 1991] and Moher et al. [Moher et al., 1993] summarise claims such as the above quote from [Hirakawa & Ichikawa, 1994] as the superlativist position; i.e. graphical representations are inherently superior to textual representations. Both the Green and Moher groups argue that this claim is not supported by the available experimental evidence. Further, they argue against claims that visual expressions offer greater information accessibility; for example:

Pictures are superior to texts in a sense that they are abstract, instantly comprehensible, and universal. [Hirakawa & Ichikawa, 1994]

My own experience with students using visual systems is that the visual environment is very motivating to students. Others have had the same experience:

The authors report on the first in a series of experiments designed to test the effectiveness of visual programming for instruction in subject-matter concepts. Their general approach is to have the students construct models using icons and then execute these models. In this case, they used a series of visual labs for computer architecture. The test subjects were undergraduate computer science majors. The experimental group performed the visual labs; the control group did not. The experimental group showed a positive increase in attitude toward instructional labs and a positive correlation between attitude towards labs and test performance. [Williams et al., 1993]

For another example of first-year students being motivated by a VP language, see [Glinert & Tanimoto, 1984] (pp. 18-19). However, motivating the students is only half of an educator's task. Educators also need to teach students general concepts that can be applied in different circumstances. The crucial test for VP systems is therefore whether they improve or simplify the task of comprehending some conceptual aspect of a program. If we extend the concept of VP systems to diagrammatic reasoning in general, then we can make a case that VP has some such benefits. Larkin and Simon [Larkin & Simon, 1987] distinguish between:

Sentential representations whose contents are stored in a fixed sequence; e.g. propositions in a text.
Diagrammatic representations whose contents are indexed by their position on a 2-D plane.

While these two representations may contain the same information, their computational efficiency may differ. Larkin and Simon model a range of problems in both diagrammatic and sentential representations using production rules. Several effects were noted:

Perceptual ease: Certain features are more easily extracted from diagrams than from sentential representations. For example, adjacent triangles are easy to find visually, but require a potentially elaborate search through a sentential representation.
Locality aids search: Diagrams can group together related concepts. Diagrammatic inference can use the information near the current focus to solve the current problem. Sentential representations may store related items in separate areas, thus requiring extensive search to link concepts.
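To make the locality effect concrete, here is a minimal Python sketch. It is our own illustration, not Larkin and Simon's production-rule models. The same objects are stored twice. First, as a flat sequence of hypothetical `("at", object, x, y)` propositions. Second, as a dictionary keyed by grid position. Each function counts how many facts or cells it must inspect to find an object's neighbours. The `where` index is an assumption standing in for the viewer already knowing where the current focus is on the diagram.

```python
from typing import Dict, List, Tuple

Fact = Tuple[str, str, int, int]   # hypothetical proposition: ("at", object, x, y)
Pos = Tuple[int, int]


def neighbours_sentential(facts: List[Fact], name: str) -> Tuple[List[str], int]:
    """Find objects one step away from `name` by scanning a fixed sequence of facts."""
    steps = 0
    here = None
    for _, obj, x, y in facts:            # first search: where is the object?
        steps += 1
        if obj == name:
            here = (x, y)
            break
    if here is None:
        return [], steps
    found = []
    for _, obj, x, y in facts:            # second search: test every fact for adjacency
        steps += 1
        if abs(x - here[0]) + abs(y - here[1]) == 1:
            found.append(obj)
    return found, steps


def neighbours_diagrammatic(grid: Dict[Pos, str], where: Dict[str, Pos],
                            name: str) -> Tuple[List[str], int]:
    """Find the same objects by probing only the four cells around the focus."""
    x, y = where[name]
    steps = 0
    found = []
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        steps += 1
        obj = grid.get((x + dx, y + dy))
        if obj is not None:
            found.append(obj)
    return found, steps
```

Suppose 100 objects are placed in a row at positions (i, 0). Finding the neighbours of `obj50` then inspects 151 facts in the sentential version but probes only 4 cells in the diagrammatic one. Both return the same two neighbours.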

A common internal representation for VP systems is one that preserves physical spatial relationships. For example, Narayanan et al. [Narayanan et al., 1995] use Glasgow's array representation [Glasgow et al., 1995] to reason about device behaviors. In an array representation, physical objects are mapped into a 2-D grid. Adjacency and containment of objects can be inferred directly from such a representation. Inference engines can then be augmented with diagrammatic reasoning operators which execute over the array (e.g. boundary following, rotation).
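A toy version of an array representation might look like the following Python sketch. It is not Glasgow's or Narayanan et al.'s implementation. The grid size, the object names (tank, valve, pipe) and the use of rectangles are hypothetical. Each cell records which objects occupy it, so adjacency and containment can be read straight off the grid. Rotation is an operator over the whole array.

```python
from typing import List, Set, Tuple

Grid = List[List[Set[str]]]   # each cell holds the labels of the objects occupying it


def make_grid(rows: int, cols: int) -> Grid:
    return [[set() for _ in range(cols)] for _ in range(rows)]


def place(grid: Grid, label: str, r0: int, c0: int, r1: int, c1: int) -> None:
    """Mark the rectangle rows r0..r1, columns c0..c1 (inclusive) as occupied by label."""
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            grid[r][c].add(label)


def cells(grid: Grid, label: str) -> Set[Tuple[int, int]]:
    return {(r, c) for r, row in enumerate(grid)
            for c, cell in enumerate(row) if label in cell}


def contains(grid: Grid, outer: str, inner: str) -> bool:
    """Containment: every cell of `inner` is also a cell of `outer`."""
    inner_cells = cells(grid, inner)
    return bool(inner_cells) and inner_cells <= cells(grid, outer)


def adjacent(grid: Grid, a: str, b: str) -> bool:
    """Adjacency: the objects do not overlap, but some cells touch edge-on."""
    ca, cb = cells(grid, a), cells(grid, b)
    if ca & cb:
        return False
    return any((r + dr, c + dc) in cb for r, c in ca
               for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))


def rotate_cw(grid: Grid) -> Grid:
    """Diagrammatic operator: rotate the whole array 90 degrees clockwise."""
    return [[set(cell) for cell in row] for row in zip(*grid[::-1])]
```

Because `rotate_cw` moves cells rather than editing any stored relation, adjacency and containment still hold in the rotated array without any further inference.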

Other authors have argued that diagrams are useful for more than just spatial reasoning. Koedinger [Koedinger, 1992] argued that diagrams can support and optimise reasoning since they can model whole-part relations. Kindfield [Kindfield, 1992] studied how diagram use changes with expertise level. According to Kindfield, diagrams are like a temporary swap space which we can use to store concepts that (1) don't fit into our head right now and (2) can be swapped in rapidly; i.e. with a single glance. Goel [Goel, 1992] studied the use of ill-structured diagrams at various phases of the design process. In a well-structured diagram (e.g. a picture of a chess board), each visual element clearly denotes one thing of one class only. In an ill-structured diagram (e.g. an impressionistic charcoal sketch), the denotation and type of each visual element is ambiguous. In the Goel study, subjects explored

preliminary design,
design refinement, and
design detailing

using a well-structured diagramming tool (MacDraw) and an ill-structured diagramming tool (freehand sketches using pencil and paper). Freehand sketching generated many variants. However, the well-structured tool seemed to inhibit new ideas rather than help organise them. Once something was recorded in MacDraw, that was the end of the evolution of that idea.

One gets the feeling that all the work is being done internally and recorded after the fact, presumably because the external symbol system (MacDraw) cannot support such operations. [Goel, 1992]

Goel found that ill-structured tools generated more design variants (i.e. more drawings, more ideas, more use of old ideas) than well-structured tools. We draw two conclusions from Goel's work. Firstly, at least for preliminary design, ill-structured tools are better. Secondly, after the brainstorming process is over, well-structured tools can be used to finalise the design.

Evaluating the Arguments for VP

It is not clear which of the above advantages apply to general software or knowledge engineering. Many software engineering or knowledge engineering problems are not naturally two-dimensional. For example, while we may draw an entity-relationship diagram on a flat sheet of paper, the inferences we can make from that diagram do not depend on the physical position of (e.g.) an entity.
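The contrast with the array representation can be sketched in a few lines of Python. The entities, relationships and coordinates below are hypothetical. The inference function never reads the coordinates, so any two layouts of the same E-R diagram yield the same conclusions.

```python
from typing import Dict, List, Set, Tuple

Layout = Dict[str, Tuple[float, float]]   # where each entity box happens to be drawn
Relationship = Tuple[str, str, str]       # (entity, relationship name, entity)


def related(rels: List[Relationship], entity: str) -> Set[Tuple[str, str]]:
    """All (relationship, other entity) pairs in which `entity` takes part."""
    out: Set[Tuple[str, str]] = set()
    for a, name, b in rels:
        if a == entity:
            out.add((name, b))
        if b == entity:
            out.add((name, a))
    return out


def diagram_inferences(rels: List[Relationship],
                       layout: Layout) -> Dict[str, Set[Tuple[str, str]]]:
    """Infer each drawn entity's relationships; the coordinates in `layout` are never read."""
    return {entity: related(rels, entity) for entity in layout}
```

A tidy left-to-right layout and a scattered one produce identical results. In the grid representation, by contrast, moving one object changes what is adjacent to what.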

In terms of the ill-structured/well-structured division, the VP tools I have seen in the KA field are all well-structured tools. That is, they are less suited to brainstorming than to producing the final product.

Jarvenpaa and Dickson (hereafter, JD) report an interesting pattern in the VP literature [Jarvenpaa & Dickson, 1988]. In their literature review on the use of graphics for supporting decision making, they find that most of the proponents of graphics have never tested their claims. Further, when those tests are performed, the results are contradictory and inconclusive. For example:

JD cite 11 publications arguing for the superiority of graphics over tables for the purposes of elementary data operations (e.g. showing deviations, summarising data). None of these publications tested their claims. Such tests were performed by 13 other publications, which concluded that graphics were better than tables (37.5 percent), the same as tables (25 percent), or worse than tables (37.5 percent).
JD cite 11 publications arguing for the superiority of graphics over tables for the purposes of decision making (e.g. forecasting, planning, problem finding). None of these publications tested their claims. Such tests were performed by 14 other publications, which concluded that graphics were better than tables (27 percent), the same as tables (46 percent), or worse than tables (27 percent).

Similar contradictory results can be found in the study of control-flow and data-flow systems.

Shneiderman [Shneiderman, 1983] studied the utility of flowcharts for improving program comprehension, debugging, and extensibility. He found no difference in performance between subjects who did and did not use control-flow diagrams.
On the other hand, recent results have been more positive [Scanlan, 1989].
Studies have reported that Petri nets are worse specification languages than pseudo-code [Boehm-Davis & Fregly, 1985] or E-R diagrams [Swigger & Brazile, 1989].
On the other hand, another study suggests that Petri nets are better than E-R diagrams for the maintenance of large expert systems [Swigger & Brazile, 1989].

Given these conflicting results, all we can conclude at this time is that the utility of control-flow or data-flow visual expressions is an open issue.

In other studies, the Green group explored two issues: superlativism and information accessibility (defined above). Subjects attempted comprehension tasks using both visual and textual expressions of a language. The Green group rejected the superlativism hypothesis when they found that tasks took longer using the graphical expressions than the textual expressions. The Green group also rejected the information accessibility hypothesis when they found that novices had more trouble than experts reading the information in the visual expressions. That is, the information in a diagram is not instantly comprehensible and universal. Rather, such information can only be accessed after a training process.

The Moher group performed a similar study to the Green group. In part, the Moher study used the same stimulus programs and question text as the Green group. Whereas the Green group used the LABVIEW data-flow system, the Moher group used Petri nets. The results of the Moher group echoed the results of the Green group. Subjects were shown three variants on a basic Petri net formalism. In no instance did these graphical languages outperform their textual counterparts.

The Moher group caution against making an alternative superlativism claim for text; i.e. that text is better than graphics. Both the Moher and Green groups distinguished between sequential programming expressions such as a decision tree and circumstantial programming expressions such as a backward-chaining production rule. Both sequential and circumstantial programs can be expressed textually and graphically. The Moher group comments that:

Not only is no single representation best for all kinds of programs, no single representation is ... best for all tasks involving the same program. [Moher et al., 1993]

Sequential programs are useful for reasoning forwards to perform tasks such as prediction. Circumstantial programs are output-indexed; i.e. the thing you want to achieve is accessible separately from the method of achieving it. Hence, they are best used for hypothesis-driven tasks such as debugging.
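The sequential/circumstantial distinction can be illustrated with one small body of knowledge written both ways. The lamp domain, the `RULES` table and the function names below are hypothetical. They are chosen only to show why a decision tree suits forward prediction, while conclusion-indexed rules suit hypothesis-driven tasks such as debugging.

```python
from typing import Dict, List

# Hypothetical domain: is a lamp lit, given three yes/no conditions?


def predict(power: bool, switch_on: bool, bulb_ok: bool) -> str:
    """Sequential form: a decision tree read forwards from inputs to an output."""
    if power:
        if switch_on:
            return "lit" if bulb_ok else "dark"
        return "dark"
    return "dark"


# Circumstantial form: production rules indexed by their conclusion (output-indexed).
RULES: Dict[str, List[Dict[str, bool]]] = {
    "dark": [{"power": False}, {"switch_on": False}, {"bulb_ok": False}],
    "lit": [{"power": True, "switch_on": True, "bulb_ok": True}],
}


def explanations(observed: str, known: Dict[str, bool]) -> List[Dict[str, bool]]:
    """Backward step: rules concluding `observed` that do not contradict what is known."""
    return [cond for cond in RULES.get(observed, [])
            if all(known.get(k, v) == v for k, v in cond.items())]
```

Calling `explanations("dark", {"power": True})` goes straight to the two remaining candidate causes (switch off, broken bulb). Answering the same question with `predict` would mean enumerating input combinations and running the tree forwards on each.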

VP as Explanation

The core of the case for VP is something like "VP lets us explain the inner workings of a system at a glance". This section explores the issue of VP and explanation using the BALSA system.

In the BALSA animator system [Brown & Sedgewick, 1985], students can (e.g.) contrast the various sorting algorithms by watching them in action. Note that animation is more than just tracing the execution of a program. Animators aim to explain the inner workings of a program. Extra explanatory constructs may be needed on top of the programming primitives of that system. For example, when BALSA animates different sorting routines, special visualisations are offered for arrays of numbers and the relative sizes of adjacent entries.
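The difference between a trace and an animation can be sketched in Python as a text-mode stand-in for what BALSA does graphically. This is not BALSA's code or interface. The frame format, the `#` bars and the `*` markers are our own assumptions. Each frame redraws every entry as a bar and marks the adjacent pair under comparison. It also states the pair's relative size, which is the kind of explanatory construct layered on top of the algorithm itself.

```python
from typing import Iterator, List


def bubble_sort_frames(data: List[int]) -> Iterator[str]:
    """Yield one text 'frame' per comparison of adjacent entries during a bubble sort."""
    a = list(data)
    n = len(a)
    for end in range(n - 1, 0, -1):
        for i in range(end):
            left, right = a[i], a[i + 1]
            swap = left > right
            relation = ">" if swap else "<="
            lines = []
            for j, v in enumerate(a):
                marker = "*" if j in (i, i + 1) else " "   # highlight the compared pair
                lines.append(f"{marker} {v:3d} {'#' * v}")  # draw each entry as a bar
            lines.append(f"compare a[{i}]={left} {relation} a[{i + 1}]={right}"
                         + (" -> swap" if swap else ""))
            yield "\n".join(lines)
            if swap:
                a[i], a[i + 1] = right, left
```

For the input `[3, 1, 2]` the generator yields three frames, the first ending with `compare a[0]=3 > a[1]=1 -> swap`. Each frame shows the array state just before that comparison.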

Animators like BALSA may or may not be pure VP systems. BALSA does not allow the user to modify the specification of the animation. Doing so requires extensive textual authoring by the developer. BALSA therefore does not satisfy Rule 2 of a pure VP system (defined above).

One drawback with the BALSA system is that its explanations must be hand-crafted for each task. General principles for explanation systems are widely discussed in AI. Wick and Thompson [Wick & Thompson, 1992] report that the current view of explanation is more elaborate than merely "print the rules that fired" or the "how" and "why" queries of traditional rule-based expert systems. Explanation is now viewed as an inference procedure in its own right rather than a pretty-print of some filtered trace of the proof tree. In the current view, explanations should be customised to the user and the task at hand. For example:

Paris [Paris, 1989] describes an explanation algorithm that switches from process-based explanations to parts-based explanations whenever the explanation procedure enters a region which the user is familiar with.
Leake [Leake, 1993] selects what to show the user using eight runtime algorithms. For example, when the goal of the explanation is to minimise undesirable effects, the selected structures are any pre-conditions to anomalous situations. Leake's explanation algorithms require both a cache of prior explanations and (like Paris) an active user model.
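A minimal Python sketch of Paris-style switching follows. It assumes a user model that is simply the set of topics the user already knows. The component hierarchy and its descriptions are hypothetical. This is not Paris's generation algorithm. It only shows the control decision: process-based text for unfamiliar regions, and a terser parts-based account once the walk enters familiar ground.

```python
from typing import Dict, List, Set

# Hypothetical component hierarchy and process descriptions.
PARTS: Dict[str, List[str]] = {
    "sorter": ["comparator", "swapper"],
    "comparator": [],
    "swapper": [],
}
PROCESS: Dict[str, str] = {
    "sorter": "repeatedly compares adjacent entries and swaps any out-of-order pair",
    "comparator": "reports which of two entries is larger",
    "swapper": "exchanges two entries in the array",
}


def explain(topic: str, familiar: Set[str]) -> List[str]:
    """Tailor an explanation to a user model (the set of topics the user already knows)."""
    if topic in familiar:
        # Familiar region: switch to a parts-based account and stop descending.
        parts = PARTS.get(topic, [])
        return [f"{topic}: consists of {', '.join(parts)}" if parts else f"{topic}: (known)"]
    # Unfamiliar region: describe the process, then explain each part in turn.
    lines = [f"{topic}: {PROCESS[topic]}"]
    for part in PARTS.get(topic, []):
        lines.extend(explain(part, familiar))
    return lines
```

A novice (empty user model) gets three process descriptions. A user who already knows the sorter gets the single line `sorter: consists of comparator, swapper`.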

Summarising the work of Wick and Thompson, Leake, and Paris, I diagnose the reason for the lack of generality in BALSA's explanation system as follows. BALSA's explanation systems were hard to maintain since BALSA lacked:

(1) The ability to generate multiple possible explanations;
(2) An explicit user model;
(3) A library of prior explanations;
(4) A mechanism for using (2) and (3) to selectively filter (1) according to who is viewing the system.
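The four missing ingredients might fit together roughly as in the following Python sketch. Everything here is hypothetical and deliberately simplified. There are two toy explanation generators. The user model is just a set of known concepts. The library is just a set of explanation texts that worked before. The numbered comments follow the order of the list above. This is not how Paris or Leake implement their systems.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set


@dataclass(frozen=True)
class Explanation:
    text: str
    assumes: FrozenSet[str]   # concepts a reader needs in order to follow this explanation


Generator = Callable[[str], List[Explanation]]


# (1) Hypothetical generators, each proposing a different style of explanation.
def trace_style(topic: str) -> List[Explanation]:
    return [Explanation(f"{topic}: step through each comparison and swap",
                        frozenset({"array"}))]


def invariant_style(topic: str) -> List[Explanation]:
    return [Explanation(f"{topic}: after pass k the last k entries are in final position",
                        frozenset({"array", "loop invariant"}))]


def explain(topic: str, generators: List[Generator],
            user_knows: Set[str],   # (2) explicit user model
            library: Set[str],      # (3) texts of explanations that worked before
            ) -> List[Explanation]:
    candidates = [e for g in generators for e in g(topic)]          # (1) many candidates
    usable = [e for e in candidates if e.assumes <= user_knows]     # (4) filter using (2)
    return sorted(usable, key=lambda e: e.text not in library)      # (4) prefer (3); stable sort
```

A novice who knows only "array" receives just the trace-style explanation. A user who also knows "loop invariant", and for whom the invariant explanation is in the library, receives both, with the invariant one first.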

Summary

On the positive side, visual systems are more motivating for beginners than textual systems. In the case of spatial reasoning problems, a picture may indeed be worth 10,000 words [Larkin & Simon, 1987]. Given some 2-D representation of a problem (e.g. an array representation), spatial reasoning can make certain inferences very cheaply. Also, ill-structured diagramming tools are very useful for brainstorming ideas.

On the negative side, beyond the above three specific claims, the general superlativist case for VP is not very strong. Many software engineering and knowledge engineering problems are not inherently spatial. Most of the VP systems I am aware of do not support Goel's ill-structured approach to brainstorming. The JD research suggests that claims of the efficacy of VP systems have been poorly documented. The Moher and Green groups argue that VP evaluations cannot be made in isolation from the task of the system being studied. Lastly, a diagram does not necessarily make knowledge more accessible. A good explanation device requires far more than impressive graphics (recall the BALSA case study). Like many of our current approaches for knowledge engineering [Menzies, 1997b, Menzies et al., 1997, Menzies, 1997a], VP systems need to be better evaluated.

References

Boehm-Davis & Fregly, 1985: Boehm-Davis, D. & Fregly, A. (1985). Documentation of Concurrent Programs. Human Factors, 27:423-432.
Brown & Sedgewick, 1985: Brown, M. & Sedgewick, R. (1985). Techniques for Algorithm Animation. IEEE Software, pages 28-39.
Brown & Kimura, 1994: Brown, T. & Kimura, T. (1994). Completeness of a Visual Computation Model. Software - Concepts and Tools, pages 34-48.
Glasgow et al., 1995: Glasgow, J., Narayanan, N. H., & Chandrasekaran, B. (Eds.) (1995). Diagrammatic Reasoning: Cognitive and Computational Perspectives. MIT Press.
Glinert & Tanimoto, 1984: Glinert, E. & Tanimoto, S. (1984). Pict: An Interactive Graphical Programming Environment. IEEE Computer, pages 7-25.
Goel, 1992: Goel, V. (1992). "Ill-Structured Diagrams" for Ill-Structured Problems. In Proceedings of the AAAI Symposium on Diagrammatic Reasoning, Stanford University, March 25-27, pages 66-71.
Green et al., 1991: Green, T., Petre, M., & Bellamy, R. (1991). Comprehensibility of Visual and Textual Programs: The Test of Superlativism Against the "Match-Mismatch" Conjecture. In Empirical Studies of Programmers: Fourth Workshop, pages 121-146.
Hirakawa & Ichikawa, 1994: Hirakawa, M. & Ichikawa, T. (1994). Visual Language Studies - A Perspective. Software - Concepts and Tools, pages 61-67.
Jarvenpaa & Dickson, 1988: Jarvenpaa, S. & Dickson, G. (June 1988). Graphics and Managerial Decision Making: Research Based Guidelines. Communications of the ACM, 31(6):764-774.
Kindfield, 1992: Kindfield, A. (1992). Expert Diagrammatic Reasoning in Biology. In Proceedings of the AAAI Symposium on Diagrammatic Reasoning Stanford University, March 25-27, pages 41-46.
Koedinger, 1992: Koedinger, K. (1992). Emergent Properties and Structural Constraints: Advantages of Diagrammatic Representations for Reasoning and Learning. In Proceedings of the AAAI Symposium on Diagrammatic Reasoning Stanford University, March 25-27, pages 154-159.
Kremer, 1998: Kremer, R. (1998). Visual Languages for Knowledge Representation. Submitted to KAW'98: Eleventh Workshop on Knowledge Acquisition, Modeling and Management, Voyager Inn, Banff, Alberta, Canada. To appear.
Larkin & Simon, 1987: Larkin, J. & Simon, H. (1987). Why a Diagram is (Sometimes) Worth Ten Thousand Words. Cognitive Science, pages 65-99.
Leake, 1993: Leake, D. (1993). Focusing Construction and Selection of Abductive Hypotheses. In IJCAI '93, pages 24-29.
Menzies, 1996: Menzies, T. (1996). Visual Programming, Knowledge Engineering, and Visual Programming. In Proceedings of the Eighth International Conference on Software Engineering and Knowledge Engineering. Knowledge Systems Institute, Skokie, Illinois, USA. ISBN 0-9641699-3-2. Available from http://www.cse.unsw.edu.au/~timm/pub/docs/96seke.ps.gz.
Menzies, 1997a: Menzies, T. (1997a). Evaluating Issues with Critical Success Metrics. In Banff KA '98 workshop. Available from http://www.cse.unsw.EDU.AU/~timm/pub/docs/97evalcsm.
Menzies, 1997b: Menzies, T. (1997b). Evaluation Issues for Problem Solving Methods. Banff KA workshop, 1998. Available from http://www.cse.unsw.edu.au/~timm/pub/docs/97eval.
Menzies et al., 1997: Menzies, T., Cohen, R., Waugh, S., & Goss, S. (1997). Evaluating Conceptual Qualitative Modeling Languages. Submitted to the Banff KAW '98 workshop. Available from http://www.cse.unsw.EDU.AU/~timm/pub/docs/97evalcon.
Mintzberg, 1975: Mintzberg, H. (1975). The Manager's Job: Folklore and Fact. Harvard Business Review, pages 29-61.
Moher et al., 1993: Moher, T., Mak, D., Blumenthal, B., & Leventhal, L. (1993). Comparing the Comprehensibility of Textual and Graphical Programs: The Case of Petri Nets. In Empirical Studies of Programmers: Fifth Workshop, pages 137-161.
Narayanan et al., 1995: Narayanan, N. H., Suwa, M., & Motoda, H. (1995). Behaviour Hypothesis from Schematic Diagrams. In Glasgow, J., Narayanan, N. H., & Chandrasekaran, B. (Eds.), Diagrammatic Reasoning, pages 501-534. The AAAI Press.
Paris, 1989: Paris, C. (1989). The Use of Explicit User Models in a Generation System for Tailoring Answers to the User's Level of Expertise. In Kobsa, A. & Wahlster, W., (Eds.), User Models in Dialog Systems, pages 200-232. Springer-Verlag.
Scanlan, 1989: Scanlan, D. (1989). Structured Flowcharts Outperform Pseudocode: An Experimental Comparison. IEEE Software, 6(5):28-36.
Shneiderman, 1983: Shneiderman, B. (1983). Direct Manipulation: A Step Beyond Programming Languages. Computer, pages 57-69.
Swigger & Brazile, 1989: Swigger, K. & Brazile, R. (1989). Experimental Comparisons of Design/Documentation Formats for Expert Systems. International Journal of Man-Machine Studies, 31:47-60.
Wick & Thompson, 1992: Wick, M. & Thompson, W. (1992). Reconstructive Expert System Explanation. Artificial Intelligence, 54:33-70.
Williams et al., 1993: Williams, M., Ledder, W., Buehler, J., & Canning, J. (1993). An Empirical Study of Visual Labs. In Proceedings 1993 IEEE Symposium on Visual Languages, pages 371-373. IEEE Comput. Soc. Press.