by Kenneth P. Birman and Robbert van Renesse
Surfing the Internet is no longer just a seductive pastime. In rising numbers, organizations of all kinds, from computer companies to publishing firms, are turning to on-line services that operate much like the World Wide Web. These services can help manage important information, speed decision making and improve efficiency. But as more and more enterprises become dependent on this new technology, many are also exposed to the downside of computer networking. The drawbacks are particularly evident to users of distributed computing systems, which link programs, servers and data files dispersed across an extended network of computers and terminals.
As every computer user knows, programs that operate across networks are prone to failure. Indeed, Leslie B. Lamport, a pioneer in distributed computing at Digital Equipment Corporation, defined a distributed computing system as "one in which the failure of a computer you didn't even know existed can render your own computer unusable." The Web is certainly not exempt from breakdowns, as one of us (Birman) explains in a related story. During late 1995, users of the Web reported several "brownouts," when communication on the Internet was largely impossible. Such lapses have been variously attributed to software errors, excessive traffic on transmission lines, and overload or complete failure of the Web servers, which are computers that store the documents users access from their workstations. Most likely, a combination of factors contributes to brownouts. Unfortunately, similar events will multiply as computer networks continue to expand, not just the World Wide Web but also the distributed computing systems serving banks, schools and many offices.
When computers crash, sometimes the only casualties are the user's time and temper. If the automated bank teller nearest you is not working, the one across the street from it may be. But the shutdown of very complicated networks can have dire consequences. On July 15, 1994, the NASDAQ stock exchange, an exclusively electronic stock market, opened two hours late because of a mysterious problem that compromised the entire system. Initially, workers thought a software bug triggered the shutdown, but the error was ultimately traced to a malfunctioning disk. Because trading was delayed only for a few hours, little revenue was lost. Yet the event could have been a catastrophe: the market would have faced enormous losses had trading not resumed when it did.
In another example, from January 1990, the AT&T telephone system experienced a large-scale outage when an electronic switch within the system failed. Calls were automatically shifted to a backup switch, but this switch also crashed because of a software bug. The failure rippled through the network, resulting in a nationwide shutdown that lasted for nine hours, during which 60,000 people lost all telephone service and 70 million telephone calls could not be completed. To anyone familiar with the challenges of managing even a simple network, the surprise is not that these mishaps occur but rather that they are not more frequent.
Even a brief distributed systems failure can pose a significant problem for applications that require around-the-clock operation. Air-traffic-control and financial computer networks must be exceedingly reliable and constantly updated. A message that a "host is not responding" or a misleading display that shows out-of-date information about an airplane's flight path or a stock's price could easily provoke an accident or financial misadventure. As the way people live and work continues to be transformed, the security and stability of their finances, property and even health will increasingly depend on distributed computer systems. Thus, although it is easy to talk about the potential benefits of the information superhighway, we believe the bridges that link computers must be inspected more closely. Various computer scientists, including the two of us, have been working since the late 1970s on developing software to improve distributed computer networks, making them more secure and resistant to failure, an activity that people in the field refer to as designing robust distributed computing systems.
Why do distributed systems crash? If we exclude systems that fail because they were mismanaged or poorly designed, the most common scenario involves an isolated problem at one site that triggers a chain of events in which program after program throughout the network eventually shuts down. One response to this threat might be to strengthen individual components, incorporating computers and disks specially designed to tolerate faults, for example. But ceilings can still leak, causing short circuits; power can fluctuate; and communications connections can be inadvertently cut. Acts of sabotage by hackers or disgruntled employees can also endanger distributed systems. Although engineers and computer programmers can improve the durability of hardware and software, no computer can ever be made completely reliable.
Even if every component of a system were extremely dependable, the story would not end there. Merely interconnecting reliable computers and bug-free programs does not yield a robust distributed system. Instead it produces a network that works well under most conditions. Electronic-mail programs, bulletin boards and the Web were designed using components that, considered individually, are very trustworthy. Yet these systems frequently freeze when anything unexpected happens to an individual component of the system; for instance, the system may crash when one machine or a communications line becomes overloaded. Some additional form of protection is therefore needed.
During the past two decades, programmers have attacked the dependability problem by developing fault-tolerant software-programs that allow computer systems to restore normal operation even when problems occur. The technique eliminates the chains of internal dependencies that link the operation of a system as a whole to the operation of any single component. The resulting systems do not need to shut down even if some sites go off-line. Instead they resume service by rapidly reconfiguring to work around crashed servers.
Computer scientists refer to these arrangements as highly available distributed systems. Because these systems are designed to replicate critical information continuously and to distribute multiple backup copies among their individual computers, they can adapt to changing conditions: a malfunctioning disk drive at one site, an overload at another, a broken communications connection and so forth. As long as failures do not occur so often that the software lacks time to react, these systems can respond by pulling up from elsewhere a duplicate copy of a needed file or a replica of an on-line program. In this way, a system as a whole remains available and, ideally, provides uninterrupted service to the users still connected.
A simple and popular method of building a highly available distributed system involves a primary and a backup system. If the primary machine fails, the backup can be called into service. Switching between the two is easy if the data never change. The conversion becomes difficult, however, if data or files change while the system is running. And in an extensive network of servers, data, files and programs, it can be difficult to distinguish between a system that has genuinely crashed and one that is merely experiencing communications difficulties.
Suppose that a computer is trying to update information on both the primary and backup servers, but one of them stops responding to messages. If the problem is merely in the communications lines, the messages will get through, given enough time. But if the server has actually failed, the computer doing the updating would wait indefinitely; in the meantime, the system would be unavailable. If the computer trying to carry out the update inappropriately stops waiting and sends the update to only one server, however, the primary and the backup will no longer be identical. Errors will arise if the system attempts to use the outdated server.
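The divergence problem described above can be made concrete with a toy sketch. All the names here (`Server`, `update_both`, the `reachable` flag) are hypothetical illustrations, not any real system's API; the `reachable` flag stands in for either a crash or a severed link, which the updating computer cannot tell apart.

```python
class Server:
    """A toy server holding one replicated data set."""
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable   # False simulates a crash OR a cut link
        self.data = {}

    def apply(self, key, value):
        if not self.reachable:
            return False             # the update is never acknowledged
        self.data[key] = value
        return True

def update_both(primary, backup, key, value):
    """Try to update both copies; report which ones succeeded."""
    ok_primary = primary.apply(key, value)
    ok_backup = backup.apply(key, value)
    # If the results differ, the replicas have diverged, and the client
    # cannot tell whether the silent server crashed or is merely slow.
    return ok_primary, ok_backup

primary = Server("primary")
backup = Server("backup", reachable=False)   # simulate an unresponsive backup
print(update_both(primary, backup, "price", 42))   # -> (True, False)
```

Once `update_both` returns mismatched results, the two copies no longer agree, which is exactly the state in which using the outdated server produces errors.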
The NASDAQ financial market illustrates one way to resolve this conundrum. The network has two central trading servers. To prevent confusion, only one is active at any given time. The NASDAQ operators themselves decide when to switch to the replacement server. Unfortunately, very few distributed systems can rely on the wisdom of a human operator to detect failures and then to switch the entire network from one server to another. Rather programmers must automate this decision so that the transition can occur seamlessly.
Moreover, highly available distributed systems often have large numbers of servers and programs. Consequently, these systems typically maintain a membership list, which keeps track of every program, noting whether it is working or not. If a program is unresponsive for any reason, it is marked as faulty. By recognizing a failure at one site, the system can then reconfigure itself and redirect work to operational sites.
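A minimal sketch of such a membership list follows, using a heartbeat-counting scheme for illustration; the class and method names are our own invention, and real systems use far more careful failure detectors.

```python
class MembershipList:
    """Toy membership service: a member that misses too many
    consecutive heartbeat intervals is marked as faulty."""
    def __init__(self, members, max_missed=3):
        self.missed = {m: 0 for m in members}
        self.max_missed = max_missed

    def heartbeat(self, member):
        self.missed[member] = 0          # member responded; reset its counter

    def tick(self):
        """Called once per heartbeat interval; silent members accrue misses."""
        for m in self.missed:
            self.missed[m] += 1

    def operational(self):
        """Sites the system may still direct work to."""
        return [m for m, n in self.missed.items() if n < self.max_missed]

group = MembershipList(["a", "b", "c"])
for _ in range(3):
    group.tick()
    group.heartbeat("a")
    group.heartbeat("b")                 # "c" never responds
print(group.operational())              # -> ['a', 'b']
```

When `operational()` stops listing a site, the system reconfigures and redirects that site's work elsewhere, as the paragraph above describes.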
The NASDAQ system also demonstrates a second concern about reliability in distributed systems. The two-hour trading delay in 1994 could have been avoided if the operators had switched immediately to the backup. They opted to wait, however, because of concerns that a software bug might have caused the primary system to malfunction. If such a bug were present, the backup might also crash, just as the AT&T backup system did. Because it is impossible to guarantee that software is completely free of bugs, some form of protection is needed to reduce the risk that backup versions of a critical server will crash following the failure of a primary server.
Programmers have responded to this challenge with an approach known as active replication. In active replication, a system's software establishes redundant copies of vital programs or servers through the use of so-called process groups. A process group links a set of programs that cooperate closely. A distributed system may contain many process groups, and programs can belong to several of these groups. Each group is known by a name much like a file name and has its own list of current members. Most important, the process group provides a means for sending messages to its members. This message-passing function ensures that each member of the group receives every message in the same order, even if the sender crashes while transmitting the message.
If a particular program is necessary for maintaining availability, the system introduces a group of programs, each of which replicates the original. To update the data managed by the replicated program, the system sends a message to the process group. Each member reacts by updating its particular replica. Because all the programs see the same updates in the same order, they will remain in mutually consistent states.
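The consistency argument can be sketched in a few lines. This toy `ProcessGroup` uses a single sequencer (the shared log) to impose one global order on updates, an assumption we make for simplicity; real group-communication systems achieve ordered delivery with distributed protocols rather than a central log.

```python
class ProcessGroup:
    """Toy process group: every update is appended to one totally
    ordered log, and each replica applies the log entries in order,
    so all replicas pass through the same sequence of states."""
    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.log = []                      # the single global order

    def multicast(self, key, value):
        self.log.append((key, value))      # sequence the update
        for replica in self.replicas:      # deliver it to every member
            replica[key] = value

group = ProcessGroup(3)
group.multicast("AAPL", 100)
group.multicast("AAPL", 101)               # later update overwrites earlier one
# All members saw the same updates in the same order, so they agree:
assert group.replicas[0] == group.replicas[1] == group.replicas[2]
print(group.replicas[0])                    # -> {'AAPL': 101}
```

Reversing the order of the two multicasts at any single replica would leave it holding the stale price, which is precisely the inconsistency that ordered delivery rules out.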
Active replication enables a system to tolerate faults because any group member can handle any request: if one machine crashes, work can be redirected to an operational site. Furthermore, if a request does not alter data, one site can process the query rather than tie up the entire system. In this way, multiple tasks can be worked on at once by different programs, speeding up the application by employing parallel processing.
Of course, if all members of a process group handle an incoming message in the same erroneous manner, all the members could, in theory, crash simultaneously. Although it would seem that active replication should be vulnerable to such failures, this turns out not to be the case. Programmers have often observed that the errors most likely to be missed in testing software are those involving the order in which data are received. These bugs can be provoked only by unlikely sequences or timings of events. When a system employs active replication, the replicas do see the same updates in the same order; however, updates are only a small part of the requests a program sees. Most of the time, replicated programs work in parallel, with each program handling its own set of queries in a unique order. Thus, even if a software bug slips through testing and interferes with a few parts of a networked application, it is unlikely to cause all the members of any particular process group to crash at the same time.
The idea behind active replication is simple, but the software needed to support it is not. Managing dynamically changing membership lists and communicating with process groups is difficult, particularly in the face of inevitable crashes and lost messages. Although distributed computing has become commonplace over the past decade, active replication has only recently emerged from the laboratory.
Over the past few years, more than a dozen software teams have developed packages for robust distributed computing systems. All provide high availability through active replication, although they each differ somewhat in their emphasis. Some packages focus on speed, for example; others on the need for security.
Our research efforts at Cornell University contributed two such packages. One of us (Birman) headed the team that introduced Isis in 1987; more recently, the two of us worked on Horus, introduced in 1994. Information about Isis and Horus is available on the Cornell server; there is also a separate Horus site. The names "Isis" and "Horus" allude to Egyptian mythology. The goddess Isis helped to revive the god Osiris after he was torn to pieces in a battle with the war god Set; Horus was Isis's son, who eventually triumphed over Set. By analogy, the Isis and Horus packages can help restore a distributed system that has been disrupted by a failure.
Packages such as Isis provide a set of software functions, or "tools," that replicate and update data, keep track of process groups and assist in handling membership changes. Isis can also parcel out data processing among servers (a procedure known as load sharing). Distributed systems that make use of load sharing exhibit many of the advantages of parallel computing but without requiring special-purpose parallel computers. By dividing up incoming work among multiple servers functioning in concert, Isis enables systems to manage large tasks quickly. Also, if a particular application requires additional computing power, one can add an extra server, and the load-sharing technique will adapt itself to the new group size. The possibility, offered by tool kits such as Isis and Horus, of improving both performance and reliability often surprises developers: they tend to assume that making a system more robust will also make it slower and more expensive.
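The load-sharing idea can be illustrated with a toy round-robin dispatcher; this is our own simplified sketch, not the Isis interface, and real tool kits balance work using group membership and server load rather than a fixed rotation.

```python
import itertools

class LoadSharingGroup:
    """Toy load-sharing front end: read-only queries are handed out
    round-robin, so adding a server automatically spreads the work."""
    def __init__(self, servers):
        self.servers = list(servers)
        self._next = itertools.count()     # monotonically increasing counter

    def add_server(self, server):
        self.servers.append(server)        # the rotation adapts to the new size

    def route(self, query):
        server = self.servers[next(self._next) % len(self.servers)]
        return server, query

group = LoadSharingGroup(["s1", "s2"])
print([group.route(q)[0] for q in range(4)])   # -> ['s1', 's2', 's1', 's2']
group.add_server("s3")                          # extra capacity, no reconfiguration
print([group.route(q)[0] for q in range(3)])
```

Because each replica holds the same data (kept consistent by active replication), any of them can answer a query, and the group's throughput grows with its size.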
Active replication has been applied in a number of settings, including several telecommunications networks, stock markets, banks and brokerages. In Norway, researchers have developed an environmental monitoring system based on the technology. The French air-traffic-control agency is also exploring the technique for use in a new generation of air-traffic-control software. And manufacturing plants have used process groups to coordinate work and to reconfigure assembly lines when equipment is taken off-line to be serviced.
As computer scientists look to ever more demanding applications, however, they discover that active replication has important limitations. Load sharing is not always possible or desirable: some systems (notably, those in which data stored at a server change very rapidly) slow down when components are replicated. For example, in videoconferencing technology, active replication does improve the fault tolerance of the network of servers that must keep running even when some participants are cut off. But the technique would slow down the system-without improving dependability-if applied to the transmission of video data to remote users.
The need for flexibility motivated us to develop Horus. Like the Isis tool kit, Horus supports active replication, but it also provides much greater versatility. The basic strategy behind Horus is modularity, resembling that of a child's set of Legos: different building blocks of Horus can fit together in any combination to support the specific needs of a particular process group. One block might encrypt data so that hackers cannot break into the system. Another block might address potential communications failures that can arise when messages are lost or corrupted. Programmers using Horus decide which properties their system actually needs, permitting them to customize the system for its intended use. Furthermore, Horus can be extended with custom-designed blocks for special needs that we may not have encountered or anticipated in our own work.
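The Lego-block idea can be sketched as a stack of composable message layers. Everything here is hypothetical (the layer names, the trivial XOR "encryption" and one-byte checksum are purely illustrative stand-ins); Horus's actual protocol blocks implement real cryptography and recovery protocols.

```python
class Layer:
    """Base protocol block: passes messages through unchanged."""
    def down(self, msg): return msg     # transform an outgoing message
    def up(self, msg): return msg       # undo it on an incoming message

class EncryptLayer(Layer):
    """Toy 'encryption' block (XOR with a fixed key; illustrative only)."""
    KEY = 0x5A
    def down(self, msg): return bytes(b ^ self.KEY for b in msg)
    def up(self, msg): return bytes(b ^ self.KEY for b in msg)

class ChecksumLayer(Layer):
    """Block that appends a one-byte checksum and verifies it on receipt."""
    def down(self, msg): return msg + bytes([sum(msg) % 256])
    def up(self, msg):
        body, check = msg[:-1], msg[-1]
        assert sum(body) % 256 == check, "message corrupted in transit"
        return body

def send(stack, msg):
    for layer in stack:                 # push the message down the stack
        msg = layer.down(msg)
    return msg

def receive(stack, msg):
    for layer in reversed(stack):       # pop it back up in reverse order
        msg = layer.up(msg)
    return msg

stack = [ChecksumLayer(), EncryptLayer()]   # pick only the blocks you need
wire = send(stack, b"update price")
print(receive(stack, wire))                 # -> b'update price'
```

A group that needs neither security nor corruption detection would simply build its stack without those blocks, which is the flexibility the paragraph above describes.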
Horus has a growing group of users worldwide. At Cornell, Brian C. Smith has used it to build a videoconferencing system for "groupware" applications. Some projects related to Horus are described within the Horus site.
Our work on Isis and Horus has convinced us that careful planning can ensure the dependability of computer networks. But making the information superhighway robust may take more time and money than computer makers and users are willing to commit. Software for distributed applications is typically built with existing technology that was not designed for dependability. Moreover, researchers need to seek better methods for designing large-scale systems that are robust and that provide very high performance: a system that is extremely robust when accessed by 50 users simultaneously may turn out to be unacceptably slow and hence unreliable if 5,000 people do so.
Although programmers have applied the technology for robust distributed computing successfully in some instances, the public hears more about failures of nonrobust systems. For example, over the past few years, there have been dozens of reports on the problems with the current air-traffic-control system. In the fall of 1995 the Los Angeles system failed, leaving controllers unable to communicate with aircraft; a midair collision was avoided by seconds.
To make matters worse, updated air-traffic-control software, commissioned in 1982 by the Federal Aviation Administration, has been repeatedly delayed and scaled back. The FAA selected the original proposal precisely because of its innovative approach to distributed computing; now it seems the highly available and distributed aspects of the proposed software have been almost entirely eliminated. Yet air-traffic controllers criticize the existing system as dangerously inadequate, particularly because it lacks a distributed software architecture and has become undependable with age. Highly publicized fiascoes such as these have fueled a common perception that there is a crisis in computer software [see "Software's Chronic Crisis," by W. Wayt Gibbs; Scientific American, September 1994].
But if we are really in the midst of a software development crisis, it is perhaps as much a crisis of will as of means. Not all developers are concerned with making their networking software robust, and the public pressure for reliability does not seem to extend beyond a few especially sensitive applications. Indeed, companies that market distributed computing packages often state in product licenses that their technologies may not be dependable enough for use in critical applications, implying that reliability is not a reasonable objective. In our opinion, this situation is analogous to the unlikely prospect of automakers selling cars with the warning that vehicles are unsafe for use on highways. The computer equivalents of safety belts and air bags are infrequently applied to software development. And the desire for sophisticated, user-friendly interfaces as well as improved speed and performance tends to dominate the attention of both software developers and the people who use the programs.
Reliability often conjures up an image of slow, ponderous computer systems that is incompatible with the allure of effortless and instantaneous access to information on the data superhighway. Yet robust technology does not have to be slow and unpleasant to use: the Golden Gate Bridge is a model of stability as well as grace. With each passing hour, more and more uses are being found for the information bridges that link computers. Our enthusiasm to incorporate elegant electronic bridges in every conceivable application should not overshadow a reasonable degree of concern about whether or not such bridges will be able to support the resulting traffic of information. We believe that robust distributed systems provide a valuable tool for connecting computers quickly and dependably, creating opportunities for business and pleasure in the information society. But we also believe that in many cases, unless a distributed system can be engineered to function robustly, it may be better not to build-or use-one at all.