Computer hardware increases in speed and capacity by factors of thousands each decade; computer software piles on new features and fancier interfaces nearly as fast. So why do computers still waste our time and drive us crazy?

One quarter of those under age 25 polled in a recent British survey said they had kicked their computers or seen friends do so. And the cost of sophisticated networked systems (on which nearly all large organizations are coming to depend) is now dominated not by ever-cheaper hardware and software but by the rising salaries of the gurus who can keep it all up and running. According to a study published in March 2002 by researchers at the University of California at Berkeley, the labor costs outstrip equipment by factors of three to 18, depending on the type of system. And one third to one half the total budget is spent preventing or recovering from crashes. And no wonder: a system failure at a brokerage or credit-card authorization center can run up millions of dollars per hour in lost business.

Computer Crisis

"There is no less than a crisis today in three areas: cost, availability and user experience," says Robert Morris, director of IBM's Almaden Research Laboratory. At a conference in Almaden, Calif., last month, research leaders from most of the largest computer companies and several universities agreed on the problem as it was sketched out in a "manifesto" released last October by IBM. "The growing complexity of the I.T. infrastructure threatens to undermine the very benefits information technology aims to provide," the anonymously authored manifesto asserts. The sheer number of computer devices is forecast to rise at a compound rate of 38 percent a year; most of these devices will be connected to one another and to the Internet. "Up until now, we've relied mainly on human intervention and administration to manage this complexity," the manifesto continues. "Unfortunately, we are starting to gunk up the works."

There is less agreement on the solution. IBM argues in its treatise that the goal should be "autonomic" computer systems analogous to the involuntary nervous system that allows the human body to cope with environmental change, external attack and internal failures. "Our bodies have great availability," Morris observes. "I have soft errors all the time: my memory fails once in a while, but I don't crash. My whole body doesn't shut down when I cut a finger."

Morris and the other heads of IBM's autonomic computing research effort have more in mind than just fault tolerance. The manifesto lists eight defining characteristics (listed below) of autonomic computing systems. Some have already been demonstrated in prototypes.

An autonomic system must have a sense of self, for example. It must keep track of its parts, some of which may be borrowed from or lent out to other systems. And it must keep its public and private parts segregated. At Columbia University, Gail Kaiser and colleagues in the Programming Systems Lab have worked out ways to add software probes, gauges and configuration controls to certain kinds of existing systems so that they can be monitored, tuned and even repaired automatically rather than by highly paid engineers.

Autonomic systems should also be able to heal, to recover from damage by some means other than a suicidal crash. Armando Fox and co-workers at Stanford University have demonstrated one way to accomplish this. Fox redesigned a satellite ground station system so that every subsystem can be rebooted independently if--or rather, when--it gets knocked offline. The system still goes down occasionally, but now it can resume operation in six seconds rather than 30. The same principle, called recursive restartability, could be applied to many kinds of complex systems to prevent small glitches from accumulating and cascading into full-blown outages.
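The idea behind recursive restartability can be illustrated with a small sketch. This is not the Stanford group's code: the component names, the restart times and the escalation rule below are all invented for illustration. The sketch assumes a system arranged as a tree, restarts only the failed subsystem first, and escalates to the parent if the fault persists.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(eq=False)
class Component:
    """A node in a hypothetical restart tree (names and timings are made up)."""
    name: str
    restart_seconds: float = 0.0  # cost of restarting this node by itself
    children: list["Component"] = field(default_factory=list)
    parent: Optional["Component"] = field(default=None, repr=False)

    def add(self, child: "Component") -> "Component":
        """Attach a subsystem beneath this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def restart(self) -> float:
        """Restart this node and everything beneath it; return seconds spent."""
        spent = self.restart_seconds
        for child in self.children:
            spent += child.restart()
        return spent


def recover(failed: Component,
            still_broken: Callable[[Component], bool]) -> tuple[float, str]:
    """Restart the smallest subtree first, escalating upward only if needed.

    `still_broken(node)` is an assumed health check that reports whether the
    fault persists after `node` has been restarted.
    """
    spent = 0.0
    node: Optional[Component] = failed
    while node is not None:
        spent += node.restart()
        if not still_broken(node):
            return spent, node.name
        node = node.parent
    raise RuntimeError("restarting the whole system did not clear the fault")


# Illustrative tree: a ground station with two independently restartable parts.
station = Component("ground_station", restart_seconds=5.0)
radio = station.add(Component("radio_tuner", restart_seconds=2.0))
tracker = station.add(Component("antenna_tracker", restart_seconds=3.0))

full_reboot = station.restart()                       # restarts every node
local_fix = recover(radio, lambda node: False)        # cured by a local restart
escalated = recover(radio, lambda node: node is radio)  # needs the parent too
```

In this toy tree a full reboot costs the sum of every node's restart time, while a fault confined to one subsystem costs only that subsystem's time. The saving depends entirely on how cleanly the real subsystems can be isolated from one another.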

Possible Solutions

Characteristics of an autonomic computing system (as IBM sees them):
  • Possesses a sense of self.
  • Adapts to changes in its environment.
  • Strives to improve its performance.
  • Heals when it is damaged.
  • Defends itself against attackers.
  • Exchanges resources with unfamiliar systems.
  • Communicates through open standards.
  • Anticipates users' actions.

Experimental Systems

Océano, an experimental autonomic system under construction at IBM's Thomas J. Watson Research Center, includes the first two characteristics as well as a third: it actively strives to improve its performance. Océano manages a complex of servers using optimization algorithms to figure out the best way to distribute tasks and the cheapest places to store data. It tries to anticipate demand and to make the computers in its command ready just before they are needed.

Researchers at Hewlett-Packard Labs are working on similar projects, which they refer to as planetary computing. At U.C. Berkeley, the buzzword is "recovery-oriented computing," or ROC (as in "solid as a"). But David Patterson and others in the Berkeley group are not entirely comfortable with the idea of computer systems that hide all their complex operations from their human operators. To Patterson, the goal is not to build a HAL 9000, whose unpredictable behavior can be stopped only by pulling the plug, but rather to imitate the computer of the starship Enterprise, whose innards are still accessible--and comprehensible--to engineers.

As an example of recovery-oriented computing, the Berkeley group has built a prototype e-mail system with an "undo" feature. The program surrounds a standard e-mail server and records all its activity and any changes made to its configuration. If an administrator accidentally deletes a user's mailbox, or sets up a filter that tosses out good mail with spam, or if a virus gets loose and starts sending out mail willy-nilly, the operator can restore the server and all its mail by "rewinding" the system back to an earlier point, repairing the mistake and then "replaying" the events in fast motion. This approach consumes a lot of disk space. For the 1,270 users in Patterson's department, for example, the system will use three 120-gigabyte disks in the course of a year. But disks are now so inexpensive ($180 each in Berkeley's case) that the benefits easily justify the costs.
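The rewind, repair and replay cycle can be sketched in a few lines. This is a toy model, not the Berkeley prototype: the event format, the two event kinds and the function names are assumptions made for illustration. It treats the recorded activity as an ordered log, and "undo" as rebuilding the mail store from that log with the mistaken event left out.

```python
# Toy event log: each event is (kind, user, payload). The two kinds used here,
# "deliver" and "delete_mailbox", are hypothetical, not the prototype's format.
Mailboxes = dict[str, list[str]]
Event = tuple[str, str, str]


def apply_event(state: Mailboxes, event: Event) -> None:
    """Apply one recorded event to the mail store in place."""
    kind, user, payload = event
    if kind == "deliver":
        state.setdefault(user, []).append(payload)
    elif kind == "delete_mailbox":
        state.pop(user, None)
    else:
        raise ValueError(f"unknown event kind: {kind}")


def replay(events: list[Event]) -> Mailboxes:
    """Rebuild the mail store from scratch by replaying events in order."""
    state: Mailboxes = {}
    for event in events:
        apply_event(state, event)
    return state


def rewind_repair_replay(log: list[Event], bad_index: int) -> Mailboxes:
    """Rewind to before log[bad_index], drop that event, replay the rest."""
    repaired_log = log[:bad_index] + log[bad_index + 1:]
    return replay(repaired_log)


log: list[Event] = [
    ("deliver", "alice", "hello"),
    ("deliver", "bob", "lunch?"),
    ("delete_mailbox", "alice", ""),      # the administrator's mistake
    ("deliver", "alice", "meeting at 3"),
]

damaged = replay(log)                      # alice has lost "hello"
restored = rewind_repair_replay(log, bad_index=2)
```

After the repair, the mailbox holds both the message that predates the mistake and the one that arrived after it, which is the property that separates this approach from simply restoring a backup. Keeping the full log is what consumes the disk space: by the figures above, three $180 disks come to $540 a year, or roughly 43 cents per user across 1,270 users.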

And that perhaps is the real significance of computer scientists' newfound appreciation for just how flaky most computer systems are. For the past 30 years, the mantra of the field was faster, bigger and cheaper. Slowly, and finally, the goal may be changing to easier, more reliable and less deserving of a swift kick.


