Autonomic Computing

Programs crash, people make mistakes, networks grow and change. That's life, and computer scientists are finally building systems that can deal with it.

By W. Wayt Gibbs

Computer hardware increases in speed and capacity by factors of thousands each decade; computer software piles on new features and fancier interfaces nearly as fast. So why do computers still waste our time and drive us crazy?

One quarter of those under age 25 polled in a recent British survey said they had kicked their computers or seen friends do so. And the cost of sophisticated networked systems (on which nearly all large organizations are coming to depend) is now dominated not by ever-cheaper hardware and software but by the rising salaries of the gurus who can keep it all up and running. According to a study published in March 2002 by researchers at the University of California at Berkeley, labor costs outstrip equipment costs by factors of three to 18, depending on the type of system. One third to one half of the total budget is spent preventing or recovering from crashes. And no wonder: a system failure at a brokerage or credit-card authorization center can run up millions of dollars per hour in lost business.

Computer Crisis

"There is no less than a crisis today in three areas: cost, availability and user experience," says Robert Morris, director of IBM's Almaden Research Laboratory. At a conference in Almaden, Calif., last month, research leaders from most of the largest computer companies and several universities agreed on the problem as it was sketched out in a "manifesto" released last October by IBM. "The growing complexity of the I.T. infrastructure threatens to undermine the very benefits information technology aims to provide," the anonymously authored manifesto asserts. The sheer number of computer devices is forecast to rise at a compound rate of 38 percent a year; most of these devices will be connected to one another and to the Internet. "Up until now, we've relied mainly on human intervention and administration to manage this complexity," the manifesto continues. "Unfortunately, we are starting to gunk up the works."

There is less agreement on the solution. IBM argues in its treatise that the goal should be "autonomic" computer systems analogous to the involuntary nervous system that allows the human body to cope with environmental change, external attack and internal failures. "Our bodies have great availability," Morris observes. "I have soft errors all the time: my memory fails once in a while, but I don't 'crash.' My whole body doesn't shut down when I cut a finger."

Morris and the other heads of IBM's autonomic computing research effort have more in mind than just fault tolerance. The manifesto lists eight defining characteristics (listed below) of autonomic computing systems. Some have already been demonstrated in prototypes.

An autonomic system must have a sense of self, for example. It must keep track of its parts, some of which may be borrowed from or lent out to other systems. And it must keep its public and private parts segregated. At Columbia University, Gail Kaiser and colleagues in the Programming Systems Lab have worked out ways to add software probes, gauges and configuration controls to certain kinds of existing systems so that they can be monitored, tuned and even repaired automatically rather than by highly paid engineers.

Autonomic systems should also be able to heal, to recover from damage by some means other than a suicidal crash. Armando Fox and co-workers at Stanford University have demonstrated one way to accomplish this. Fox redesigned a satellite ground station system so that every subsystem can be rebooted independently if--or rather, when--it gets knocked offline. The system still goes down occasionally, but now it can resume operation in six seconds rather than 30. The same principle, called recursive restartability, could be applied to many kinds of complex systems to prevent small glitches from accumulating and cascading into full-blown outages.
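To make the idea concrete, here is a minimal sketch in Python of independent restarts. It is not Fox's ground station software: the `Subsystem` and `Station` classes and the subsystem names ("tracker", "radio", "scheduler") are hypothetical, and a real reboot is reduced to resetting a flag. The point it illustrates is only that a failed part is restarted on its own while its healthy neighbors are left running.

```python
from dataclasses import dataclass, field


@dataclass
class Subsystem:
    """One independently restartable part of a larger system (hypothetical)."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # Stand-in for a real reboot: clear the fault and count the restart.
        self.healthy = True
        self.restarts += 1


@dataclass
class Station:
    """Holds subsystems and reboots only the ones that have failed."""
    subsystems: dict[str, Subsystem] = field(default_factory=dict)

    def add(self, name: str) -> None:
        self.subsystems[name] = Subsystem(name)

    def fail(self, name: str) -> None:
        # Simulate a fault knocking one subsystem offline.
        self.subsystems[name].healthy = False

    def recover(self) -> list[str]:
        """Restart each failed subsystem on its own; return the names restarted."""
        restarted = []
        for sub in self.subsystems.values():
            if not sub.healthy:
                sub.restart()
                restarted.append(sub.name)
        return restarted


station = Station()
for name in ("tracker", "radio", "scheduler"):  # assumed names, for illustration
    station.add(name)

station.fail("radio")
restarted = station.recover()
```

In this toy run only the failed "radio" subsystem is rebooted; the other two never restart, which is the property that lets recovery take less time than rebooting the whole system.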

Possible Solutions

CHARACTERISTICS OF AUTONOMIC SYSTEMS

(as IBM sees them)

Possesses a sense of self.

Adapts to changes in its environment.

Strives to improve its performance.

Heals when it is damaged.

Defends itself against attackers.

Exchanges resources with unfamiliar systems.

Communicates through open standards.

Anticipates users' actions.

Experimental Systems

Océano, an experimental autonomic system under construction at IBM's Thomas J. Watson Research Center, includes the first two characteristics as well as a third: it actively strives to improve its performance. Océano manages a complex of servers using optimization algorithms to figure out the best way to distribute tasks and the cheapest places to store data. It tries to anticipate demand and to make the computers in its command ready just before they are needed.

Researchers at Hewlett-Packard Labs are working on similar projects, which they refer to as planetary computing. At U.C. Berkeley, the buzzword is "recovery-oriented computing," or ROC (as in "solid as a"). But David Patterson and others in the Berkeley group are not entirely comfortable with the idea of computer systems that hide all their complex operations from their human operators. To Patterson, the goal is not to build a HAL 9000, whose unpredictable behavior can be stopped only by pulling the plug, but rather to imitate the computer of the starship Enterprise, whose innards are still accessible--and comprehensible--to engineers.

As an example of recovery-oriented computing, the Berkeley group has built a prototype e-mail system with an "undo" feature. The program surrounds a standard e-mail server and records all its activity and any changes made to its configuration. If an administrator accidentally deletes a user's mailbox, or sets up a filter that tosses out good mail with spam, or if a virus gets loose and starts sending out mail willy-nilly, the operator can restore the server and all its mail by "rewinding" the system back to an earlier point, repairing the mistake and then "replaying" the events in fast motion. This approach consumes a lot of disk space. For the 1,270 users in Patterson's department, for example, the system will use three 120-gigabyte disks in the course of a year. But disks are now so inexpensive ($180 each in Berkeley's case) that the benefits easily justify the costs.
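The rewind, repair and replay cycle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Berkeley prototype: the `UndoableStore` class, the event tuples and the two event kinds ("deliver" and "delete_mailbox") are invented here, and the mail server is reduced to a dictionary mapping mailbox names to lists of messages. The sketch also shows why the approach is hungry for disk space: every event has to be kept so that it can be replayed later.

```python
import copy


def apply(state: dict, event: tuple) -> None:
    """Apply one logged event to the mail store (mailbox name -> messages)."""
    kind, mailbox, payload = event
    if kind == "deliver":
        state.setdefault(mailbox, []).append(payload)
    elif kind == "delete_mailbox":
        state.pop(mailbox, None)


class UndoableStore:
    """Toy mail store that logs every event so mistakes can be undone."""

    def __init__(self) -> None:
        self.checkpoint: dict = {}  # state at the earlier point we can rewind to
        self.log: list = []         # every event recorded since the checkpoint
        self.state: dict = {}       # current, live state

    def record(self, event: tuple) -> None:
        self.log.append(event)
        apply(self.state, event)

    def undo(self, bad_index: int) -> None:
        # Rewind: start again from a copy of the checkpointed state.
        state = copy.deepcopy(self.checkpoint)
        # Repair: drop the mistaken event from the log.
        self.log = [e for i, e in enumerate(self.log) if i != bad_index]
        # Replay: reapply every remaining event in its original order.
        for event in self.log:
            apply(state, event)
        self.state = state


store = UndoableStore()
store.record(("deliver", "alice", "hello"))
store.record(("delete_mailbox", "alice", None))  # the administrator's mistake
store.record(("deliver", "alice", "later"))
store.record(("deliver", "bob", "hi"))

store.undo(bad_index=1)
```

After the undo, the mailbox deleted by mistake is back with both of its messages, and mail that arrived after the mistake is preserved because it was replayed rather than discarded.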

And that perhaps is the real significance of computer scientists' newfound appreciation for just how flaky most computer systems are. For the past 30 years, the mantra of the field was faster, bigger and cheaper. Slowly, and finally, the goal may be changing to easier, more reliable and less deserving of a swift kick.
