Traditional engineering approaches to raising the reliability of computer systems largely ignore the possibility of operator error. But in many cases, human mistakes account for more downtime (time during which the system is not functioning) than hardware problems or software bugs. The pie chart (right) depicts a breakdown of typical failure causes for three Internet sites.
For many industries, computer system downtime can be costly or even life-threatening. Engineers call the proportion of time a computing system functions correctly its availability, which is measured in "nines" (graph). A system that runs without crashing 99.999 percent of the time, for example, has an availability of "five nines," which corresponds to about two hours of downtime over 25 years of operation. Rather than seeking to reduce the number of failures, proponents of recovery-oriented computing advocate methods to shorten the time needed to bring systems back online. Boosting availability from two nines (99 percent) to five nines, for instance, shrinks total downtime from roughly 90 hours to about five minutes a year.
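The arithmetic behind the "nines" can be checked with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, a year is taken as 365 days, and the second function uses the common steady-state relation availability = MTTF / (MTTF + MTTR), which the text above implies but does not state. The failure and repair times in the example are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_hours_per_year(nines: int) -> float:
    """Expected yearly downtime for an availability of `nines` nines.

    Two nines means 99 percent available, i.e. 1 percent unavailable.
    """
    unavailable_fraction = 10.0 ** -nines
    return HOURS_PER_YEAR * unavailable_fraction


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure (MTTF)
    and mean time to recover (MTTR). Assumed textbook relation."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Yearly downtime for two through five nines.
for n in range(2, 6):
    hours = downtime_hours_per_year(n)
    print(f"{n} nines: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")

# Hypothetical system that fails about once every 1,000 hours:
# the failure rate is unchanged, only the recovery time shrinks.
print(f"slow recovery (10 h):   {availability(1000, 10):.5f}")
print(f"fast recovery (0.01 h): {availability(1000, 0.01):.5f}")
```

Two nines works out to 87.6 hours of downtime a year, and five nines to a little over five minutes a year, or a bit more than two hours over 25 years. The last two lines illustrate the recovery-oriented argument: with the same number of failures, cutting recovery time alone moves the hypothetical system from about two nines to about five.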