On August 14, 2003, at least 50 million people lost the ability to cool their homes, refrigerate food, light offices, compute and commute, along with the myriad other necessities electricity provides in the modern world. A failed power line in Ohio set off a cascade of events that triggered the largest blackout in North American history and crippled much of the northeastern U.S. for two days.

In the year following the disaster, the U.S.–Canada Power System Outage Task Force convened to determine its causes.

Just prior to the 10th anniversary, Scientific American spoke with electrical engineer Jeff Dagle, a member of the task force and a specialist in power-grid resilience at Pacific Northwest National Laboratory, to find out what we know now that we didn’t then, and whether similar mishaps could still happen.

[An edited transcript of the interview follows.]

What happened on August 14, 2003?
The blackout itself, which was a big one, affected 50 million people and 60,000 megawatts with an estimated economic impact of $10 billion. It started at 3:05 P.M. on August 14. A power line tripped [went offline] in northern Ohio. It was actually carrying less [electricity] than it was capable of so it should not have tripped, but trees under the line had gotten too close to the wire. It's just energized aluminum suspended in a wire and it relies on the air to provide insulation. If something gets too close it will arc and short-circuit.

So we lost a 345-kilovolt line in northern Ohio. Normally the grid is designed to have enough resilience built into it that losing a single line doesn't have any impact. But on that day there was also a problem with the software in the control center. The utility [First Energy] that owns that line would normally be looking and taking preventative action, but they didn't even know the line had tripped.

Because of the loss of that first line, about 30 minutes later an adjacent line tripped due to a similar cause. As with any heavily loaded transmission line, the heat from the current heats the metal and it expands. The line sags down closer to the things below it. So it was within its rating but trees [again] had been allowed to grow too tall. Fifteen minutes later a third line tripped.

Now we have three key lines over a period of about 45 minutes that tripped offline. The grid isn't designed for that level of redundancy.

At this point, the power is still trying to flow. So you get this cascading sequence of events that picked up speed. Shortly after 4 P.M. this cascade progresses outside of northern Ohio. So it blacks out Akron and Cleveland and then works its way around to Detroit and works around Lake Erie, taking out Toronto. Then it works around to the northwest and much of New York State trips off along with a big chunk of Ontario, making it the largest blackout ever in North America.
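[Illustration, not part of the interview: the idea that a grid can ride through one lost line but not three can be sketched with a toy model. Everything here is an assumption made for illustration: the function name `simulate_cascade`, the line capacities, the total flow and the rule that surviving lines share the flow equally. Real grids redistribute power according to network physics, and the first lines lost in 2003 tripped on tree contact while within their ratings, not on overload.]

```python
def simulate_cascade(capacities_mw, total_flow_mw, initial_outages):
    """Return the indices of lines still in service once the cascade settles.

    Toy assumption: the total flow is fixed and the surviving lines
    share it equally. This is not how a real power-flow study works.
    """
    in_service = [i for i in range(len(capacities_mw)) if i not in initial_outages]
    while in_service:
        flow_per_line = total_flow_mw / len(in_service)
        overloaded = [i for i in in_service if flow_per_line > capacities_mw[i]]
        if not overloaded:
            break  # every remaining line is within its rating
        # Trip the weakest overloaded line, then redistribute and check again.
        weakest = min(overloaded, key=lambda i: capacities_mw[i])
        in_service.remove(weakest)
    return in_service


# Hypothetical six-line corridor carrying 380 MW in total.
capacities = [100, 100, 110, 120, 130, 140]
print(simulate_cascade(capacities, 380, {0}))        # one line lost
print(simulate_cascade(capacities, 380, {0, 1, 2}))  # three lines lost
```

[With these made-up numbers, losing one line leaves every other line within its rating, while losing three pushes the remaining lines over their limits one after another until none is left.]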

You were on the committee that investigated the event. Were the trees the main problem?
Another key root cause was this loss of situational awareness, which went a little deeper than just a software glitch. We were curious why the operations center didn't start to put the clues together. In fact, it wasn't until the lights went out in the control room that they really understood the grid was in peril.

They were getting a lot of phone calls and activity suggesting that there was a problem but they didn't connect the dots. They allowed the system to fail over that span of about an hour.

There were other problems, too.
There is also this organizational layer above the utilities made up of reliability coordinators. That was a lesson learned from the 1996 west coast blackout. You need an organization to look across all the utilities. In this area there were two big utilities—First Energy and [American Electric Power]—and two reliability coordinators—[the Midwest Independent System Operator, or MISO] and [the Pennsylvania, Jersey, Maryland Power Pool]. MISO had its own software glitch that prevented their computer tool from assessing the overall risk to the system.

The fourth main cause was an inadequate understanding of the system. There was a study done back in the 1980s that kind of predicted this blackout. It predicted that if you had the grid operating where power was flowing from south to north, as it was on August 14, and voltage fell below a certain point and you lost key lines, then you would get this cascading sequence of events that would cause problems around Lake Erie and in New England. So the report recommended that we not allow the voltage to go below a certain threshold, roughly 95 percent. Yet the voltage was being operated below that regularly on some key stations.
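[Illustration, not part of the interview: the kind of limit the study recommended can be expressed as a simple check. This is a minimal sketch under stated assumptions: that "95 percent" means 95 percent of nominal voltage, that the stations in question are on the 345-kilovolt system, and that the station names and readings below are invented. The function `stations_below_threshold` is hypothetical, not a utility tool.]

```python
def stations_below_threshold(readings_kv, nominal_kv=345.0, threshold=0.95):
    """Return the stations whose measured voltage is under the limit.

    Assumes the threshold is a fraction of nominal voltage.
    """
    limit_kv = nominal_kv * threshold
    return {name: kv for name, kv in readings_kv.items() if kv < limit_kv}


# Invented readings for three hypothetical stations, in kilovolts.
readings = {"Station A": 340.0, "Station B": 325.0, "Station C": 330.0}
print(stations_below_threshold(readings))
```

[Under these assumptions the limit works out to roughly 327.75 kilovolts, so only the 325-kilovolt reading would be flagged.]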

Why were they operating at the low voltage?
When we asked First Energy about that, their response was: “What study?” It had gotten lost in the passage of time and mergers and acquisitions. The current folks just weren't aware of this type of limit on the system. Had they maintained that defensive operating stance recommended in the study, this sequence couldn't have even started in the first place. Together, all these causes conspired to cause the blackout of that day.

Do we not have the right rules or did people not follow them? Why couldn't the system react fast enough?
It was kind of a combination of both. So there were these [tree-trimming] shortcomings but there wasn't really a national standard at that time. It was up to each individual utility to make its own decision about how best to maintain reliability. It was tempting to do deferred maintenance if you needed to find a way to trim the budget and not break any rules. As a result, the 2005 Energy Policy Act gave [the Federal Energy Regulatory Commission] the ability to regulate reliability, and now they have mandatory reliability standards with financial penalties for noncompliance.

I can cut a human operator a lot of slack. A human operator needs time to assess the situation and figure out the appropriate response, then take action. But if it rolls out over the course of an hour, you've got to wonder: Were they just so dependent on their automated alarms? Does that boil down to training or complacency? I'm not sure what the root cause there was.

Were there any big surprises that came out of your investigation?
Initially, people were surprised by the scope of the blackout. That was sort of shocking. People, including me, didn't think a blackout that big was likely—that some seemingly benign root causes conspired to suddenly leave 50 million people out of power.

So what were the big fixes to the grid in the wake of this massive blackout?
The most significant thing is the mandatory reliability requirements and new standards for vegetation management.

We've also come a long way in terms of technology. I remember touring a control room recently. The senior vice president in charge of operations pointed to their big new display wall with the digital display of their system. He smiled and said: “That's blackout money.” Utilities invested significant sums of money to spruce up control rooms. They saw the lesson of the lack of situational awareness.

There's also a technology I'm personally involved with called synchrophasors. It's better at measuring the grid to really understand what is happening. It takes advantage of a common time reference. Equipped with GPS, this technology gives you microsecond accuracy of time across the whole power system. … The measurement it allows gives a direct indicator of the stress on the grid. The analogy I use is it's like going from x-rays to MRIs. In the old days utilities gathered data every four seconds. Some were as fast as every two seconds. That's adequate if you're just feeding information to a human operator but it's not time synchronized. With the new [synchrophasors] installed in substations, these things use higher sampling rates. They run about 30 samples per second. Some are even faster than that.
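[Illustration, not part of the interview: a rough sketch of what the sampling rates above mean in practice, and of one thing time-aligned measurements make possible. The phase-angle comparison is an assumption about what the "direct indicator of the stress" refers to; comparing voltage angles between two substations is a common use of synchrophasor data, but the interview does not spell it out. Function names and sample values are hypothetical.]

```python
import cmath
import math


def samples_per_minute(samples_per_second):
    """Convert a sampling rate into samples collected per minute."""
    return samples_per_second * 60


def angle_difference_deg(phasor_a, phasor_b):
    """Phase-angle difference (a minus b) in degrees, wrapped to [-180, 180).

    Only meaningful if both phasors were measured at the same instant,
    which is what the common GPS time reference is for.
    """
    diff = math.degrees(cmath.phase(phasor_a) - cmath.phase(phasor_b))
    return (diff + 180.0) % 360.0 - 180.0


# One reading every four seconds versus 30 readings per second.
print(samples_per_minute(1 / 4), samples_per_minute(30))

# Two invented, time-aligned voltage phasors (per-unit magnitude, angle in degrees).
bus_a = cmath.rect(1.00, math.radians(10.0))
bus_b = cmath.rect(0.98, math.radians(-20.0))
print(angle_difference_deg(bus_a, bus_b))
```

[Without a shared time stamp, the two angles would be measured at slightly different moments and their difference would not mean much; that is the gap the older, unsynchronized data left open.]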

But there were no big changes to the grid itself?
The American Recovery and Reinvestment Act of 2009 provided some grant money for smart-grid technologies. But in terms of the grid itself, not so much. There were no fundamental changes in the way the grid is operated. You still have power lines and transformers and, mostly, central generation [power plants]. At the transmission level, it's pretty similar technology to what we had 10 years ago.

Has the grid become just too big to handle or is the answer to make an even bigger grid?
The bigger the grid is, the more reliable it is. Think about how the 2003 blackout progressed. The eastern grid has a peak of 800,000 megawatts but it was only 60,000 megawatts that got knocked out. It was only a small chunk of the grid. The answer is to compartmentalize part of the grid.

So would we have a unified grid someday across all of North America to make the system even more secure? The question on that is the economics of building out the infrastructure. Say we do build out wind power in the Dakotas and Wyoming, right along the seam of the eastern and western grid, and build transmission to get that wind [-generated electricity] to cities? Maybe someday it would make sense to connect east and west but there is no compelling reason right now to build a bunch of power lines because the cost-benefit just isn't there.

But distributed generation, like solar panels on people's rooftops, seems to present a fundamental challenge to this centralized-grid concept?
The infrastructure itself is a very complex machine. And we're integrating a lot of variable generation, wind and solar and things like that. We're seeing a lot more natural gas because of its price and the retirement of coal-fired power plants. There are also new types of loads we haven't seen before [like electric cars charging at night.] There are a lot of changes we're going to see going forward and we've already been seeing. That change creates risk.

Are we on the threshold of some sort of real major disruption if the price of photovoltaics drops to the point where nobody wants to buy power from utility companies and just self-generate? What is the grid going to look like then?

Personally, I believe we want a big grid with the ability to pool resources. I don't believe we're going to abandon that, but that model is going to face some business challenges. Utilities that operate the system are going to be facing some fundamental challenges to continue to do so. It's going to take a lot of effort by everyone from regulators to customers to suppliers to work that out.

When will a megablackout happen again?
There is a certain periodicity between these blackouts. A certain amount of time passes and maybe people forget and get more complacent. Before 2003 there was the August 1996 blackout on the west coast. And there was the San Diego [Arizona–California] blackout in 2011.

I would never say we're never going to have another blackout. We are working on technologies to minimize the chances of a blackout. The objective is to increase reliability. I just hope we put in place the right sort of technologies to minimize the impact.