Robert W. Cummings
The electric industry can learn many lessons from the analysis of both system disturbances and blackouts. It is extremely important to analyze both to identify trends in system behavior that are initiating, causal, contributory, or merely coincidental. Those trends can help prioritize efforts to eliminate, or at least reduce, the probability of recurrence of such events.
For purposes of this discussion, the analyses are limited to system events on the bulk power system (BPS), not the distribution system. Customer outages caused by weather events, such as storms, are not addressed. However, at times the behavior and performance of distribution systems or their loads can factor into a BPS event, and the article discusses such occurrences as well.
The major difference between a system disturbance and a blackout is the loss of load; system disturbances do not necessarily result in load loss, but blackouts do. For example, a major system disturbance in 2007 resulted in the tripping of nine generators totaling 4,457 MW across three states and three North American Electric Reliability Corporation (NERC) regions, but only 37 MW of local load was lost. Although this was a serious disturbance (it was the largest generation loss in the Eastern Interconnection other than in the major blackouts in 1965 and 2003), the event was not widely reported, and it was largely out of the public eye.
It is also incredibly important to cast a very wide net when preserving and collecting data. Initial data preservation requests must be made on an interconnection-wide basis, while data collection requests can then focus on the area of highest impact. In the 2007 disturbance, most of the generation was lost in Indiana and Illinois, but four generators tripped or rapidly reduced output (174 MW total) in New York City as the result of turbine controls being sensitive to the system frequency deviation (a minimum frequency of 59.863 Hz was recorded) caused by the tripping of the other generators. Those same generators had tripped in the 2003 blackout with no apparent reason discovered at that time (mystery solved).
Another very important aspect of event analysis is to constantly be on the alert for repeated occurrences of problems, even in small disturbances. Early detection of trends of minor problems can prevent them from becoming contributory to larger system disturbances. The sensitivity of the New York City generators to frequency deviations cited previously is a good example.
So, what is important in analyzing an event? Everything!
Some basic tenets of those analyses are the following:
The following four basic types of events occur in a system disturbance:
The analysis of an event should always be aimed at determining “actionable” root causes. Sometimes, too deep an analysis may result in a root cause that is not correctable. For instance, one could argue that a root cause of the 2003 blackout was greed: someone wanted to save money on tree trimming and did not adequately trim the trees in their rights-of-way. The blackout was initiated by lines sagging into those trees.
Protection systems are notoriously blamed when they misoperate (fail to operate when they should or operate unnecessarily). Wherever protection systems are concerned, correct wiring should be verified, and relay, digital fault recorder, or PMU data should be used to help determine what the relay actually “saw.”
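As a hedged illustration of using recorded quantities to determine what a relay “saw,” the short Python sketch below computes an apparent impedance from hypothetical voltage and current phasors and tests it against a simple self-polarized mho zone-1 characteristic. The phasors and the 25-Ω, 75° reach setting are illustrative assumptions, not values from any actual event.

```python
# A minimal sketch of checking what a distance relay "saw": compute the
# apparent impedance from recorded voltage and current phasors and compare it
# against a simple mho zone-1 reach. Phasors and settings are illustrative.
import cmath

v_phasor = cmath.rect(63.5e3, cmath.pi / 180 * -5)    # line-to-neutral volts
i_phasor = cmath.rect(2.4e3, cmath.pi / 180 * -80)    # amperes

z_apparent = v_phasor / i_phasor                      # ohms seen by the relay
zone1_reach = cmath.rect(25.0, cmath.pi / 180 * 75)   # hypothetical 25-ohm reach

# Crude mho check: does the apparent impedance fall inside the circle whose
# diameter runs from the origin to the zone-1 reach point?
center = zone1_reach / 2
inside_zone1 = abs(z_apparent - center) <= abs(center)
print(f"Z_apparent = {z_apparent:.1f} ohms, inside zone 1: {inside_zone1}")
```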
Remember too that sometimes system elements can and do fail unexpectedly. Protection systems are often described as failing. In today’s age of digital relays, equipment failures themselves are rarer. Often a protection system “failure” can be traced back to a bad setting, a miscalculation of the setting, or a misapplication of the protection feature itself, not to an equipment failure.
There are several phrases used in explaining system events that need to be questioned, corrected, or become part of a recommended corrective action. The following are examples of phrases that should be questioned:
Yes, Alice, we are still going to have a transmission system in the future to serve large load centers, such as New York City and Los Angeles…there is simply not enough physical space for sufficient local solar and wind resources to serve those load centers.
Event analysis as a science has significantly changed since the 1965 Northeast blackout. Back then, there were no digital relays, oscillography was far more primitive (on paper recordings), the timing in a sequence of events was rarely measured down to milliseconds, and the industry was just getting its feet wet in digital simulations, which required the largest mainframe computers to perform even basic power flow and dynamic analyses.
Over the years, the electric system has continued to evolve as renewable resources, such as solar and wind power, rapidly increase and battery energy storage systems are added to the mix. All of those resources are coupled to the system by power electronics (inverters) and create new challenges for the engineering and operation of the electric system. When inverter-based resources (IBRs) supplant retiring fossil-fueled generation (rotating machines), there is an inherent reduction of system inertia and synchronizing torque. Reduced fault current may require the redesign and modification of transmission protection systems, and the distributed nature of the new resources means that the well-understood pattern of flows from central generating stations to load centers no longer tells the whole story. In addition, the new resources often require new transmission lines to collect and deliver their output.
The very nature of load composition and behavior is also changing. Traditionally, loads were largely resistive with some synchronous motors. However, models for those loads are now obsolete for accurate power flow and system dynamic simulations. The new load models must be capable of representing chargers for everything from phones to electric vehicles, LED lighting, variable-speed drive motors, microgrids, and the impacts of rooftop solar panels. The combination of embedded resources on the distribution system has made the load far less predictable and somewhat schizophrenic.
As the system resources and loads have changed, so too must the forensic analysis techniques for system disturbances. With the high penetration of IBRs and the changing loads, analysis depends less on the physics of rotating machines and progressively more on the control systems of all the power electronics coupling the loads and resources to the system. Today’s power system engineer had better be well versed in controls engineering, particularly the controls of IBRs and electronically coupled loads.
The following discussion presents a sample of system disturbances over the last 20 years, highlighting what lessons were learned and how analyses have had to change.
Things had improved somewhat by the time of the 2003 Northeast blackout, with significant advances having been made in computer-based power flow and transient stability analyses. The NERC Reliability Coordinators were new functions, and their state estimation systems were still under development. There were several high-speed digital fault recorders and digital relays in service, as well as some sequence-of-events recorders, but none of them was synchronized to a time standard (such as a GPS synchronizing signal). That made it a monumental task to place the thousands of events during the disturbance (lines and generators tripping, relay actions, and so on) into a reasoned, detailed sequence of events and to verify the causes of each of the trips. For the most part, operator logs, voice recordings, and data from 4-s-scan-rate supervisory control and data acquisition (SCADA) systems were the best available. But even for those data, the time stamps were unreliable; time stamps were often added when the computers “logged in” the data at the control centers.
Several man-months were spent on “daisy chaining” and correcting time stamps for the sequence of events. Fortunately, GPS-synchronized digital disturbance recorders had recently been installed on the Michigan–Ontario transmission interface, giving the engineers a time anchor for future disturbances (see Figure 1).
The same event could often be observed in the current or voltage records at two nearby digital fault recorder locations, allowing a time offset to be calculated between the two locations. That process was then repeated for a third and fourth location (and so on), allowing a crude but effective sequence of events to be created. The process was repeated for several parallel paths and was, eventually, verified by power flow and transient stability simulations.
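A minimal sketch of that daisy-chaining idea, assuming synthetic recordings and a hypothetical 100-sample/s rate: cross-correlate a disturbance signature seen at two unsynchronized recorders to estimate the time offset between their clocks.

```python
# Estimate the time offset between two unsynchronized recorders by
# cross-correlating a common disturbance signature seen at both locations.
# The signals and sample rate are illustrative, not actual 2003 recordings.
import numpy as np

def estimate_offset(sig_a, sig_b, dt):
    """Return the time (s) by which sig_b lags sig_a."""
    a = sig_a - np.mean(sig_a)
    b = sig_b - np.mean(sig_b)
    xcorr = np.correlate(a, b, mode="full")
    lag = (len(b) - 1) - np.argmax(xcorr)   # samples by which sig_b lags sig_a
    return lag * dt

# Illustrative data: the same disturbance signature recorded 0.25 s apart.
dt = 0.01                                    # 100 samples/s (hypothetical)
t = np.arange(0, 10, dt)
signature = np.exp(-((t - 4.0) ** 2) / 0.1)  # shared event "wave shape"
rec_a = signature + 0.01 * np.random.randn(t.size)
rec_b = np.roll(signature, 25) + 0.01 * np.random.randn(t.size)  # 0.25 s later

print(f"Estimated offset: {estimate_offset(rec_a, rec_b, dt):+.2f} s")
```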
Forensic power flow analysis was still an inexact science. For a normal power flow solution, one knows the real and reactive loads, generation, and shunt reactive devices (reactors and capacitors) at each bus, and the power flow program solves for the voltages, phase angles, and line flows on the system elements. In a forensic analysis, voltages, loads, and line flows may, at least, be measured and recorded during the event, and the power flow program still solves for voltages and phase angles. The engineer must then iteratively adjust the real and reactive loads until the solved voltages and flows more closely match the measured values. This process must focus on one set of relatively synchronized readings and must be repeated to create a picture of the disturbance as it progresses. In the 2003 disturbance, forensic power flow analysis was possible for the hour before 4:10 p.m. Eastern Daylight Time (EDT), while several lines were tripping in the sequence of events.
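The following Python sketch illustrates the iterative fitting idea on a toy three-bus dc power flow. The network, the “measured” flows, and the use of SciPy’s least-squares routine are illustrative assumptions rather than the actual 2003 procedure, but the principle of adjusting assumed loads until solved flows match recorded ones is the same.

```python
# Forensic power flow fitting sketch: adjust assumed bus loads until solved
# line flows match the "measured" ones. A dc power flow keeps the toy small;
# in practice a full ac power flow engine is iterated the same way.
import numpy as np
from scipy.optimize import least_squares

# Line list: (from_bus, to_bus, reactance in per unit); bus 0 is the slack.
lines = [(0, 1, 0.10), (1, 2, 0.20), (0, 2, 0.25)]
n_bus = 3

def dc_flows(injections):
    """Solve the dc power flow and return per-unit flows on each line."""
    B = np.zeros((n_bus, n_bus))
    for i, j, x in lines:
        B[i, i] += 1 / x
        B[j, j] += 1 / x
        B[i, j] -= 1 / x
        B[j, i] -= 1 / x
    theta = np.zeros(n_bus)
    theta[1:] = np.linalg.solve(B[1:, 1:], injections[1:])  # slack angle = 0
    return np.array([(theta[i] - theta[j]) / x for i, j, x in lines])

measured_flows = np.array([0.62, 0.18, 0.41])   # "recorded" line flows (p.u.)

def mismatch(loads):
    # Bus 0 is the slack generator; buses 1 and 2 carry the unknown loads.
    injections = np.array([loads.sum(), -loads[0], -loads[1]])
    return dc_flows(injections) - measured_flows

fit = least_squares(mismatch, x0=np.array([0.5, 0.5]))
print("Estimated bus loads (p.u.):", np.round(fit.x, 3))
```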
Some of the key takeaways from that 2003 disturbance include the following:
The 2004 Westwing disturbance in Arizona highlighted the potential problems caused by a single point of failure within a transmission protection system. A single-line-to-ground fault on the 230-kV system was not cleared by the protection system at the Westwing substation because an auxiliary relay failed to trip a 230-kV breaker. The result was a protracted fault that lasted almost 38.9 s! The fault was finally cleared by the overreaching backup relaying functions of a nearby directional distance protection system. The voltage, current, and frequency perturbations caused by the fault resulted in the tripping of about 4,610 MW of generation, some as far away as Alberta, Canada, and the loss of about 995 MW of load. The frequency in the Western Interconnection fell to 59.54 Hz.
This disturbance, triggered by the failure of a single auxiliary relay, eventually led the electric industry in North America to examine its protection systems for potential single points of failure. The duration of the fault also highlighted the truism that a single-line-to-ground fault left uncleared will eventually become a multiphase fault. The event likewise made analysts wary of protracted faults and their potential to cause cascading of the electric system, a concern further emphasized by several other disturbances discussed in this article.
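A minimal sketch of the kind of single-point-of-failure screening this disturbance prompted: list the components that must all operate for each fault-clearing path and flag anything common to every path. The component names (for example, the shared auxiliary relay) are hypothetical.

```python
# Single-point-of-failure screen for a protection scheme: each clearing path
# is the chain of components that must all work for the breaker to trip.
# Any component shared by every path is a single point of failure.
from functools import reduce

clearing_paths = {
    "primary": {"relay_A", "aux_relay_94X", "dc_supply_1", "trip_coil_1"},
    "backup":  {"relay_B", "aux_relay_94X", "dc_supply_2", "trip_coil_2"},
}

single_points = reduce(set.intersection, clearing_paths.values())
print("Single points of failure:", single_points or "none")
```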
One of the vexing problems now facing the system is the detection of forced oscillations caused by misbehaving control systems. In this event, the control system of a small (30-MVA) generator excited a 0.395-Hz oscillation on the California–Oregon Interties. Because the oscillation was close to the natural north–south system mode (0.403 Hz), the resultant oscillation was observable throughout the Western Interconnection. The California Independent System Operator (CAISO) Reliability Coordinator’s modal analysis software detected the low damping on the North–South Mode B. Such forced oscillations can become more problematic as the control systems of the increasing number of IBRs come into play.
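A minimal sketch of that kind of screening, assuming synthetic PMU data: estimate the spectrum of a measured quantity and flag a sustained peak that lands near a known interconnection mode. The 30-sample/s rate, the signal, and the 0.02-Hz tolerance are illustrative assumptions; 0.403 Hz stands in for the North–South Mode B cited above.

```python
# Forced-oscillation screening sketch: locate the dominant spectral peak in a
# PMU measurement and compare it with known natural modes of the system.
import numpy as np
from scipy.signal import welch

fs = 30.0                                  # PMU reporting rate, samples/s
t = np.arange(0, 600, 1 / fs)              # 10 minutes of data
pmu_mw = 200 + 5 * np.sin(2 * np.pi * 0.395 * t) + np.random.randn(t.size)

freqs, psd = welch(pmu_mw - pmu_mw.mean(), fs=fs, nperseg=4096)
peak_hz = freqs[np.argmax(psd)]

known_modes_hz = {"North-South Mode B": 0.403}
for name, mode_hz in known_modes_hz.items():
    if abs(peak_hz - mode_hz) < 0.02:      # within 0.02 Hz of the natural mode
        print(f"Forced oscillation at {peak_hz:.3f} Hz is close to {name}"
              f" ({mode_hz} Hz): expect wide-area amplification.")
```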
This disturbance was of a different nature (virtually no load was lost), but it warranted careful analysis because about 4,260 MW of generation was tripped in the Eastern Interconnection, causing the frequency to dip to 59.864 Hz. The event was initiated by six single-line-to-ground faults on a 765-kV line equipped with single-pole switching and was related to fast-valving action by the generators associated with that line.
The event solved the mystery of the New York City gas turbine trips mentioned in the 2003 blackout analysis: their controls were susceptible to frequency deviations. The event also highlighted the susceptibility of some generation to tripping by a unit digital control system function known as power–load unbalance (PLU), which operated on three generators. That function is used in place of a mechanical overspeed protection system to prevent turbine overspeed. The PLU logic senses turbine mechanical input, in terms of steam flow and pressure, and generator current output to quickly determine whether the unit has suffered a complete loss of outlet paths, also termed 100% load rejection. The PLU function typically triggers a rapid closing of the steam valves, reducing the mechanical power input to zero and thereby preventing further acceleration of the unit. The valves are subsequently reopened if the PLU condition clears. Three units in Indiana and Illinois (totaling about 1,670 MW) experienced PLU operations, contributing to the severity of the generation loss for the Eastern Interconnection.
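A heavily simplified sketch of the PLU concept described above; real implementations use vendor-specific signals, filtering, and arming logic, and the 0.4-per-unit pickup and example values here are hypothetical.

```python
# Power-load unbalance (PLU) concept: compare a measure of turbine mechanical
# power with generator electrical output and, if the unbalance looks like a
# full load rejection, fast-close the steam valves. Thresholds are hypothetical.
def plu_triggered(mech_power_pu: float, elec_current_pu: float,
                  unbalance_pickup: float = 0.4) -> bool:
    """True if mechanical input exceeds electrical output by more than the
    pickup, suggesting the unit has lost its electrical outlet."""
    return (mech_power_pu - elec_current_pu) > unbalance_pickup

# Example: steam flow still near full load while generator current collapses.
if plu_triggered(mech_power_pu=0.95, elec_current_pu=0.10):
    print("PLU: fast-close control/intercept valves to prevent overspeed")
```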
In this system disturbance, a very wide variety of system conditions was exhibited, rivaling the 2003 Northeast blackout in complexity. Initially, North Dakota, Minnesota, Manitoba, and Saskatchewan separated from the Eastern Interconnection, with overfrequency in the resultant “island.” Saskatchewan then separated from Manitoba and North Dakota because of overfrequency conditions at the Boundary Dam generating station. New overfrequency protection systems had recently been installed on those generators, but the relays were left at factory settings (60.5–61.0 Hz), well within the machines’ 102% (61.2-Hz) continuous overfrequency capability, and they tripped 711 MW on seven generators. Although the machines were designed for their potential exposure to overfrequency in that part of the system, the overfrequency relay settings were not adjusted for those potential conditions. Undesirable tripping of the Boundary Dam generation because of those settings was deemed to be causal in the separation. As a result of the separation, SaskPower lost about 900 MW of load and generation because of underfrequency protection.
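A small worked check of the setting problem described above: a 102% continuous capability on a 60-Hz machine corresponds to 61.2 Hz, so factory trip settings of 60.5–61.0 Hz can trip a unit that is still within its continuous capability. The screening loop itself is a generic illustration.

```python
# Compare overfrequency relay settings against the machine's continuous
# overfrequency capability (102% of nominal on a 60-Hz system is 61.2 Hz).
nominal_hz = 60.0
continuous_capability_hz = 1.02 * nominal_hz          # 61.2 Hz
factory_settings_hz = [60.5, 61.0]

for setting in factory_settings_hz:
    if setting < continuous_capability_hz:
        print(f"Trip setting {setting} Hz is below the "
              f"{continuous_capability_hz:.1f}-Hz continuous capability: "
              "undesirable tripping is possible.")
```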
Resynchronization of Saskatchewan to the Eastern Interconnection was hampered by high voltages and large standing angles across breakers on open-ended tie lines between Saskatchewan and Manitoba. Those conditions caused the misoperations of some synchronization protection schemes.
The South Florida disturbance was initiated by a single-line-to-ground fault that was not immediately cleared because the protection systems had been turned off by a technician during troubleshooting. As with all single-line-to-ground faults left uncleared, this one evolved into a three-phase fault that was finally cleared by remote protection systems. The disturbance included the tripping of 25 transmission circuits and about 4,000 MW of generation and the loss of about 2,300 MW of load to underfrequency load shedding (UFLS) relays.
Figure 4 shows the area (in blue) where the underfrequency of 59.82 Hz caused the UFLS system to trip about 2,300 MW of load across the Florida Reliability Coordinating Council (FRCC) system. The frequency dip in northern Florida did not reach the UFLS trip points, so no load was tripped there.
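As background for the UFLS action described above, the sketch below shows how a staged UFLS program sheds successive blocks of load as frequency falls through its set points. The stage thresholds and block sizes are hypothetical placeholders, not the FRCC program settings.

```python
# Staged underfrequency load shedding: as frequency falls through successive
# set points, predefined blocks of load are tripped. Values are hypothetical.
UFLS_STAGES = [          # (trip frequency in Hz, fraction of load shed)
    (59.8, 0.09),
    (59.5, 0.09),
    (59.2, 0.09),
]

def ufls_shed_fraction(min_frequency_hz: float) -> float:
    """Total fraction of load shed for the lowest frequency reached."""
    return sum(block for trip_hz, block in UFLS_STAGES
               if min_frequency_hz <= trip_hz)

print(f"Shed fraction if frequency reaches 59.6 Hz: {ufls_shed_fraction(59.6):.0%}")
print(f"Shed fraction if frequency stays at 59.9 Hz: {ufls_shed_fraction(59.9):.0%}")
```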
Some important aspects and results of the South Florida disturbance are the following:
The South Florida disturbance marked the first use of PMUs across the entire Eastern Interconnection to show the wave-like behavior of the frequency, voltage, and current perturbations traveling across the interconnection. Figure 5 shows the temporal displacement and attenuation of the oscillations at different locations throughout the interconnection. An analysis of the oscillations at distant locations across the interconnection also highlights the wide range of potential impacts.
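A minimal sketch of how such PMU data can be reduced to arrival times, using synthetic frequency traces; the delays, dip depths, 30-sample/s rate, and 0.02-Hz threshold are illustrative assumptions, not measurements from this event.

```python
# Estimate when the frequency deviation first exceeds a small threshold at
# each PMU and compare arrival times, showing the "wave" traveling across the
# interconnection. The traces below are synthetic, delayed, attenuated dips.
import numpy as np

fs = 30.0                                   # PMU reporting rate, samples/s
t = np.arange(0, 20, 1 / fs)

def synthetic_dip(delay_s, depth_hz):
    f = np.full(t.size, 60.0)
    f[t >= delay_s] -= depth_hz * (1 - np.exp(-(t[t >= delay_s] - delay_s) / 2))
    return f + 0.001 * np.random.randn(t.size)

pmus = {"near the event": synthetic_dip(1.0, 0.18),
        "mid-distance":   synthetic_dip(1.6, 0.10),
        "far away":       synthetic_dip(2.4, 0.05)}

threshold_hz = 0.02
for name, freq in pmus.items():
    arrival = t[np.argmax(60.0 - freq > threshold_hz)]
    print(f"{name}: deviation exceeds {threshold_hz} Hz at t = {arrival:.2f} s")
```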
This system disturbance, which resulted in the loss of more than 7,800 MW of load and almost 7,000 MW of generation, was initiated by the tripping of a 500-kV line due to a series capacitor switching error. Over an 11-min period, the underlying transmission systems between Arizona and San Diego cascaded and eventually blacked out San Diego.
Some notable observations were as follows:
In this disturbance, the Washington, D.C., area experienced a severe, prolonged voltage sag caused by a protracted 58-s fault on the 230-kV system in eastern Maryland. The event was initiated by the failure of a 230-kV lightning arrester and the failure of the protection system to isolate that fault. About 530 MW of load was lost as customers’ loads (including those at the White House) automatically switched to backup power sources and customer protection systems separated facilities from the grid because of the low voltage. Almost 2,000 MW of generation was tripped during the event, including two nuclear units at the Calvert Cliffs Nuclear Power Plant in eastern Maryland.
The protection system failure was due to the failure of the auxiliary relay trip circuits to operate on both redundant protection systems: one circuit failed because of a loose connection; the other appeared to have an intermittent discontinuity. Those protection systems employed earlier, “teenaged” digital relays that required auxiliary relay circuits to handle the currents necessary to trip the breakers.
The Washington, D.C., disturbance also uncovered the reverse power relay potential problems discussed earlier.
Beginning in 2016, the North American electric system began experiencing system disturbances caused by the tripping of IBRs. For a significant period, the industry had believed the majority of IBRs to be limited to rooftop solar systems. However, the solar industry had moved well beyond the rooftops and had developed substantially sized solar plants (from 200 to 500 MW) in the desert Southwest. Those plants consisted of many IEEE Standard 1547-2003-compliant inverters installed en masse to create large solar facilities. Unfortunately, the voltage and frequency ride-through requirements of that standard were insufficient to meet the needs of the BPS, where momentary cessation could not be tolerated as it could on the distribution system in small photovoltaic (PV) facilities.
Forensic event analysis was now facing the relatively unknown realm of power electronics applied at utility scale (1 MW and higher). Power engineers had to learn the different control modes of IBR operation (a simplified mode-selection sketch follows this list):
Continuous operation mode: Actively inject current into the grid
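The sketch below, referenced above, illustrates with hypothetical voltage bands the kind of mode selection an early utility-scale inverter might apply: keep injecting current in a normal band, cease injection (“momentary cessation”) outside it while staying connected, and trip only for extreme excursions. Real inverters use vendor-specific bands, timers, and frequency logic as well.

```python
# Hypothetical inverter mode selection based on terminal voltage. The bands
# are placeholders for illustration only, not any vendor's or standard's values.
def inverter_mode(voltage_pu: float) -> str:
    if 0.9 <= voltage_pu <= 1.1:
        return "continuous operation"       # actively inject current
    if 0.5 <= voltage_pu < 0.9 or 1.1 < voltage_pu <= 1.2:
        return "momentary cessation"        # stop injecting, stay connected
    return "trip"                           # disconnect from the grid

for v in (1.0, 0.7, 0.3):
    print(f"V = {v:.1f} pu -> {inverter_mode(v)}")
```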
Accurately representing IBRs poses an increasing challenge in event modeling because very little of this IBR dynamic behavior is captured in conventional power flow and dynamic simulations.
The grid operators also were faced with a world where SCADA scan rates are almost useless and even PMUs are too slow to capture the millisecond-level dynamic behavior of the IBRs. Similarly, the interarea oscillations that had been involved in several system disturbances are no longer limited to the oscillatory behavior of rotating machines (synchronous machines). Moreover, the interactions between inverters and other power electronics and controls are no longer confined to the sub-1-Hz frequencies dictated by the physics of the machines. IBRs are, after all, computer-controlled devices that can react in milliseconds to system conditions. If not well coordinated, the IBR response speed may destabilize the overall power grid; if used carefully, that speed can be used to tremendous advantage for system stability.
With the growing number of IBR facilities, there is an ever-increasing chance of control interaction and forced oscillations. If the frequency of the resultant oscillations is close to the resonant frequency (i.e., the natural oscillatory modes of the interconnection), the forced oscillation created by a 200-MW plant in Florida can be observed across the entire interconnection.
Also of concern is that the PMUs themselves, which have been invaluable in observing system oscillations and overall performance, cannot accurately measure perturbations far outside of their primary designed bandwidth.
NERC has published several technical reference documents and recommended practices and has issued industry guidance on best practices for incorporating IBRs into the BPS.
There have been several system disturbances in the recent past in which significant amounts of IBRs and other resources were lost (“tripped” or went into momentary cessation) because the IBR performance was undesirable for common system events (faults and switching). These include, but are not limited to, the following:
Blue Cut Fire disturbance (2016)
A closer look at some of these disturbances can be instructive about the potential problems the industry is facing in event analysis. In all cases, there were very few, if any, point-on-wave high-speed data recordings with which to adequately analyze specific IBR behavior.
The Blue Cut Fire disturbance kicked off a rash of significant system disturbances caused by IBRs during what are considered to be normal transmission system operating conditions. That disturbance occurred during a forest fire in California that caused multiple line faults. The Blue Cut Fire itself caused thirteen 500-kV line faults and two 287-kV line faults. The most egregious was the fault at 11:45:06 Pacific Daylight Time.
Several distinct observations of this event are listed here:
Following this event, a detailed report was issued by NERC in June 2017, highlighting the inverters’ use of “momentary cessation” (ceasing to inject current). NERC also issued an industry recommendation on 20 June 2017 to eliminate the use of momentary cessation on IBRs wherever possible and to limit instantaneous tripping for frequency perturbations.
The California fires continued to wreak havoc with IBRs in the electric system. This event included the following:
Following this event, a detailed disturbance report was published by NERC in February 2018 with a second industry advisory issued in May 2018.
This event involved
This event involved the following:
A joint NERC–Western Electricity Coordinating Council (WECC) report from August 2020, titled “WECC Base Case Review: Inverter-Based Resources,” highlights the problems of accurately modeling the IBR fleet. Figure 8 illustrates the differences between actual system measurements and IBR model performance.
It is obvious from those illustrations that the current models do not remotely reflect the realities of inverter performance.
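A minimal sketch of the playback-style comparison behind such figures: drive the model with recorded conditions and score how far the simulated response is from the measured one. The traces here are toy arrays, and normalized root-mean-square error is just one simple metric.

```python
# Quantify model-versus-measurement mismatch for an IBR active-power response
# using a normalized RMS error. The arrays are toy data for illustration.
import numpy as np

measured_mw  = np.array([100, 100, 20, 35, 60, 85, 100, 100], dtype=float)
simulated_mw = np.array([100, 100, 95, 96, 98, 99, 100, 100], dtype=float)

rmse = np.sqrt(np.mean((simulated_mw - measured_mw) ** 2))
nrmse = rmse / (measured_mw.max() - measured_mw.min())
print(f"RMSE = {rmse:.1f} MW, normalized RMSE = {nrmse:.0%}")
```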
To continue to monitor and improve the reliability of a BPS with growing IBR penetration, there are several things that should be pursued in addition to NERC’s three-pronged approach: adoption of the NERC Reliability Guidelines, improvements to the U.S. Federal Energy Regulatory Commission generator interconnection procedures and agreements, and enhancements to the NERC Reliability Standards.
Remember, the new “smart grid” of tomorrow is only as smart as we envision it and make it happen today. What we do is not rocket science—it is more important!
“Major event analysis reports,” North Amer. Elect. Rel. Corp., Atlanta, GA, USA. [Online]. Available: https://www.nerc.com/pa/rrm/ea/Pages/Major-Event-Reports.aspx
“Event analysis, reliability assessment, and performance analysis,” North Amer. Elect. Rel. Corp., Atlanta, GA, USA. [Online]. Available: https://www.nerc.com/pa/RAPA/Pages/default.aspx
“WECC base case review: Inverter-based resources,” North Amer. Elect. Rel. Corp., Atlanta, GA, USA. [Online]. Available: https://nerc.com/comm/PC/Pages/Inverter-Based-Resource-Performance-Task-Force.aspx
Robert W. Cummings is a president and founder of Red Yucca Power Consulting, LLC, Albuquerque, NM 87111, USA.