A. Ginter, Waterfall Security Solutions, Alberta, Canada
Cyber-securing oil and gas automation to ensure safe, reliable and efficient physical operations is becoming more challenging. This article explores the evolving threat environment, how expectations for cybersecurity are changing, how information technology/operational technology (IT/OT) dependencies have emerged as an important issue in most cybersecurity programs, and how owners and operators are addressing this new issue.1
Pivoting attacks. Remote-control pivoting attacks are the new normal for high-end ransomware groups and nation-state intelligence agencies. What are these attacks? Threat actors send phishing emails or social media messages to employees of a target business, teasing out remote access credentials or offering malicious attachments to click on. Sooner or later, a targeted insider succumbs to the attack.
The adversary then uses the stolen credentials to log into the victim’s network and install malware or uses the malicious attachment to download the malware. In both cases, the malware is a remote access trojan (RAT). The RAT connects to an internet-based command and control center (C2). The attacker logs into the control center some time later and uses the connection to the RAT to send attack commands to the compromised IT computer. The RAT malware executes those commands and sends the results, or even screen images, to the attacker over an encrypted connection to make intrusion detection more difficult.
In short, the adversary uses the initially compromised IT equipment to attack other equipment; this is called "pivoting" the attack. When attackers pivot through IT/OT firewalls and explore the operations/automation networks of pipelines and refineries, those firewalls often forbid direct connections to arbitrary addresses on the internet. However, most RATs can daisy-chain their communications back through other compromised machines in the pivoting chain until those communications reach a machine that is in direct contact with the C2 and the human attacker.
This kind of attack is robust, routinely penetrating layers of firewalls between the IT network and the internet. Today, many IT/OT administrators assume that all equipment that is reachable from the internet via a pivoting path is at risk of eventual compromise.
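The notion of "reachable via a pivoting path" can be modeled as simple graph reachability. The Python sketch below (asset names are hypothetical, chosen for illustration) treats each connection a firewall rule set allows as a directed edge and computes everything an attacker starting on the internet could eventually pivot to:

```python
from collections import deque

def pivot_reachable(edges, start="internet"):
    """Return every node reachable from `start` by following allowed
    connections -- i.e., every asset an attacker could eventually pivot to.
    `edges` lists (source, destination) pairs: the connections the network
    permits, viewed as a directed graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first search
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen

# Example: an IT workstation reaches a historian through the IT/OT
# firewall, and the historian reaches a PLC -- all three are at risk.
# The engineering laptop has no inbound path, so it is not reachable.
edges = [
    ("internet", "it_workstation"),   # phishing / RAT beacon
    ("it_workstation", "historian"),  # allowed through IT/OT firewall
    ("historian", "plc"),             # allowed inside OT
    ("eng_laptop", "plc"),
]
print(sorted(pivot_reachable(edges)))  # ['historian', 'it_workstation', 'plc']
```

In this model, defending OT amounts to ensuring no directed path exists from the internet to OT assets, which is the motivation for the network engineering discussed later in the article.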
Engineering, not hope. This is a serious problem for automation networks, where worst-case consequences can be catastrophic. In many cases, disruption of an industrial network can bring about unsafe conditions that pose material threats to workers, the public, the environment and/or national security. What is meant by national security? The legal definition of critical infrastructure is “critical to the nation.” Any attack that puts the reliable operation of critical industrial infrastructure at risk puts national security at risk.
As employees, citizens, lawmakers and members of society become more aware of the ubiquity and power of pivoting attacks, we are all becoming increasingly concerned. For example, there was widespread surprise at the 2021 Colonial Pipeline incident because a cyber-attack impaired the critical infrastructure in an industry that most of society assumed had “solved” the cybersecurity problem long ago.
Most citizens assume that engineering teams in all industries have addressed their industries’ material risks, including cyber-attack risks. After all, this is why engineering professions are regulated in so many jurisdictions—engineering malpractice can put workers, the public, the environment and entire nations at risk.
For example, imagine that engineers have learned how to build suspension bridges at one-third the cost of today's bridges (using carbon fiber or some other innovation). Imagine, however, that the new design has a problem: the new bridges are prone to resonating at harmonic frequencies. People simply walking across a bridge are enough to cause oscillations that grow without bound, potentially to the point of tearing the bridge apart. To address this risk, imagine that engineers designed active hydraulic vibration dampers into the bridges. With multiply redundant computer-controlled dampers and multiply redundant power supplies, the bridges feel "rock solid."
How secure would each of us feel driving across such a bridge every day if we knew that the design engineer for the bridge “hoped” that, if there were a cyber-attack on the damper automation, we could detect the attack before it crippled the automation? How secure would we feel driving across the bridge every day if we knew the design engineer “hoped” that if we detected the attack, we could scramble an incident response team fast enough to prevent a disaster?
“Hope” is not what we expect of design engineers. We expect bridges to be designed to carry a specified load, in a specified operating environment, for a specified number of decades, with a large margin for error. For example, if we ask a civil engineer how they determine what load a pedestrian bridge must support, we often get a simple answer: design the bridge with barriers at either end so vehicles cannot enter, and the maximum load will be people. People are normally less than 2 meters (m) tall and are mostly made of water. Thus, the maximum load the bridge must bear can be modeled as 2 m of water the entire length and breadth of the bridge. Engineers then multiply that maximum load by eight and design the bridge to carry the multiplied load. When the “load” is people, the failure of the bridge under load is unacceptable.
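The load rule described above reduces to a few lines of arithmetic. In this sketch, the 2-m water column and the factor of eight come from the article; the deck dimensions and the use of mass as a stand-in for load are illustrative assumptions:

```python
# Model the worst-case pedestrian load as 2 m of water covering the
# entire deck, then apply a safety factor of 8 (per the rule above).
WATER_DENSITY = 1000   # kg/m^3 -- people are "mostly made of water"
CROWD_DEPTH = 2.0      # m -- people are normally less than 2 m tall
SAFETY_FACTOR = 8

def design_load_kg(length_m, width_m):
    """Design load (kg) for a pedestrian bridge of the given deck size."""
    max_load = length_m * width_m * CROWD_DEPTH * WATER_DENSITY
    return max_load * SAFETY_FACTOR

# A hypothetical 50 m x 3 m pedestrian bridge:
print(design_load_kg(50, 3))  # 50 * 3 * 2 * 1000 * 8 = 2,400,000 kg
```

The point of the exercise is not the numbers but the posture: the design target is explicit, conservative and verifiable, not hoped for.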
In the face of a pervasive threat of pivoting attacks, more people are starting to expect that when engineers design critical industrial automation systems, those systems reliably carry a specified threat load until at least the next opportunity to upgrade the security designs, with a large margin of error.
Dependencies. These new expectations, both societal and professional, are driving engineering teams to re-evaluate their cybersecurity programs, and an emerging problem is dependencies. When dependencies are examined, it frequently turns out that the industrial automation systems controlling and operating pipelines and refineries depend on IT assets for continuous operation. This is a problem because IT systems are nearly universally exposed to pivoting attacks from the internet, one way or another.
What are these dependencies? The textbook example is when Maersk, one of the world’s largest container shipping businesses, was compromised by NotPetya in 2017.2 NotPetya crippled Maersk’s networks, stopping almost all the company’s container shipping. Why? It was not because the malware crippled any of Maersk’s ships’ navigation or propulsion systems—ships continued to sail into ports on time, perfectly safely. It was not because the malware crippled any of the ports’ cranes, which were still able to remove containers from the ships and deposit them gently and safely on 18-wheeler trucks. The stoppage came because the company’s container tracking system was hosted in the IT network and the entire IT network was crippled. Without that system, it was no longer possible to print waybills for the truck drivers, instructing them where to deliver the containers. Consequently, all shipping stopped.
In the oil and gas industry, physical operations that depend on IT systems are common. Pipeline operations may depend on a functioning IT-based custody transfer system. Refinery operations may depend on a functioning IT-based laboratory information management system (LIMS). OT-critical communications may be piggy-backed on IT-based and IT-managed communications infrastructure. It does no good for engineering teams to design the world’s most bullet-proof industrial automation system if they need to shut down the physical process every time ransomware creeps into the IT system and impairs any function that OT systems or physical operations rely upon.
U.S. Department of Homeland Security Transportation Security Administration (TSA) pipeline security directives. The TSA SD-2021-02(A-E) series of security directives includes several new rules found in no other standard or regulation.3 To start, the directives state very clearly that the goal of the OT security program is to keep the pipeline running at the necessary capacity, even if the IT network is compromised. It may seem obvious that this is the implied goal of all industrial security programs, but the TSA was the first to put it into words.
A second unique requirement demands that, in a cyber emergency, OT networks be completely disconnected from IT networks. Once the OT networks are rendered or determined to be free of infection, the physical process can start again, no matter how long it takes to clean up the IT networks. To this end, owners and operators are required to inventory and assess any dependencies on IT systems and services, to understand how safe physical operations might be restarted when IT and OT networks no longer communicate. Restarting OT while IT remains compromised, however, is easier said than done.
What other industries do. How do other industries address this need? The North American Electric Reliability Corp. Critical Infrastructure Protection (NERC CIP) standard governs cybersecurity for automation critical to the national power grid and says nothing about dependencies. However, there are practically no dependencies on IT resources anywhere in the power grid. How can this be?
A guiding principle of the CIP standards is that every cyber asset (computer, switch or device) whose failure or mis-operation could impair any material part of the grid within 15 min is in scope for some or all CIP rules. Since the CIP standards can be expensive to implement, most owners and operators in the grid take great pains to minimize the number of assets able to impair power grid operations.
A textbook example is Windows Active Directory (AD) controllers. IT practitioners might propose that OT systems in power plants use the business’ IT-based AD infrastructure because it would require money and effort to set up and maintain a parallel OT AD infrastructure. However, if OT uses the IT AD infrastructure, then all IT AD controllers come into scope for most of NERC CIP. This is, of course, because the mis-operation of any AD controller could propagate problems throughout the AD system, very quickly impairing the automation controlling the power plants. In particular, CIP-005(R5.2) demands that owners and operators “identify and inventory all known enabled, default or other generic account types.” In a large organization with tens of thousands of employees on the IT network, this is an onerous and very expensive task. In a parallel OT AD system, only the OT users would need to be tracked this thoroughly and audited periodically. To the well-meaning IT teams asking us, “Do you know what a parallel AD infrastructure for OT would cost?” CIP compliance teams reply, “Do you know how much a shared AD infrastructure will cost, compliance-wise?”
The costs of CIP compliance efforts are such that most entities critical to the power grid have truly and thoroughly eliminated all OT dependencies on IT infrastructure.
Practical solutions. Is eliminating dependencies practical in midstream and downstream operations? Sometimes it is, and other times not. When it is impractical, what can be done about these dependencies? One solution for many kinds of dependencies is to manage three kinds of networks with the cybersecurity program: the business (IT) networks, industrial (OT) networks and one or more reliability-critical IT sub-networks.
For example, consider a refinery with a LIMS system that is essential to ensuring that the products produced meet contractual quality standards, and with contractual obligations that do not permit shipping product without reporting the quality measurements to the contract-holders. In many businesses, it is impractical to host the LIMS system on the OT network due to the tight integration needed between the LIMS system and the enterprise resource planning (ERP) system. Thus, if a cyber-attack cripples IT systems, including the LIMS system, most, if not all, of the refinery must be shut down until LIMS functionality is restored, even if OT assets are essentially invulnerable to these pivoting attacks because unidirectional gateway technologya or other network engineering tools are deployed at the IT/OT interface (FIG. 1).
To minimize this downtime, the dependencies must first be understood. Assuming the LIMS system is the only IT asset OT depends on, the LIMS system is moved into a rapid-restore sub-network (RRSN), and the dependency analysis must then determine what other systems the LIMS system depends upon. Operators must either eliminate these second-level dependencies or move the dependent systems (e.g., the ERP system) into the RRSN as well. In this example, it is assumed that the LIMS system did not depend on the ERP, but rather that the ERP depended on a functional LIMS system. Imagine, however, that the LIMS did depend on AD. Rather than pull all AD systems into the RRSN, the LIMS is disconnected from AD and made to manage LIMS accounts and passwords separately, not as part of the single-sign-on AD system. The result is that refinery operations depend only on the LIMS in the RRSN, and every dependency of RRSN systems on external systems has been eliminated.
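Deciding what belongs in the RRSN is a transitive-dependency calculation: everything OT needs, plus everything those systems need, and so on. The sketch below models this in Python; the system names and dependency maps are hypothetical, mirroring the refinery example:

```python
def rrsn_members(dependencies, ot_needs):
    """Compute which systems must move into the rapid-restore sub-network:
    every system OT depends on, plus everything those systems depend on,
    transitively. `dependencies` maps a system to the systems it needs."""
    members, stack = set(), list(ot_needs)
    while stack:  # depth-first walk of the dependency graph
        system = stack.pop()
        if system not in members:
            members.add(system)
            stack.extend(dependencies.get(system, []))
    return members

# The example above: OT needs the LIMS; the ERP needs the LIMS (not the
# reverse), so only the LIMS goes into the RRSN.
deps = {"erp": ["lims"], "lims": []}
print(sorted(rrsn_members(deps, ot_needs=["lims"])))  # ['lims']

# If the LIMS still depended on AD, AD would be pulled in too -- which
# is why the LIMS is cut over to locally managed accounts instead.
deps_with_ad = {"erp": ["lims"], "lims": ["ad"], "ad": []}
print(sorted(rrsn_members(deps_with_ad, ot_needs=["lims"])))  # ['ad', 'lims']
```

Running the analysis both ways makes the trade-off concrete: each dependency that is eliminated shrinks the RRSN and shortens the restore effort.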
The next step is to train the incident responders. The next time ransomware cripples IT systems, incident responders should not need to be concerned about the safety-critical automation: network engineering at the IT/OT interface eliminates the risk of pivoting attacks propagating into OT systems. In this example, however, the refinery is shut down because operators can neither produce nor ship product without quality sampling and analysis.
Therefore, incident responders must be trained to attend very quickly to RRSNs. If the LIMS system is hosted as a virtual machine and writes a copy of all LIMS transactions to a write-once, read-many (WORM) drive, incident responders can be trained to quickly restore the LIMS to the last-known-good virtual machine snapshot, replay the transaction log into the system and have the LIMS ready to continue within an hour. Incident responders must also discover how the attackers initially propagated their attack into the RRSN to compromise the LIMS system, and eliminate the potential for re-compromise. With that done and the LIMS restored, refinery operations can be restarted. The incident response teams can then take as long as they need to repair the remainder of the IT network, without impacting physical operations.
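The snapshot-plus-replay recovery described above can be sketched as follows. This is a toy model under stated assumptions: LIMS state is reduced to a dictionary of sample results, and each WORM log entry re-records one result written since the snapshot was taken:

```python
import copy

def restore_lims(snapshot, worm_log):
    """Rebuild LIMS state from the last-known-good snapshot plus the
    append-only (WORM) transaction log: restore the snapshot, then
    replay every logged transaction in order."""
    state = copy.deepcopy(snapshot)  # leave the golden snapshot untouched
    for sample_id, result in worm_log:
        state[sample_id] = result
    return state

snapshot = {"S-1001": "pass"}                        # last good VM snapshot
worm_log = [("S-1002", "pass"), ("S-1003", "fail")]  # written since snapshot
print(restore_lims(snapshot, worm_log))
# {'S-1001': 'pass', 'S-1002': 'pass', 'S-1003': 'fail'}
```

The design choice worth noting is that the log lives on write-once media: even if the attacker gains administrative control of the LIMS host, the already-written transaction history cannot be altered, so the replay is trustworthy.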
Takeaways. This approach to dependencies works better, worse or not at all, depending on the situation. The approach will not help if physical operations depend on IT systems, such as communications infrastructure, that are too big and dispersed to put into a sub-network. When the method works, it works better for pipelines that can go from a dead stop to full production in half a day than for refineries that can take a week or more to go from a full stop back to full production. Either way, when the method works, it is always faster than restoring the entire IT network before restarting physical operations. RRSNs are one example of the kind of innovation the world needs as the volume, consequence and sophistication of cyber-attacks continue to worsen. HP
NOTE
a Waterfall's WF-600
LITERATURE CITED
1 Waterfall, "2024 threat report: OT cyberattacks with physical consequences," online: https://waterfall-security.com/ot-insights-center/ot-cybersecurity-insights-center/2024-threat-report-ot-cyberattacks-with-physical-consequences/
2 Industrial Cybersecurity Pulse, "Throwback attack: How NotPetya ransomware took down Maersk," September 2021, online: https://www.industrialcybersecuritypulse.com/threats-vulnerabilities/throwback-attack-how-notpetya-accidentally-took-down-global-shipping-giant-maersk/
3 TSA, "Memorandum," July 2024, online: https://www.tsa.gov/sites/default/files/tsa-security-directive-pipeline-2021-02e-and-memo-508c.pdf
ANDREW GINTER is the VP of Industrial Security at Waterfall Security Solutions, where he leads a team of experts who work with the world’s most secure industrial enterprises. Before Waterfall, he led the development of high-end industrial control system products at HP, IT/OT middleware products at Agilent and the world’s first industrial SIEM at Industrial Defender. Ginter is the author of three books on IT/OT cybersecurity. He co-hosts the Industrial Security Podcast and contributes regularly to industrial security standards and best-practice guidelines.