Introduction
I may have given the impression in an earlier blog that the security people are the cause of the recent resurgence of interest in retro control-system technologies. They’re certainly a symptom, but the disease is much larger than that. I have said this many times: if you want a reflection of how well a company is doing, look no further than the IT department. The same is true for control systems management.
Growth
Industrial Control Systems have become terrifyingly complicated. In 1988, I worked on a SCADA system built around a PDP-11/73. It was a 16-bit computer with 2 MB of RAM and an 80 MB hard drive, running at what was then a blazing speed of 15 MHz. The WAN interface ran at an unbelievable 56 kbps (via a group-band modem, for those of you familiar with frequency-division-multiplex microwave systems). Our SCADA system had five communications lines running 1200 baud Bell 202 modems on party-line circuits. This system handled about 70 RTUs and approximately 750 points being sent in from the field.
Years later, that SCADA system has megabit-per-second WAN speeds going everywhere, nearly 200 RTUs, and about 7,000 points from the field. Meanwhile, the servers live in a virtualized environment with many terabytes of storage, tens of gigabytes of RAM per VM host, and multi-core processors that run at GHz clock speeds.
Notice anything weird here? Yes, the system has grown: communications speeds to the field have increased, the point count has gone up by an order of magnitude, and the number of RTUs has roughly tripled. But none of this explains the incredible explosion in computing hardware and operating system size and complexity, not to mention history data gathering.
Operationally, yes, things have changed and grown. However, they’re not radically different from what we did decades ago.
Yet the computing platforms we’re using have exploded vastly out of proportion to what we’re doing today. What’s going on here?
- Operating systems and the applications that run on them are bloated with useless or even dangerous features.
- We’re gathering non-operational data in real time. The operators don’t know what to do with this information. It confuses them.
- The data resolution has gone up insanely for no practical reason. In fact, it has grown so much that alarm management is a problem. More on this in a bit.
- We’re gathering data without much thought of purpose or philosophy.
Because of all this bloat, we’re no longer aware of everything that is going on in a control system. There are numerous places for malware to hide, too many possible failure modes, and in all this complexity, we’re using MORE complexity to monitor the integrity of an already complex system.
Causes
The earlier computing systems were much like the early days of steam power. People would build massive stationary machines, and these would be used for multiple purposes. At the end of the day, it was a simple arrangement that reasonably educated craftsmen could wrap their minds around. That PDP-11 system was analogous to those early stationary steam engines.
But today, processors, memory, and storage are less and less expensive while performance continues to climb exponentially. So we build multiprocessor systems. We build applications full of objects and methods borrowed from operating systems inside operating systems. We build big, sloppy networks with a mix of speeds up to 10 gigabits per second. Why? Because that’s what the office people are doing. We virtualize the hardware with insane complexity, and we build massive databases with enormous indexing overheads. We gather data at much faster rates simply because we have the bandwidth to do it.
Many years ago, to remain compatible with the office, SCADA system designers started using personal computing machinery and software (MS-DOS and Windows) that were optimized for office applications, not control systems. They did so because performance was increasing very rapidly, and prices were low enough that even if a machine failed, it was less expensive to replace than the dedicated designs.
The move to Windows and x86 also happened because open standards and file formats for communicating with office applications didn’t exist. Literally the only way to generate a spreadsheet-compatible file was to load the spreadsheet application on the control system. Thankfully, the compatibility issue is not a problem any longer. But inertia is still a big deal, and we’re still using office-grade equipment, technologies, and policies in a control systems environment where the economic imperatives are not as enticing as they used to be.
Another theory behind using office equipment was that people who understood that technology would be easier to find and less expensive to hire. Yet the complexity of these systems is now so high that we’re no longer dealing with one or two people to maintain them; we’re dealing with a significant chunk of an IT department. Are we still saving money here?
In fact, the office has gone places that control systems shouldn’t. In office applications, the impact of a crash or a reboot is minimal. So software and hardware have sacrificed reliability, maintainability, and security for performance. They have chosen to build complexity on top of complexity to the point where it’s nearly impossible for any one person to know everything that’s going on inside a system. Is this really necessary?
Alternatives
Before we begin this discussion, let me be clear: I’m not suggesting that we should go back to using PDP-11 hardware. Like the old stationary steam engines, it was not an efficient platform. What used to take kilowatts of power, significant racks of equipment, and LOTS of wiring and leased-line services can now be replicated with very inexpensive, low-power, single-chip computing over inexpensive networking that is already in place.
However, the complexity is a problem. Because these platforms are so complex, we can’t know what’s going on in all their dark corners, so security is a problem too. On the other hand, compatibility and performance are not the problems they once were.
What we need are simple systems with comprehensible software. Even a Raspberry Pi would be more than enough for many of these applications. We need to simplify our systems so that they’re comprehensible once more.
Another thing to consider is redistributing the complexity toward the field. IT departments are notorious for promulgating standards that involve big glass rooms with lots of servers and heavy air-conditioning loads. This disease is starting to infest control system designs. Control systems need to be survivable and distributed, not centralized.
And in fact, we’re already doing this with smart instruments, variable frequency drives, and many more embedded technologies. Side note: IIoT is just another marketing name for this embedding of functionality.
The goal of this distribution is to limit the computing power in each device and to thoroughly audit the code in it. We don’t need multi-gigabyte operating systems with virtualization inside virtualized platforms, running Kubernetes-packaged applications on processors with instruction sets so large that people actually go looking for undocumented instructions to find secret back doors into the memory management system.
We need to ditch the baroque operating systems, get a clean sheet of paper, and start over.
What might this look like?
1. Eliminate the plug-and-play philosophy. Industrial Control Systems are basically static beasts. They don’t change very often, and when they do, it’s usually accompanied by physical changes in the plant. Remember when we had to compile new drivers for an operating system and then determine major and minor numbers for each one so that we could create handles in the /dev directory to use them? In a control system you rarely ever had to do that twice. What’s wrong with going back to something like that? (There’s a sketch of what that looks like after this list.)
2. Eliminate unnecessary processors and storage places. No USB flash drives. Back up files to another storage array on the other side of the plant or distribution system. If both sides are destroyed, you will have much bigger problems than just managing the control system.
3. Eliminate the self-configuring network protocols. The system is supposed to have records of where the resources it needs are. If the record doesn’t exist, then it’s not a legitimate resource. Self-configuration features are wonderful things for an integrator, but frankly, an operator probably doesn’t care. Their goal is to keep things as consistent and as similar as possible to what was there before.
4. Strip the API down to something basic. Don’t give user space things or features it doesn’t need. The more bloated the API gets, the more opportunities there are for someone to break in.
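To make that first item concrete, here is a minimal sketch of statically creating a device node, assuming a hypothetical RTU serial-port driver that was assigned major number 240. The path /dev/rtu_serial0 and the numbers are made up for illustration; they are not from any real system.

```c
/* A minimal sketch of static device-node creation, assuming a
 * hypothetical RTU serial driver assigned major number 240.
 * The path and numbers are illustrative only.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* makedev() */

int main(void)
{
    /* Character device, readable/writable by owner and group only. */
    mode_t mode = S_IFCHR | 0660;

    /* Major 240 (a local/experimental range), minor 0: first port. */
    dev_t dev = makedev(240, 0);

    if (mknod("/dev/rtu_serial0", mode, dev) != 0) {
        perror("mknod /dev/rtu_serial0");
        return 1;
    }
    return 0;
}
```

You do this once, when the plant changes, and then it stays that way. Nothing appears on the system that an engineer didn’t deliberately put there.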
Network Resources
Keep in mind that SCADA systems for critical infrastructure are expected to be survivable and reliable. Bandwidth is NOT a major criterion. Yet there is no shortage of fools who will tell you that you can run infrastructure SCADA operations over a VPN on the Internet. Their argument is that nothing is more reliable than the Internet. That argument is irrelevant. While it may be true that the Internet itself doesn’t go down much, the Internet Service Providers are a different story. They frequently have problems but rarely ever publish these issues because there are few legal requirements for them to do so.
SCADA telecommunications are often handled by radio, hard-wired telephone lines, private fiber-optic connections, private microwave back-hauls, and so forth. It is not wise to overload these resources with “real-time data” requirements from analysts who do not understand what they’re asking for or what resources they’re consuming.
Perhaps there should be a separate data-gathering network so that analysts can have their strategic data without burdening the SCADA system with so much tactically insignificant stuff. They can build it however they like, and if it fails in some way, it isn’t likely to stop anything that matters.
Limiting Complexity
This complexity insanity manifests itself in other ways too. Data is being gathered at a rate that makes no sense. There is a serious hazard of overloading operators with too much information at once. If alarms are not managed properly, it can lead to extreme disasters. For example, poor alarm management was one of the principal reasons why situational awareness was so poor during the King County wastewater treatment plant flood. (See Section 3.1.1.4 of the AECOM post-disaster report.)
Managing complexity must start in the field. Control Systems Engineers need to grow a backbone and start explaining the real cost of gathering tactically useless information. Deadbands should be set so that they keep people posted on what’s happening without insane floods of random data.
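Here is roughly what I mean, as a minimal sketch of deadband filtering (report by exception) running in the field device. The 0.5 engineering-unit deadband and the sample()/report() names are my own assumptions for illustration, not values from any particular RTU.

```c
/* A minimal sketch of deadband (report-by-exception) filtering in the
 * field device. The deadband value and function names are assumptions.
 */
#include <math.h>
#include <stdio.h>

static double last_reported = 0.0;
static const double DEADBAND = 0.5;   /* engineering units, e.g. psi */

/* Placeholder for whatever actually queues a value to the SCADA master. */
static void report(double value) { printf("report %.2f\n", value); }

/* Called for every raw sample; only meaningful changes go upstream. */
static void sample(double value)
{
    if (fabs(value - last_reported) > DEADBAND) {
        report(value);
        last_reported = value;
    }
}

int main(void)
{
    /* Five raw samples; only 10.0, 10.6, and 11.5 exceed the deadband
     * relative to the last reported value, so only three reports go out. */
    double samples[] = { 10.0, 10.2, 10.6, 11.5, 11.6 };
    for (int i = 0; i < 5; i++)
        sample(samples[i]);
    return 0;
}
```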
Many analysts will attempt to summarize data at the operations control center by totaling or averaging values from the field. Instead of integrating flow minute by minute over a month at the HMI (along with any glitches or gaps the data may have), they can ask for daily totals computed in the field, integrated every few hundred milliseconds. It will be more accurate, the control system doesn’t have to transmit ALL the data, and there is less risk of confusing people with high-resolution information.
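A field-side totalizer is almost trivially simple. Here is a minimal sketch, assuming a flow rate in units per second sampled every 200 ms; the accumulate()/end_of_day() names and the sampling interval are illustrative assumptions.

```c
/* A minimal sketch of totalizing flow in the field device instead of
 * integrating minute-by-minute samples at the HMI. Names, units, and
 * the 200 ms sample interval are assumptions for illustration.
 */
#include <stdio.h>

static double daily_total = 0.0;

/* Called on every sample with the current flow rate (units per second)
 * and the elapsed time since the last sample. */
static void accumulate(double flow_rate, double dt_seconds)
{
    daily_total += flow_rate * dt_seconds;   /* simple rectangular integration */
}

/* Called once at midnight: send one well-behaved number upstream. */
static void end_of_day(void (*send_total)(double))
{
    send_total(daily_total);
    daily_total = 0.0;
}

static void send_total_stub(double total) { printf("daily total: %.1f\n", total); }

int main(void)
{
    /* Simulate one hour of 200 ms samples at a constant 2.0 units/s:
     * the total comes out to 7200.0, exactly 2.0 * 3600 seconds. */
    for (int i = 0; i < 3600 * 5; i++)
        accumulate(2.0, 0.2);
    end_of_day(send_total_stub);
    return 0;
}
```

One number per day crosses the wire instead of forty-three thousand samples, and the integration never sees a communications glitch.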
Alarm processing should go all the way out to the field as well. If the power goes out, it may trigger many other alarms that are utterly irrelevant. For example, the limit switches on a large valve may go into error mode because the power that wets the switches isn’t there any more. If the power goes out, those limit switch alarms should be suppressed. At the end of the day, the operators don’t want all those extraneous alarms. They want to know what broke and who they should send to fix it.
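As a rough sketch of what that field-side suppression could look like, here is the valve example in code. The point names and the raise_alarm() hook are my own assumptions, not from any real alarm package.

```c
/* A minimal sketch of field-side alarm suppression: if station power is
 * lost, don't raise the limit-switch alarms that depend on that power.
 * Tag names and the raise_alarm() hook are assumptions.
 */
#include <stdbool.h>
#include <stdio.h>

static void raise_alarm(const char *tag) { printf("ALARM: %s\n", tag); }

static void evaluate_valve_alarms(bool station_power_ok,
                                  bool limit_open_fault,
                                  bool limit_closed_fault)
{
    if (!station_power_ok) {
        /* The wetting voltage is gone; the limit-switch faults are
         * consequences, not causes. Report only the root alarm. */
        raise_alarm("STATION_POWER_LOSS");
        return;
    }
    if (limit_open_fault)   raise_alarm("VALVE_LIMIT_OPEN_FAULT");
    if (limit_closed_fault) raise_alarm("VALVE_LIMIT_CLOSED_FAULT");
}

int main(void)
{
    /* Power lost: only the root-cause alarm goes to the operator. */
    evaluate_valve_alarms(false, true, true);
    /* Power present and a genuine limit fault: report it normally. */
    evaluate_valve_alarms(true, true, false);
    return 0;
}
```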
Now that’s not to say that we shouldn’t have such alarm data at the site. We should! But there is no need to send it off-site. It should be there for analysis by the on-site technical staff if needed. Rugged, offline storage is cheap now. We can afford not to send everything back to an operations center. This is also useful for forensics after an event.
In fact, the majority of the data collected is not worth sharing or recording centrally. Leave it at the site in case anyone wants it, and let it get overwritten after a month, a year, or whatever.
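On-site retention with automatic overwrite can be as simple as a fixed pool of log slots that wraps around. A minimal sketch, assuming daily files and a 31-day window; both the window and the file naming are arbitrary choices for illustration.

```c
/* A minimal sketch of on-site retention with automatic overwrite: keep a
 * fixed number of daily log slots and reuse the oldest one. The file
 * naming and the 31-day window are assumptions.
 */
#include <stdio.h>

#define RETENTION_DAYS 31

/* day_number is days since some epoch; the slot index wraps, so day 32
 * silently overwrites day 1 and nothing ever has to leave the site. */
static FILE *open_daily_log(long day_number)
{
    char path[64];
    snprintf(path, sizeof(path), "history_slot_%02ld.log",
             day_number % RETENTION_DAYS);
    return fopen(path, "w");   /* "w" truncates: old data is overwritten */
}

int main(void)
{
    FILE *log = open_daily_log(12345);   /* arbitrary day number */
    if (log) {
        fputs("sample history record\n", log);
        fclose(log);
    }
    return 0;
}
```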
What SCADA Stands For
People seem to forget that the first letter of SCADA stands for Supervisory. We should not send all the data, just the information that is of tactical interest to the operators. I’m not suggesting that we not collect the other data, but that we collect it, store it on site as needed, and perhaps use a different networking infrastructure to forward that data to those who may care about it.
The reason SCADA systems are getting so bloated and complex is that the primary goal, serving tactical operational needs, is being ignored. People are asking for all sorts of “nice to have” data, and while no single request is terribly expensive or onerous, together they are becoming death by a thousand cuts.
The more complex we build these systems, the more difficult they are to maintain and defend. If things get too complex, the operators and their managers will start looking for simpler ways to automate. Remember, SCADA exists to operate things economically, not to provide data-holic analysts with lakes of questionably useful data.