Virtualizing a PLC?

In the fourth season of the cartoon sitcom The Simpsons, there was an episode where the town was flim-flammed by a salesman pitching a Monorail for the town. Everyone saw it as a great idea, but nobody could say why. Marge Simpson had her doubts, and of course, she was right. It didn’t work out.

That’s how I feel about this proposal to place a PLC on a cloud-based platform. It makes no sense at all. But it sure has a lot of people talking about it as if it were inevitable.

A very similar effort was tried about 25 years ago, and it failed in multiple implementations from various vendors. It was the PLC-on-the-PC concept. The problem is that many other things were also going on in that PC. Even though the PLC process was given a priority even higher than the OS, and could be given processor affinity and full-time processor availability, it still didn’t do well. Why?

1. The PC is not in the field. It’s usually in the administration building. The PLC process has to coexist with a lot of other applications. The network connections to get the I/O traffic to that office or server were not particularly reliable.

2. If the OS crashes, yes, the PLC can continue to work most of the time (I actually saw this failure mode). But that’s not good enough. Even if the PLC continues to run, someone is going to want the rest of that PC to work. This results in a full boot-up sequence, during which the PLC becomes unavailable.

3. PC hardware is not hardened environmentally. So if the building HVAC failed over the weekend, the PLC software might not stay online because the building itself was too hot for the PC to operate.

Now let’s review why we put the PLC in the field to begin with.

Networks are not reliable enough. Yes, the components themselves can be hardened. Yes, they can be pretty capable. But if you add a switch here, a firewall there, a fiber optic trunk, another router, and then a firewall to get to the VM host, your potential for failure is still higher than if you just stuffed a PLC into that field cabinet.
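To put rough numbers on that chain, here is a minimal Python sketch. It treats every device between the I/O and the VM host as a series element, so all of them must work at once. The availability figures and component names are made-up assumptions for illustration, not measured data from any plant.

```python
from math import prod

# Hypothetical availabilities (fraction of time each item is working).
# Illustrative assumptions only; substitute figures from your own plant.
FIELD_PLC = 0.9999
NETWORK_CHAIN = {
    "field switch": 0.9995,
    "field firewall": 0.9995,
    "fiber trunk": 0.9990,
    "router": 0.9995,
    "host firewall": 0.9995,
    "VM host": 0.9995,
}

HOURS_PER_YEAR = 8760


def series_availability(parts):
    """Availability of a chain where every part must work at the same time."""
    return prod(parts)


def downtime_hours_per_year(availability):
    """Expected hours per year that the item is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR


chain = series_availability(NETWORK_CHAIN.values())
print(f"PLC in the cabinet: {downtime_hours_per_year(FIELD_PLC):.1f} h/yr down")
print(f"Networked VM host:  {downtime_hours_per_year(chain):.1f} h/yr down")
```

The point of the sketch is the shape of the result, not the exact values: a series chain is always less available than its weakest element, and every added box makes it worse.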

Let’s suppose you have on-premise cloud hardware. A reliable VM host (and UPS) is probably more expensive and power hungry than most of those PLCs, but let’s set that aside for a minute. Your latency will be significantly higher. Keep in mind that the network extent to the I/O is greater. Between the admin building and the site, many things can fail: flooded conduits, electrical faults, fires, or even a misidentified cable where someone disconnects the wrong thing. Oh, but there is redundant networking, right?

Here’s where I put on my hat of experience. I’ve seen redundant networks. Do you know where they get installed? They usually traverse the same conduits or conduit banks as the other network cable. So if something damages one cable, it will probably damage the other as well. It takes great discipline and some serious knock-down/drag-out fights to ensure that the conduit schedule allows for truly redundant networking paths.

Once in a blue moon you might encounter someone who really gets it and has the conduit schedule to make such networks work. But most of the time, they succumb to the whining from the finance, general contractors, project managers, and electrical firms that it’s just too expensive.
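The conduit problem can be sketched with the beta-factor approximation commonly used in reliability work, where some fraction of each path's failures are common-cause events that take out both paths together. The function name and the values of `q` and `beta` below are hypothetical, chosen only to show the effect.

```python
def pair_unavailability(q, beta):
    """Approximate unavailability of a redundant pair of network paths.

    q    -- unavailability of one path on its own
    beta -- fraction of a path's failures that are common-cause
            (same conduit bank, same flooded vault, and so on)
    """
    independent = ((1.0 - beta) * q) ** 2  # both paths fail separately
    common = beta * q                      # one event takes out both paths
    return independent + common


Q_SINGLE_PATH = 0.001  # assumption: each path is down 0.1% of the time

for label, beta in [("separate routes", 0.0), ("same conduit bank", 0.5)]:
    result = pair_unavailability(Q_SINGLE_PATH, beta)
    print(f"{label}: pair unavailable {result:.2e} of the time")
```

Under these assumed numbers the shared-conduit pair is hundreds of times more likely to be down than a pair on truly separate routes, which is why the conduit schedule matters more than the second cable.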

Let me reiterate one thing about using cloud resources: Do not use Internet-based cloud services for ICS! Internet cloud services are designed for serving traffic at high bandwidth to the rest of the Internet, but not necessarily to you. They are optimized for the resiliency of a web site.

Your ISP can have all sorts of local, regional, or even national failures, and even if the PLC software element is still running, it doesn’t matter; you’re still down. Furthermore, latency is not always low.

Many people confuse bandwidth with latency. Yes, you can communicate at 500 Mbps, so it must be good, right? WRONG! If it takes 50 mSec to get to the computation area and it takes 50 mSec to get a reply, you’re looking at a latency of 100 mSec. That’s not a good response time for motor controls. For reference, an AC cycle is either ~17 mSec (rounded up) in North America or 20 mSec in Europe. If you have a motor control fault, you’ll probably want other things to shut down as fast as possible. Most scan cycles of PLCs are in the single millisecond range. Local communications from the PLC to the I/O are also in milliseconds. Cloud usage for a PLC platform is a step backwards, even if the ISP is doing everything right.
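The arithmetic is simple enough to write down. This short Python sketch uses the 50 mSec one-way figure from the example above; the 5 mSec local scan time is an assumption standing in for "single millisecond range."

```python
ONE_WAY_MS = 50.0                # one-way delay from the example above
ROUND_TRIP_MS = 2 * ONE_WAY_MS   # request out plus reply back

# Duration of one AC cycle in milliseconds.
AC_CYCLE_MS = {"60 Hz": 1000 / 60, "50 Hz": 1000 / 50}

LOCAL_SCAN_MS = 5.0              # assumption: a typical local PLC scan time

for grid, cycle_ms in AC_CYCLE_MS.items():
    cycles = ROUND_TRIP_MS / cycle_ms
    print(f"{grid}: one round trip spans {cycles:.1f} AC cycles")

scans = ROUND_TRIP_MS / LOCAL_SCAN_MS
print(f"A local PLC completes about {scans:.0f} scans in the same time")
```

One round trip covers six AC cycles at 60 Hz and five at 50 Hz. None of that depends on bandwidth; a faster pipe does not shorten the trip.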

I’m going to bring up another argument that may sound awfully familiar: Internet cloud resources are not deterministic. Let me reiterate that: there is no guarantee that your latency will remain the same. It can get pretty high, especially if there are traffic storms between you and the cloud provider. I can’t stress this enough: Do not use Internet Cloud for an application like this.

Furthermore, loss of I/O communications isn’t the only concern. The administration building may lose power, but the process zone may still have it. Remember that many plants have substations and motor controls on various connections besides the one that feeds the administration building. Do you want your controller to have one power-loss mode of failure, or several involving various network hardware scattered across the plant?

Keep in mind that you’re comparing the reliability of Virtualization, as good as it may be, to a single hardened PLC lurking in a cabinet right in the middle of the process it controls. The PLC has a few minor failure modes. The I/O network back to the VM host can have MANY failure modes, many configurations where things can go wrong, more opportunities for hackable targets, and more cables between the controller image and the process it controls. Your mean time to repair (MTTR) will be higher with an extended I/O network to an on-premise cloud than it would be if you had a PLC sitting on the shelf ready to slap into a cabinet.
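The effect of MTTR can be sketched with the standard steady-state relation, availability = MTBF / (MTBF + MTTR). The failure interval and repair times below are hypothetical placeholders, and the comparison generously assumes both setups fail equally often.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: uptime divided by uptime plus repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures. Both cases get the same failure interval, so the
# only difference is how long it takes to find and fix the fault.
MTBF_HOURS = 50_000
spare_on_shelf = availability(MTBF_HOURS, mttr_hours=1)  # swap in a spare PLC
trace_network = availability(MTBF_HOURS, mttr_hours=8)   # hunt a network fault

for label, a in [("spare PLC", spare_on_shelf), ("extended network", trace_network)]:
    print(f"{label}: {(1 - a) * HOURS_PER_YEAR * 60:.1f} min/yr of downtime")
```

In practice the extended network also fails more often, so the real gap is wider than this equal-MTBF comparison suggests.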

Also, the process impact will be higher if the PLC loses communications with the I/O in the field. My hard-won experience of 30 years is to keep the PLC as close to the process it controls as possible. Use lots of little PLC systems rather than a single large one. This will limit the impact of a failure.

Okay, those are the disadvantages. What about the benefits? Patching might be easier, right? Uh, not so fast. You can patch today with two PLCs in the same rack. Most higher end platforms, such as the ControlLogix from Rockwell, have redundant processor managers. It is possible to upgrade the firmware on one processor, synchronize the application and logic states, and then transfer control to the new processor. We’ve had this ability for more than 30 years. It is well known and well understood.

Yeah, it sounds like a bit of work, right? Again, not so fast. You’ll be doing something on the plant anyway. Someone has to be in the field to confirm that the process is working correctly and that there are no problems. Remember, this is not about the PLC image continuity, this is about the Process continuity.

If you naively think that you can sit there in your air-conditioned Valhalla and push a change like this in virtual land, dream on. You’d better have operators standing by to handle any problems. Does the virtualization really save anything? I don’t see it. The only thing it changes is where you sit when you do this. I strongly recommend standing near the process, not several buildings away. YOU should be there to face the music if anything goes badly.

Another thing: Reverting the image of the PLC back to the older version may not clear up the problem. Some motors may not be able to restart without a sequence of other physical things happening. Meanwhile, the physics, chemistry, and biology are continuing to happen. The people who brag about being able to flip back and forth between old and new PLC images are completely missing the point. The PLC is not what’s at stake here. It’s all about the process resiliency, not the PLC resiliency.

Could you back up the PLC? Sure. You can do that now while it’s running, so no big benefit there either.

Could you step the PLC through each ladder rung of instructions? Sure. That’s true of both.

Well, what about IIoT and all that stuff? Again, how much network hardware would you like to see between the PLC and one of these remote devices?

Can you communicate with it better to gather more data? Yes. You can in the VM environment. But is that a good thing? Do you really need the ability to poke at all the state information points in the PLC? I contend that you really shouldn’t.

In reasonably modern PLC gear, such as the Bedrock Automation platform, you can actually designate which points you want to be readable, writable, or inaccessible from protocols such as OPC-UA.

Can the PLC execute faster in a VM? Almost certainly yes. However, execution time isn’t what we’re worried about. It’s latency getting in and getting out of the image. How much delay exists between a PLC in a rack versus an image in a VM host in the administration building?

Can the PLC VM image be more complex? Sure. But why would you want that? As an engineer, I live and breathe by the KISS principle. If something isn’t absolutely necessary, I won’t recommend or design it. Having super complex control strategies is not a good thing. The operators and engineers need to be able to predict what to expect next. To do that, you need a simple control scheme.

And finally, what ELSE is on that VM Host? How often does the virtualization software need to get patched? What other images are on there that might cause problems? What’s the real transfer time to a mirrored VM Host? How about a software-defined firewall or router? What happens if someone misconfigures the software-defined networking in the VM host, or gets confused between two PLC images and changes the networking on the wrong image? I’ve seen this happen on many office applications, and I see no reason to think that it can’t happen in these images.

I say all this as someone who was an early adopter of virtualization technology for plant HMI platforms. There are sound reasons why the HMI is a prime candidate for virtualization. But virtualizing a PLC is not a good thing. Virtualization adds a lot of extra complexity to a PLC image for very little return on the investment. I cannot think of a single reason to recommend this technology.

Ignore the fanboys. This concept is doomed.

http://www.infracritical.com

With more than 30 years of experience at a large water/wastewater utility and extensive experience with control systems, SCADA, RF and microwave telecommunications, and DNP Technical Committee membership, Jake still feels like one of those proverbial blind men discovering an elephant. Jake is a Registered Professional Engineer in the State of Maryland. He is currently a Senior ICS Security Engineer at Jacobs.