This blog post is an outgrowth of a topic I quickly waved my hand about at S4x20. Glenn Merrill reminded me that I hadn’t really followed up on it. It deals with the built-in self-integrity and diagnostic features found in most Programmable Logic Controller (PLC) gear.
First and foremost, the PLC vendors were thoughtful enough to provide these features and diagnostics for you. I advise you to make the best of these features to monitor the self integrity of the PLC and its peripherals.
Let’s discuss some of them:
A First Scan feature isn’t really a diagnostic. It is a behavior that nearly all PLC equipment has in some form. It is basically one or more flags, or a designated routine, that is executed on the first scan of a PLC after it “wakes up.” Read the instructions on the PLC carefully because the features differ from one PLC model to the next. The first scan features are not standard in any way, so pay attention.
Some PLCs may retain error codes from the last time the PLC faulted or shut down improperly. You may want to record those errors and then clear them. It might be a good idea to report those errors to the Human-Machine Interface (HMI) as informational data or perhaps to a syslog server, if those features and that infrastructure exist. The integrator/control systems engineering staff should know what to do with that error. If they ignore it, or dismiss it without investigation, that’s a big clue that you might want to have the PLC code reviewed by a third party or that you may want to seek the services of another integrator.
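As a rough illustration of the record-then-clear idea, here is a minimal Python sketch. Python is only standing in for whatever language your PLC uses, and the register names (`first_scan`, `last_fault_code`) and the list used as a log are made up for the example; your PLC will have its own.

```python
def handle_first_scan(registers, fault_log):
    """Record and clear the stored fault code, on the first scan only.

    `registers` is a dict modelling PLC memory (hypothetical key names).
    `fault_log` is a list standing in for an HMI tag or syslog sink.
    """
    if not registers.get("first_scan", False):
        return
    code = registers.get("last_fault_code", 0)
    if code != 0:
        fault_log.append(code)            # report before clearing
        registers["last_fault_code"] = 0  # clear so the next fault is unambiguous
    # Most real PLCs drop the first-scan flag on their own; modelled here by hand.
    registers["first_scan"] = False
```

The point of the ordering is that the code is reported before it is cleared, so a restart never silently erases the evidence.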
Next, while we’re used to thinking about what a controller does while it is running, we do not usually think about how it handles starting up with a running process. Yes, it is possible the PLC could be restarting during a running process. Make sure that the process it controls is restart-safe. If it isn’t practical to configure the PLC to restart safely, be sure that it alerts you to this fact and that it does not issue any new controls. Also, for that case, please ensure that the Standard Operating Procedures (SOP) have very clear instructions for setting the manual controls so that the PLC will start up properly. Note: This also includes configuration of remote I/O panels. I’ll have another blog soon about that topic.
Why does this matter for security? If something commands a PLC to restart in the middle of a working process we should expect the program to pick up smoothly with minimal disruption to the process. Anything else will become an attack vector.
Keep track of the PLC “key-switch” state. Too many people leave these keys in the REMote state, where anything on the network can cause the PLC to change state. This is not good practice unless the plant is in some kind of testing or development phase, in which case it should be isolated from higher levels of the network.
Most of the time the key-switch should be in the RUN mode. If it is not, make sure the operators know that it isn’t in that state, and put an alarm on the screen for them. If they know that someone is supposed to be working on that control system, they can acknowledge the alarm and move on. The HMI should be configured to re-nag them toward the end of the shift about the presence of the alarm. The goal should be to keep track of any staff or contractors on the plant doing work that might affect the process.
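The alarm, acknowledge, and re-nag behavior can be sketched in a few lines. This is only an illustration in Python; the position string `"RUN"`, the function name, and the 30-minute re-nag window are assumptions for the example, not values from any particular HMI.

```python
def keyswitch_alarm(position, acknowledged, minutes_to_shift_end,
                    renag_window_min=30.0):
    """Return True when the HMI should show the key-switch alarm.

    position: key-switch state as reported by the PLC (hypothetical strings).
    acknowledged: True once an operator has acknowledged the alarm.
    minutes_to_shift_end: time left in the current shift.
    """
    if position == "RUN":
        return False                  # normal state, nothing to show
    if not acknowledged:
        return True                   # not in RUN and nobody has looked yet
    # Acknowledged earlier: nag again as the shift draws to a close.
    return minutes_to_shift_end <= renag_window_min
```

For example, `keyswitch_alarm("REM", True, 240)` stays quiet, while `keyswitch_alarm("REM", True, 15)` raises the alarm again before the shift change.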
If the key switch moves when nobody is supposed to be working on the PLC, grab the biggest pipe wrench you have (I have many years of experience at a water utility where we had pipe wrenches the size of baseball bats), and go have a polite conversation in the most imposing manner possible.
The cycle time is the time it takes to compute each iteration of logic for the PLC. The iterations are the combination of Ladder Diagrams (LD), Function Block Diagrams (FBD), Instruction List (IL), and Structured Text (ST). These logic components may be joined together with the Sequential Function Charts (SFC).
This sounds quite complex, but LD, FBD, IL, and ST all execute very quickly. The time consuming part is probably the measures required to flush output to the local I/O and/or the Ethernet I/O and then read new I/O.
Modern PLCs will typically run most programs with cycle times under 10 milliseconds. I mentioned a rule of thumb in my presentation at S4x20: if the cycle time approaches the period of your region’s power line frequency (20 ms at 50 Hz, about 16.7 ms at 60 Hz), it is an indication that you should consider breaking control of the process up into smaller computational units and distributing them.
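To put numbers on that rule of thumb, the sketch below reads "power line frequency" as one line period, which is an interpretation on my part. The function names are invented for the example.

```python
def line_period_ms(line_hz):
    """Duration of one power line cycle in milliseconds."""
    return 1000.0 / line_hz


def cycle_time_fraction(cycle_ms, line_hz):
    """Fraction of one line period consumed by a single PLC scan."""
    return cycle_ms / line_period_ms(line_hz)
```

A 10 ms scan on a 50 Hz grid gives `cycle_time_fraction(10.0, 50.0)` of 0.5, so it uses half of the 20 ms line period. As that fraction creeps toward 1, it is time to think about splitting the work across controllers.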
Please do not take at face value what some vendors say about being able to run an entire plant on one processor. Can you do it? Sure, and it may even result in less coding. But is it a good idea? Not if you care about resiliency. Note that I have absolutely no stake in any automation vendor. I’m not trying to sell you more hardware. The goal I pursue is resiliency. There will be more about this subject in future blogs.
It is good to monitor and report PLC cycle times on a graph because a change is an indication that the PLC is doing something new or different. Most of the time, these programs will be very consistent and stable in their cycle time, or at least they should be. Think of it as the tachometer on a car or truck engine.
Reasons why it might change include the following: Someone activated a new program or is testing new code on a live PLC, a new process recipe is loading, or the PLC is using auxiliary logic to send messages or activate some feature.
If the cycle time of a PLC changes and there is no good reason for it, call the controls people in to investigate.
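One simple way to turn "the cycle time changed" into an alarm is to compare recent scans against a known-good baseline. The Python sketch below is an assumption-laden illustration: the 20 percent tolerance and the function name are arbitrary choices for the example, and a real system would tune them to the process.

```python
from statistics import mean


def cycle_time_changed(baseline_ms, recent_ms, tolerance=0.2):
    """Return True if the recent average scan time has drifted from baseline.

    baseline_ms: scan times (ms) recorded while the PLC was known to be healthy.
    recent_ms: the latest scan times (ms).
    tolerance: allowed relative deviation from the baseline average.
    """
    base = mean(baseline_ms)
    now = mean(recent_ms)
    return abs(now - base) > tolerance * base
```

The check is deliberately two-sided. A scan that suddenly gets faster can mean logic was removed or bypassed, which deserves a look just as much as a slowdown.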
If the PLC communications are going bad, you shouldn’t wait until the last minute to discover it the hard way. Frequently, there are warnings of slowly increasing error rates. This can help you to determine if the problem is a connector or cabling problem. There are registers in many PLCs for gathering error packet statistics.
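Error counters are usually cumulative, so the useful number is the error rate between two polls rather than the raw count. Here is a small Python sketch of that calculation; the argument names are hypothetical and do not correspond to any specific PLC's registers.

```python
def error_rate(prev_errors, prev_total, cur_errors, cur_total):
    """Errors per packet between two polls of cumulative counters."""
    delta_total = cur_total - prev_total
    if delta_total <= 0:
        return 0.0   # no traffic, or the counters were reset between polls
    delta_errors = max(0, cur_errors - prev_errors)
    return delta_errors / delta_total


def is_rising(rates):
    """True if each sampled rate is higher than the one before it."""
    return len(rates) >= 2 and all(b > a for a, b in zip(rates, rates[1:]))
```

Feeding a handful of successive rates into `is_rising` gives an early warning of the slow creep described above, well before the link actually drops.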
It is also important to monitor all ports. Ports that should be down should stay down. Ports that should be up and running should stay up and running. If they do anything else, an alarm should go off. So if there is now activity on a serial or USB port on the PLC when nobody was supposed to be working on the site, go investigate.
It is also a good idea to profile the traffic on the network and report the levels of traffic the PLC sees. Compare this with what the switch counters say for that port. There could be a lazy contractor or technician who has inserted another switch or cellular modem without alerting anyone to what they did. Not everything will be detectable from the switches, but the PLC itself shouldn’t lie to you.
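The comparison between what the PLC reports and what the switch reports can be as simple as the following Python sketch. The 5 percent tolerance and the names are assumptions for illustration; both counts need to cover the same time window for the comparison to mean anything.

```python
def counters_disagree(plc_rx_packets, switch_tx_packets, tolerance=0.05):
    """Return True if the PLC and its switch port report different traffic levels.

    plc_rx_packets: packets the PLC says it received in the window.
    switch_tx_packets: packets the switch says it sent to that port.
    """
    reference = max(plc_rx_packets, switch_tx_packets)
    if reference == 0:
        return False   # both idle, nothing to compare
    difference = abs(plc_rx_packets - switch_tx_packets)
    return difference / reference > tolerance
```

A small mismatch is normal because the two devices are never polled at exactly the same instant. A large one suggests something sits between the switch port and the PLC.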
With the advent of wireless I/O networks, you should take extra care to monitor the traffic on that network. If a new node shows up, be certain that it belongs there. If the signal to noise level decreases, alarm on it. If the received signal strength of a device drops or if the mesh paths change, alert people to what is going on. It means there may be something odd going on.
Do note that radio signals can and do change even with direct line of sight paths. The signals from up high on a tower may not always be at the same signal strength to a building 1 km away. Radio propagation issues are beyond the scope of this discussion, but do note that it is possible and that I have personally observed this happen. So this may not exactly be related to security, but it is part of what you’ll need to know about possible attack vectors.
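The new-node and signal-strength checks can be sketched as below. This is a Python illustration under stated assumptions: the node names, the baseline table, and the 10 dB threshold are invented for the example, and the threshold has to be loose enough to tolerate the normal propagation changes just described.

```python
def wireless_alerts(known_nodes, seen_rssi, baseline_rssi, max_drop_db=10.0):
    """Return alert strings for unknown nodes and large signal drops.

    known_nodes: set of node IDs that are supposed to be on the network.
    seen_rssi: dict of node ID -> current received signal strength (dBm).
    baseline_rssi: dict of node ID -> expected signal strength (dBm).
    """
    alerts = []
    for node, rssi in sorted(seen_rssi.items()):
        if node not in known_nodes:
            alerts.append(f"unknown node {node}")
        elif baseline_rssi.get(node, rssi) - rssi > max_drop_db:
            alerts.append(f"signal drop on {node}")
    return alerts
```

With known nodes `{"A", "B"}`, baselines of -58 and -70 dBm, and readings of -60, -85, and a stranger `"C"` at -50, the function reports a signal drop on B and an unknown node C.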
This is very basic, but you’d be amazed at how many people don’t pay any attention to this: Keep track of how long the PLC has been alive since boot. If the PLC has been restarted, you’ll know when the time since boot resets. Make sure the HMI alerts you to any sort of PLC restart like this.
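Detecting the restart is a one-line comparison between two successive uptime readings. A minimal Python sketch, assuming the PLC exposes a seconds-since-boot counter under some name of its own:

```python
def plc_restarted(previous_uptime_s, current_uptime_s):
    """Return True if the uptime counter went backwards between two polls."""
    return current_uptime_s < previous_uptime_s
```

Keep in mind that some uptime counters roll over after a long run, which looks the same as a restart to this check. Know the counter's width before trusting the alarm blindly.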
If your code was working fine, and suddenly does a divide by zero, what should happen? Most programmers will trap and ignore the issue as a math error and move on. Worse yet, they presume their code is perfect and let the PLC enter a hard fault state. Don’t do that. If something is communicating with you peer to peer from another PLC and you just got a zero when it wasn’t expected, INVESTIGATE. Something is going on. It is probably a programming or I/O error from somewhere, but you’d better identify it and investigate it before anything happens. While it is probably benign, it may also be someone poking around at places where they shouldn’t be.
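The middle ground between ignoring the error and faulting the controller is to keep running on a safe value while raising a flag that someone has to look at. Here is a hedged Python sketch of that pattern; the function name, the alarm list, and the fallback value are all assumptions for illustration.

```python
def safe_ratio(numerator, denominator, alarms, fallback=0.0):
    """Divide without faulting, and record an alarm on an unexpected zero.

    alarms: list standing in for an HMI alarm queue (hypothetical).
    fallback: value to return so the logic keeps running; pick one that
              is safe for the process being controlled.
    """
    if denominator == 0:
        alarms.append("unexpected zero divisor")  # surface it, do not hide it
        return fallback
    return numerator / denominator
```

The important part is the alarm, not the fallback. The trap keeps the PLC out of a hard fault, and the alarm makes sure the zero still gets investigated.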
This is a very short overview of some of the more common self-integrity features a PLC has. As I pointed out at my S4x20 presentation, the better the code in the PLC, the easier it will be to spot a problem and detect a hack. It seems that while few have any will to add these features, they can do a lot to improve availability and enable earlier detection of rogue activity. I know that it is not easy to do when you’re the integrator at the tail end of a large capital project. But if you don’t put these features in, who will?
And customers, especially those of you in the pipeline and water utility business, learn to ASK for these features.