Dale Peterson says let’s not continue to wave our hands about the use of Cryptography in the lower layers of control systems. I agree. He’s proposing that we build on Cryptography use cases as they are known now. That’s a start, but this is where most people reach the end of their knowledge and then find themselves in subtle and dangerous territory.
In this blog post I will elaborate on the obstacles as I know them. They’re not without solutions, but the solutions have little to do with the Cryptography itself. Rather, the industrial processes under control need extensive overhaul to be compatible with Cryptographic methods. This problem touches on some of the creepy crawly issues that many people are too comfortable leaving alone.
One of the good things about applying Cryptography to control systems is that it would force these latent and ugly issues to be resolved. But OT staff are not going to be able to resolve these things on their own. This is Engineering. Someone’s PE stamp may be on the documents or drawings. I do not recommend getting in the middle of this without the Engineers. Some of the choices may be the way they are for subtle reasons related to the process. If anyone violates the Engineer’s stamped documents or drawings, they may be held liable for any issues involving personal or public safety.
Before I say anything else, allow me to point out that if you’re using a public Certificate Authority (CA) to interface with the SCADA or ICS system, you’re doing this wrong. Use a private CA. A public CA is intended so that anyone can have an encrypted or an authenticated session with a remote node. However, in a control system we need to keep the public at bay. If you install every certificate into the great CA in the cloud YOU FAIL!
Nevertheless, Engineers are not concerned with making the Cryptography part of this work. They know it works and that there are people qualified to figure out a scheme that meets requirements. The problem is figuring out how to keep the plant or distribution system functioning resiliently across the new states and conditions that the Cryptography adds to the process. In other words, we must discuss maintenance, SOP, failure modes, staffing, logistics, and so forth.
In that spirit, what do we want a process to do when the Cryptography indicates a problem? In an office, the Cryptography fails, and then someone calls IT help desk and says, “the accounting program is broken.” Then someone at the triage help desk says “Have you turned it off and then back on again?” And then they go down a garden path of 20 more questions until, hours later, a person who knows their job well reads the security log and discovers a certificate that wasn’t renewed somewhere. Meanwhile, the office work stops. It’s no big deal. The access was denied, so no data was damaged.
However, on an automated system there is no person at the remote end to call a help desk and diagnose things. The first reflex is that something broke and maintenance staff are called. That’s right, a technician. If the Tech comes back with no trouble found, then the problem gets escalated to Engineering or OT depending on the nature of the problem. If they don’t find a problem, then it might get escalated to a security person. That’s a long time to resolve a problem. Meanwhile the process continues doing something, even if it isn’t optimal. So the first thing we need when implementing Cryptography is to add many alarms to the SCADA/ICS systems. People then need to be trained on what these alarms actually mean. Operations will write SOP documents to describe what they’re supposed to do with that information.
Let’s look at an example: An RTU’s authentication fails. It reports data but the Master front end processor won’t pass the traffic because it doesn’t authenticate. What does this do? Well, the operator displays will be stuck reading the same numbers from that RTU. There may or may not be any flags on that data value. There won’t be any communications errors because the RTU is communicating.
If authentication is used, that failure to authenticate should cause a flag that is displayed and alarmed in front of the operator. Any derived calculations such as a state estimator or mass balance model need to be aware of the data quality. It is likely that the models or state estimator need to be updated regarding what they should do if one of the values is flagged as suspect. It is work, but it’s not a big deal. There should already be provisions for dealing with flagged data. After all, there are outstation flags if a value is forced or lacks the correct time, and those should pass through to the model. Well, actually, they often don’t. Older OPC DA drivers or I/O configurations may not have a proper way to map these flags. This is where a lot of hand waving starts. Proponents of Cryptography often treat this as a non-issue, if they address it at all. So this will require some serious review. But this is just the beginning.
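To make the flag idea concrete, here is a minimal sketch of a quality bit for failed authentication that rides along with the value into a derived calculation. Everything in it is an assumption for illustration: the `Quality` bits, the `Point` type, the tag names, and the `mass_balance` function are hypothetical and do not come from any particular protocol or SCADA product.

```python
from dataclasses import dataclass
from enum import Flag


class Quality(Flag):
    """Hypothetical quality bits; real protocols define their own flag sets."""
    GOOD = 0
    FORCED = 1
    TIME_INVALID = 2
    AUTH_FAILED = 4      # assumed new bit, raised when the front end rejects the RTU


@dataclass
class Point:
    name: str            # hypothetical tag name
    value: float
    quality: Quality = Quality.GOOD


def on_auth_failure(point: Point) -> Point:
    """Keep the last value on the display, but mark it so nobody reads it as live data."""
    return Point(point.name, point.value, point.quality | Quality.AUTH_FAILED)


def mass_balance(inflows, outflows):
    """Return (imbalance, quality); the result inherits every flag of its inputs."""
    quality = Quality.GOOD
    for p in [*inflows, *outflows]:
        quality |= p.quality
    imbalance = sum(p.value for p in inflows) - sum(p.value for p in outflows)
    return imbalance, quality


inlet = Point("FT-101", 120.0)
outlet = on_auth_failure(Point("FT-202", 118.5))   # stale value, RTU failed to authenticate
imbalance, quality = mass_balance([inlet], [outlet])
suspect = bool(quality & Quality.AUTH_FAILED)      # True: the model result is suspect too
```

The point of the sketch is only that the derived number carries the flag with it. What the model should actually do with a suspect input (hold, substitute, estimate, alarm) is still a decision for the Engineers.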
What if the problem is with an encryption key at an RTU? There need to be local programs that will take over after a certain timeout of not having a valid connection. The older programs may have derived that from the connection state diagnostics in the automation. But the connection state may still show as good. Note that the network connection isn’t the problem. So if the cryptographic key fails we may want to do something that might secure the site in case of hostile action. Note that in-line encryption often does not convey anything back to the SCADA system that it may be having problems. We need to ensure that the encryption conveys a status (a dry contact closure would be enough) to the RTU so that the operations staff will call the correct people to deal with the situation.
Basically the remote device needs to have a program that will continue to gather event data and issue controls based upon some predetermined schedule or operating procedure. Such procedures may exist for loss of communications. The reaction to failed Cryptography may be very similar, but it is likely that the Engineers will want to take a few extra measures, given that this is potentially hostile behavior. In other words, this is going to take careful review.
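Here is a rough sketch of what such a local program might key on: a watchdog that watches both the link state and a separate Cryptography health input (for example, that dry contact from the in-line encryptor), and records why it fell back so the alarm can send the right people. The class name, the 30 second timeout, the reason strings, and the automatic return to remote control are all assumptions of mine, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    REMOTE = "remote"                    # master station issues controls
    LOCAL_FALLBACK = "local_fallback"    # RTU runs its predetermined procedure


@dataclass
class FallbackWatchdog:
    timeout_s: float                     # how long to wait is an Engineering decision
    last_valid_s: float = 0.0            # last time link AND crypto were both healthy
    mode: Mode = Mode.REMOTE
    reason: str = ""                     # why we fell back, for the operator alarm

    def update(self, now_s: float, link_up: bool, crypto_ok: bool) -> Mode:
        if link_up and crypto_ok:
            # Assumption: recovery is automatic. A real site may require a manual reset.
            self.last_valid_s = now_s
            self.mode, self.reason = Mode.REMOTE, ""
        elif now_s - self.last_valid_s >= self.timeout_s:
            self.mode = Mode.LOCAL_FALLBACK
            self.reason = "comms_lost" if not link_up else "crypto_failed"
        return self.mode


wd = FallbackWatchdog(timeout_s=30.0)
wd.update(now_s=0.0, link_up=True, crypto_ok=True)     # healthy
wd.update(now_s=10.0, link_up=True, crypto_ok=False)   # key problem, still inside the timeout
wd.update(now_s=45.0, link_up=True, crypto_ok=False)   # timeout exceeded while the link looks good
```

After the last call the watchdog is in `LOCAL_FALLBACK` with the reason `crypto_failed`, even though the link never dropped. A program that only watched the connection state would still think everything was fine.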
My experience is that those loss of communications operating procedure programs are not reviewed as frequently as we’d all like to see. Many aren’t even tested on a regular basis. In fact, they may not be able to do much of anything well except probably leave things the way they are until something trips. Thus, even if procedures exist for loss of communications, the Engineers will still insist on a review, perhaps not because of the Cryptographic issues, but because they know these things aren’t tested or reviewed nearly as much as they should be. This is an example of the creepy issues in many control system designs I mentioned earlier.
Another example: Let’s say that field access to a key server is quietly severed, and then a session key expires a few hours later. What should a critical controller do? How can it communicate? Should it continue to use the old session key? Should it continue in Read-Only mode in the clear? There are good and valid reasons for each of these possibilities. And no, I do not think a Cryptography expert should determine which one of these possibilities is reasonable. That is an Engineering decision.
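One way to keep that decision in the Engineers’ hands is to make it an explicit, reviewable setting per site rather than something buried in a vendor default. The sketch below assumes three hypothetical policies matching the options above, plus a made-up site name. It shows only the shape of the idea, not a real product’s configuration.

```python
from enum import Enum


class ExpiredKeyPolicy(Enum):
    KEEP_OLD_KEY = "keep_old_key"          # keep talking on the expired session key
    READ_ONLY_CLEAR = "read_only_clear"    # report in the clear, refuse all controls
    NO_REMOTE_ACCESS = "no_remote_access"  # stop talking, run on local logic only


def permits(policy: ExpiredKeyPolicy, key_valid: bool, is_write: bool) -> bool:
    """Decide whether a remote request is honored under the site's chosen policy."""
    if key_valid:
        return True                        # normal operation, the policy does not apply
    if policy is ExpiredKeyPolicy.KEEP_OLD_KEY:
        return True
    if policy is ExpiredKeyPolicy.READ_ONLY_CLEAR:
        return not is_write                # reads pass, controls are refused
    return False


# Hypothetical per-site assignment, signed off by the responsible Engineer.
SITE_POLICY = {"PUMP-STATION-7": ExpiredKeyPolicy.READ_ONLY_CLEAR}

policy = SITE_POLICY["PUMP-STATION-7"]
can_read = permits(policy, key_valid=False, is_write=False)    # True
can_write = permits(policy, key_valid=False, is_write=True)    # False
```

Which policy goes in the table for which site depends on the process, and that is exactly why it belongs to Engineering.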
In another example: What should the control system do if a user ID fails to validate? Should we allow for semi-auto operation with PLC gear that won’t be able to communicate with other PLCs or an HMI system? Let me suggest two scenarios for this example: Is the problem with keys because someone is using a badge from an employee who is supposed to be on strike, or is it because someone’s access card got damaged and cannot validate properly? How can we help everyone else understand what is going on without having to bring in half of the company to make policy all over the situation? In the former case, you wouldn’t want the controls to be useful for someone who was supposed to be on strike. But in the latter case you probably would. We need policies and contingencies for handling both these scenarios.
Another consideration is how a technician confirms that all is well before leaving the site. On an unencrypted network they can watch the traffic and see the values fly by. But what do they do if it is encrypted? How do they confirm that the smart instrument is scaled properly? How many keys are they going to have to run around with just so that they can replace an instrument that was destroyed by lightning, flood, or fire? How do we track those keys the technician has with provenance?
What do we do if we can’t communicate with that field technician? A lot of remote sites do not have good communications. Shall we issue temporary keys to the technician and then remotely rekey the site when Vinz Clortho the Keymaster meets with Zuul the Gatekeeper on Monday morning? Can we be certain of a chain of trust and provenance to guarantee that no extra keys were installed at the site?
This is why Control Systems Engineers like me get grouchy when people tell me that it is feasible and that encryption is great. The refugees from IT security are eager to show us fuddy-duddy old Engineers that it can work. But if you ask those proponents of Cryptography what should happen when it indicates a problem, all we’re likely to get is a puzzled look like my dog gives me when I talk to him. They’ll probably say it should do nothing. That’s the wrong answer. Unlike office applications, there are physics, chemistry, and biology happening while the control system is busy denying access to others for control. The control system needs to have a strategy for making things safe while the access issues are resolved. The proponents need to realize that there is more to the problem than just the application of Cryptography. If it were that trivial the control system applications would have been early adopters.
The reason Cryptography hasn’t made it into the lower layers of an Industrial Control System is not because we don’t trust the Cryptography, but because there are many details to be resolved, staff to be trained, and operations practices to be developed. That is where the real expense is. Nevertheless, I still think these efforts are worth doing. However, if you want to push toward Cryptography applications in the lower layers of ICS, don’t go there without solid backup from senior Engineers who know how this stuff works and can develop the control strategies almost nobody acknowledges are needed. It might also be wise to talk to HR about personnel policies and trust. And above all, do not underestimate the work that everyone else will have because of this upgrade. In many cases it might be easier to scrap the entire ICS or SCADA system and start over from scratch. It is going to be expensive and the scope is going to be much larger than just Cryptography.
And when all that is said and done, we need to build a business case for doing this and have hard answers for why it is needed. Otherwise, executives will find other places to invest their money.