*WARNING* These are real incidents from the field and could scare you. A lot! *WARNING*
The plant still looked new after 3 years of operation, nothing like its sister plant, which had a 25-year history and was still going strong. New equipment, new technology and light everywhere. Just this bright environment with the modern roof and lighting systems already made a big difference. Built for the future, and yet 4 steel presses were standing still. The much older presses in the sister plant were running at full load to compensate for the lost capacity. What happened?
You might have guessed it: a serious cyber security breach, and a very expensive one!
The plant was built around flexible and highly automated smart equipment, basically an operational advertisement for Industry 4.0, LEAN and Agile. Flexible cells instead of complicated production lines. Fully integrated systems all around. But the highlight was the four 4-stage steel presses. The crown on all efforts to make the plant flexible, these presses were designed from the first moment to switch tools and processes rapidly and with high flexibility. Their major innovation compared to classical steel presses is the loading bays for tools and the capability to automatically load and offload tools on a running press!
No more stopping the machine and spending hours changing the tools manually before the process could be restarted. No more tools that are linked for the 4 stages of the press. These presses could load and offload the tools for each stage individually! Magnificent, and we were sure it would be a great pleasure to see them operate and perform a tool change on the fly. If only they hadn't been stopped, and one of them severely damaged, due to an accident during a tool change.
We found the possible cause very quickly because the plant team had already told us what happened; our task was to investigate whether this was the actual cause of the accident and the loss of production capacity. The plant was highly automated and so was the tool change process, which had to be synchronized with the production process and the controls of the press to be able to do it on the fly.
And that’s where it went wrong. The systems to control these processes and provide the data were still “under development”, as the Plant Manager described it. Nobody understood why or how, but the main system and its redundant fail-over backup system were doomed. Ransomware infection! In an industrial network behind several layers of defense… Of course the databases were backed up and it would take little effort to rebuild the systems; that wasn’t the problem.
The problem was that one of the presses had crushed its tools and damaged several subsystems during what should have been another automated tool change on the fly. Our task was to determine and prove whether the ransomware infection of the system had caused the damage.
The insurance company and the Corporate IT Department had agreed to bring in a certified Systems Auditor to determine how these systems could have been infected with ransomware, and the report was available faster than we had hoped. The first problem detected was that the systems had been provided by the vendor during the pilot phase and, after that, simply migrated to production without ever going through the mandatory security lock-down and standardization. As a result, the systems didn’t have the standard anti-virus client installed and some other crucial security measures failed.
The report also uncovered how the malware reached the system. Physical access was restricted because the systems were mounted in a server room, so the various team members used a chain of remote desktop connections to access them. Analysis of access to the infected systems showed that many files had been transferred using the file transfer capabilities of the remote desktop connection. Because the end-point had no anti-virus client installed, and the anti-virus clients on the remote desktop hosts don’t trigger as long as the files are not stored locally, this explained how the malware slipped through the safety net.
So we knew how the malware was able to infect the systems, but we still had no evidence that it had caused the press to crash the tools and subsystems. Not so fast when you think “of course it did!”; there are several systems, fail-safes and security features involved here. The systems that were hijacked by cyber criminals merely provided the technical data about the tools and the sequence on the loading bays. Dimensions, positions of the grippers, that kind of thing. The whole process of actually loading and offloading was handled by other systems and monitored by a chain of sensors and logic. The hijacked systems were more or less a smart database, translating part numbers into technical data for the automated process.
No, the ransomware itself could not have caused this problem. System failure, no data available, end of story. Well, that’s the theory. Reality looked different, and the plant estimated that they would need at least another 3-4 shifts to complete all the repairs to the press and its subsystems. The damaged tools were scrap, beyond any hope of repair.
Root cause analysis
We started with the following assumption: the systems got infected and hijacked, and they either provided wrong data or no data at all. We tracked this down with the vendor of the first system relying on data from the infected systems. We quickly found that the interface is a straightforward database connection. The logfiles showed that the last data received for the previous tool change was valid, and after that the logfile showed “No response” error entries. So it appeared that corrupt data was not the root cause.
Further down the chain of systems and controls, we found the same event entries in the logfiles. We were rather clueless about what had happened at a technical systems level, because all systems and their vendors showed us that nothing went wrong. All systems had detected the error and received no further data from the infected system. The press simply should not have been able to perform the failed tool change, because none of the systems had received the required data.
Was this going to be one of those rare cases where it seems so obvious what the cause was, but there is no evidence to support that theory? It started to look like it. Until we decided to have a closer look at the error handling of the involved systems. We found that the system which handles the loading and offloading bays for the tools sets the tool code to “NULL” when no data is available, a standard inherited from a previous project. We also found that the system responsible for the actual loading and offloading of the tools interprets tool code “NULL” as “bay empty”, because the tool code is automatically scanned when a tool is placed on the bay. To make it even worse, the controls of the press itself interpret tool code “NULL” as “no tool present at the location”.
We had now established that the various systems had different interpretations of the value NULL as the code for the tool. A brainstorming session around the flipchart we placed by the machine (yes, I am a GEMBA person in all I do!) with the vendors and technicians made crystal clear what had happened.
The system responsible for providing tool data based on the product code went down, and so did its backup system. The loading and offloading bay responded to not receiving data by setting the tool code to NULL without interrupting the process. The loading and offloading system interpreted NULL as an indicator that there was no tool on the bays and therefore didn’t respond to the instructions to change the tools. The same NULL was interpreted by the press controls as no tool present at the location. Once we had established that, and all vendors had confirmed it, the vendor of the press took over the root cause analysis.
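The failure chain above can be sketched in a few lines of code. This is a minimal illustration, not the plant’s actual software: all function and variable names are hypothetical, and the real systems were separate vendor products talking over a database interface, not Python functions.

```python
# Hypothetical sketch of the failure chain: three subsystems each give
# their own meaning to the same missing value (tool code "NULL").
# All names are illustrative, not the actual vendor interfaces.

TOOL_DB = {"P-4711": {"height_mm": 450, "gripper_offset_mm": 120}}

def tool_data_service(product_code, online=True):
    """Central 'smart database' (the system hit by ransomware)."""
    if not online:
        return None                      # system down -> no response
    return TOOL_DB.get(product_code)

def loading_bay(tool_data):
    """Bay controller: inherited standard maps 'no data' to code NULL."""
    if tool_data is None:
        return "NULL"                    # intended meaning: "no data available"
    return tool_data

def tool_handler(tool_code):
    """Load/offload system: reads NULL as 'bay empty', so it does nothing."""
    return "skip tool change" if tool_code == "NULL" else "change tool"

def press_controls(tool_code):
    """Press: reads NULL as 'no tool mounted', so grippers go to empty position."""
    return "grippers to empty position" if tool_code == "NULL" else "align grippers to tool"

# Central system and its backup are down:
code = loading_bay(tool_data_service("P-4711", online=False))
print(tool_handler(code))    # -> skip tool change (the tools stay mounted!)
print(press_controls(code))  # -> grippers to empty position (press crushes them)
```

One value, three meanings: “no data available”, “bay empty” and “no tool mounted”. Each interpretation is defensible in isolation; combined, they let the process continue with tools still on the press.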
The press levels the product side of the top and bottom tools to make sure the product moves level between the tools. This is done by moving the bottom tool downward and the top tool upward, according to the dimensions of the tool. The dimensions of the tools are provided by the chain of systems, starting with the systems that were hijacked by ransomware. But tool code NULL is interpreted as no tool present, so the press doesn’t align the grippers to the tool and moves to the empty position. The same NULL that made the loading and offloading bay believe there was no tool present, and the loading and offloading system believe there was no tool to be handled.
In reality, there were tools mounted on the press. Four of them, at the moment the press crushed them with brutal force, the same brutal force it is designed to form steel chassis parts with. This time the tools, which can vary in height between 20 cm and 100 cm, were still mounted on the press without the press controls being “aware” of it.
A lot of cyber security hygiene failed in this case. The infected systems, the backdoor of file transfers through remote desktop, the missing lock-down and anti-virus client, all of it. But much worse was the fact that the systems involved had different interpretations of the same status, with very costly consequences, and that was not related to the ransomware infection. We could very easily simulate this situation on the other presses, which were still not receiving data (including dimensions!) of the tools from the central systems. This time, without tools mounted on the press, we simply started the process knowing there was no data available. NULL everywhere, and the press returned the grippers to the NULL position: no tool present.
Based on our findings and the confirmations from the various vendors, the insurance company proposed a shared-responsibility solution which all parties involved gladly accepted. The damage to tools, press and subsystems was estimated at 1.300.000 €, not including the cost of lost production capacity, which the company decided to carry themselves.
Based on the findings, the Corporate IT Department implemented several corrective actions. A security policy was pushed to disable file transfer through remote desktop connections on all existing and new systems. The monitoring system, which had successfully detected the non-standard systems in the network, was changed so that it creates a critical ticket for IT whenever it detects such a system. They also took away the right to set a warning to resolved without creating a ticket, which was found to be the way the alarms about these systems had been handled. Several other corrective actions were taken without much discussion.
The team responsible for the implementation of these otherwise fantastic automatic loading and offloading systems had a lot of homework to do. Check all interpretations and handling of signals and codes, and validate that they are handled in the same manner throughout the process. All work that should have been done long before the system was commissioned. The most important and difficult lesson was to break the cycle of only doing positive testing. Simply disconnecting the network cable would already have made the flaws visible, but in the enthusiasm of creating something fantastic, they were never triggered to test for simple failures. An expensive mistake!
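The negative-testing lesson can be sketched as well. The idea below is an assumption about how such a check could look, with hypothetical names: instead of only verifying the happy path, deliberately cut off the data source (the software equivalent of pulling the network cable) and assert that the process refuses to continue rather than silently defaulting to NULL.

```python
# Hypothetical negative test: deliberately simulate "no tool data available"
# and assert the tool change aborts instead of proceeding on a default value.
# Names are illustrative, not the plant's actual software.

class NoToolDataError(Exception):
    """Raised when the central tool data source cannot be reached."""
    pass

def safe_loading_bay(tool_data):
    """Safer variant: missing data raises instead of silently becoming NULL."""
    if tool_data is None:
        raise NoToolDataError("central tool data unavailable - abort tool change")
    return tool_data

def test_tool_change_aborts_without_data():
    # Simulate the disconnected network cable / infected central system.
    try:
        safe_loading_bay(None)           # no data, as in the incident
    except NoToolDataError:
        return "aborted safely"
    return "proceeded without data"      # this outcome is the expensive bug

print(test_tool_change_aborts_without_data())  # -> aborted safely
```

The design choice matters more than the code: a missing value should be an explicit error state that stops the process, never a magic code that downstream systems are free to reinterpret.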
Cyber Security is not just about keeping the bad guys out. Cyber Security is also about critical thinking and finding the flaws before they cost real money!
- Back to the Future Cyber Security – A manifesto for Cyber Security and the Industrial Legacy
- Back to the Future Cyber Security – All updates