*WARNING* These are real incidents from the field and could scare you. A lot! *WARNING*
We talk a lot about what could go wrong with Cyber Security and what could be done to prevent it. Let’s have a look at what has already happened and what should have been done to prevent it, just to make sure nobody is left with the feeling that it ain’t that bad in reality. These are examples where internal and external safety measures failed completely, with dramatic consequences. All of these incidents occurred in the last 12 months, and they could have ended much worse.
As a consultant for Applied Information Systems, Industrial Automation and Cyber Security, I was involved as an expert on behalf of the insurance companies, analyzing the failures that led to these horrible accidents. In some cases, I could understand how a potential risk was overlooked because the weaknesses had not received the right focus. In several other cases, I was shocked that such clear and visible vulnerabilities had not been fixed long before they led to dangerous accidents.
The packaging robot received an important safety upgrade after the project failed a safety audit over a very serious issue. The robot area was, of course, surrounded by a safety fence, but there was a huge design flaw: when the operator opened the gate to the robot area, the robot kept working and moving. The engineers should have taken care of this in the design, but at least the issue was discovered before the robot cell was commissioned. Otherwise, serious accidents could have happened!
Solving the issue was relatively simple and low cost. The lock of the gate was replaced with a safety lock with a Normally Open Switch, which was connected to the robot control unit. The gate had to be closed and the lock sealed; otherwise, the safety loop would cut the power to the robot. The safety lock could not be sealed from the inside, so even closing the gate behind you after entering would not allow the robot to resume its operations. Additional safety logic was implemented as well. A spring closed the gate automatically after it was opened, and the gate had to be opened twice to reset the interruption of the safety loop. If the gate was kept open for longer than 30 seconds, a full reset of the safety loop was required, so that the safety circuit could not be tricked. Even if the cable between the lock and the robot control unit was damaged or cut, the safety loop would interrupt the robot, because with the Normally Open Switch the robot could only run while the loop was actively closed.
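The gate logic described above can be modeled as a small state machine. The Python sketch below is purely illustrative and rests on assumptions: the class, method and field names are hypothetical, "opened twice" is read as one opening to enter and one to leave, and the "full reset" is assumed to be a separate deliberate step. In a real cell this logic lives in certified safety hardware and the robot controller, not in application code.

```python
from dataclasses import dataclass
from typing import Optional

FULL_RESET_AFTER_S = 30.0  # from the article: gate open > 30 s requires a full reset


@dataclass
class GateInterlock:
    """Hypothetical model of the gate interlock logic (illustration only)."""
    loop_closed: bool = True            # True while the gate is closed and the lock is sealed
    openings: int = 0                   # gate openings since the loop was last interrupted
    opened_at: Optional[float] = None   # time the gate was opened, None while it is closed
    needs_full_reset: bool = False      # latched when the gate stayed open too long
    cable_ok: bool = True               # False if the wiring to the controller is cut

    def open_gate(self, now: float) -> None:
        # Any opening breaks the loop, so the robot must stop immediately.
        self.loop_closed = False
        self.openings += 1
        self.opened_at = now

    def close_gate(self, now: float) -> None:
        # The spring closes the gate; closing alone never re-arms the loop.
        if self.opened_at is not None and now - self.opened_at > FULL_RESET_AFTER_S:
            self.needs_full_reset = True
        self.opened_at = None

    def seal_lock(self) -> None:
        # Sealing (assumed possible only from outside) re-arms the loop only when the
        # gate is closed, was opened twice (enter + leave) and no full reset is pending.
        if self.opened_at is None and self.openings >= 2 and not self.needs_full_reset:
            self.loop_closed = True
            self.openings = 0

    def full_reset(self) -> None:
        # Hypothetical deliberate reset step that clears the "open too long" latch.
        self.needs_full_reset = False

    def robot_may_run(self) -> bool:
        # Fail-safe: run only while the loop is actively closed AND the wiring is intact.
        return self.loop_closed and self.cable_ok
```

A typical visit is `open_gate`, `close_gate` (spring), `open_gate` (leaving), `close_gate`, `seal_lock`. As the rest of this story shows, such logic only protects anyone while it is actually present in the controller.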
The updates were tested extensively before the auditor was invited to test the robot cell again. The auditor also tested the new solution extensively and found no flaws. The improved robot cell passed all positive and negative tests. Thumbs up, the cell was put into production.
At first, the operators were somewhat skeptical about trusting the safety precautions, but after a while it became the normal routine whenever work had to be done within the safety area of the robot cell. Open the gate, the robot stops, do what needs to be done. Leave the cell, close the gate and seal the lock, the robot continues. No need to restart the robot, no need to sync the processes. This actually worked great, especially considering that the operators enter the safety area twice per hour on average.
And then it happened. The operator had to remove scrap material that had fallen off the side of the conveyor feeding the robot. A routine operation. The operator opened the gate, the robot stayed in its home position, the gate closed automatically behind the operator, everything the way it was supposed to be. At least that is what the operator thought.
While he was picking up the scrap material, the robot left its home position at full speed and, on its way to the conveyor, hit the operator in the back and legs. You might call it luck that the impact made the badly injured operator fall forward onto the floor; otherwise the robot would have hit him again on its way back. The operator shouted for help, but his colleagues couldn’t hear him in the loud factory. Some time later, with the robot still running as if nothing had happened, the supervisor made his round and noticed a lot of scrap material in the cell and “something strange in the corner”.
When he opened the gate, the robot kept moving. Fortunately, the supervisor noticed this. Luckily the robot wasn’t waiting in its home position at that moment, otherwise the supervisor might not even have noticed the dangerous error. The supervisor didn’t hesitate and immediately did the right thing: Emergency Shut Down button! The seriously injured operator was found and brought to the hospital. It took him 2 months to recover completely from his injuries. Despite his terrible accident, the operator considers himself a lucky man. Lucky because the robot only hit him while moving towards the conveyor: the robot’s path from the conveyor back to the loading station runs about 1 meter higher. High enough to hit his head with full force…
Root cause analysis
The responsible technician was flabbergasted when investigating where the safety loop went wrong. Not because he couldn’t find the error or didn’t understand what had happened. He was shocked that he couldn’t find the safety loop in the robot control unit at all. The wiring was the way it was supposed to be and the signal worked the way it was supposed to, but the program loop was missing, as if it had never been there. Later the technician said he felt as if he were in one of those science fiction movies where someone is suddenly catapulted into a parallel dimension where things are simply different.
We investigated the installation and pretty quickly got a feeling for what might have happened. Within 30 minutes, we saw 3 technicians connecting their notebooks to the neighboring robot cell, which was still under construction. Judging by the logos on their boiler suits, two were technicians from different external companies and one was from the factory itself. We didn’t see them enter passwords or perform any other kind of authentication. They simply plugged in the cable and did whatever they were doing.
The robot supplier’s expert we called in arrived the next day and quickly confirmed our suspicions. All access control and authentication mechanisms had been switched off. Even the key-lock designed to prevent access to the programs and settings of the robot control unit had been disabled. He also discovered that 2 days before the horrible accident, the currently active version of the program had been uploaded to the robot control unit and activated. This version was significantly older than the version containing the safety loop for the gate; we assume it was the version that had failed the audit… The safety loop hadn’t disappeared by magic. The program containing it had simply been replaced by an old version, with dramatic consequences.
Of course, we looked at the procedures and standards the company had implemented for making modifications to the Industrial Control Assets. This was a very short session, because there were none. The review of the version control system for Industrial Control Assets was just as brief, since it didn’t exist either. The responsible HR Manager was unable to provide an overview of the technicians’ training, and the responsible Maintenance Manager was honest enough to admit that he didn’t know who had the access and the tools to modify the Industrial Control Assets.
Based on our findings, the insurance company concluded that the costs would not be covered by the insurance policy. The trade union and the workers’ council demanded a full inspection of all safety circuits and of how they were protected against such critical mistakes. The injured operator successfully claimed compensation from the company. The total damages and costs for the company were estimated to exceed € 250.000, not including the costs of corrective measures throughout the factory.
The available access control features could have prevented unauthorized access to the robot control unit, and a proper version control system could have prevented the wrong version of the program from being installed and activated. Regular training could have made the personnel involved aware of their responsibilities and the risks.
One human error piled on top of another ended with a technician doing something very stupid and an operator getting severely injured as a result. We could of course put the blame on the technician, but the problem in this case started with a failure of leadership. All of this could have been prevented, but unfortunately there wasn’t even the slightest attempt to do so.
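The two controls named above, access control and version control, can be combined into a simple check before any program is activated. The Python sketch below is a minimal illustration under assumptions: the registry of approved releases, the per-cell list of authorized users and the function names are hypothetical, and it assumes the program can be exported as a file whose SHA-256 fingerprint is recorded when a release passes the audit.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovedRelease:
    """Hypothetical record kept in a version control system for one robot cell."""
    version: int   # release number that passed the safety audit
    sha256: str    # fingerprint of exactly that program file


def fingerprint(program: bytes) -> str:
    """Return the SHA-256 hex digest of a program file's contents."""
    return hashlib.sha256(program).hexdigest()


def check_activation(
    user: str,
    cell: str,
    program: bytes,
    approved: dict[str, ApprovedRelease],
    authorized: dict[str, set[str]],
) -> tuple[bool, str]:
    """Decide whether `user` may activate `program` on robot cell `cell`."""
    # Access control: only people explicitly listed for this cell may make changes.
    if user not in authorized.get(cell, set()):
        return False, f"{user} is not authorized to modify {cell}"
    # Version control: an approved release must be on record for the cell.
    release = approved.get(cell)
    if release is None:
        return False, f"no approved release on record for {cell}"
    # The program must be byte-for-byte the approved release, not an older copy.
    if fingerprint(program) != release.sha256:
        return False, f"program does not match approved release v{release.version}"
    return True, "ok"


if __name__ == "__main__":
    v2 = b"program v2 with gate safety loop"  # stand-in bytes, not a real robot program
    v1 = b"program v1 that failed the audit"
    approved = {"cell-1": ApprovedRelease(version=2, sha256=fingerprint(v2))}
    authorized = {"cell-1": {"tech.a"}}
    print(check_activation("tech.a", "cell-1", v1, approved, authorized))
```

With a check like this, the old program that failed the audit would not match the fingerprint of the approved release and would be rejected, and an unlisted technician would be refused even when carrying the correct program.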
Cyber Security is not just about keeping the bad guys out. Cyber Security is also about preventing internal unauthorized access!
- Back to the Future Cyber Security – A manifesto for Cyber Security and the Industrial Legacy
- Back to the Future Cyber Security – All updates