*WARNING* These are real incidents from the field and could scare you. A lot! *WARNING*
Liquid aluminum has three very dangerous characteristics. It is extremely hot, it is aggressive to most materials, including other metals, and it reacts explosively on contact with water. Three reasons to be extremely careful when handling liquid aluminum. Safety precautions are important, ranging from protective clothing and masks for operators, through safety procedures and training, to safety circuits in the controls of melting furnaces and of the equipment involved in transporting the aluminum.
When we entered the melting area to inspect the damage and find a possible explanation for the accident, we had the feeling that we were entering a war zone. A heavily damaged melting furnace that was tilting sideways, stuck somewhere between completely tilted and its vertical position. A burned-down forklift. Another forklift thrown aside as if it weighed nothing. Four heavily damaged casting machines and a lot of torched equipment. At the same moment we both said “it is a miracle that nobody got seriously injured”. We both had to let it sink in for a while before we started to take a closer look at the damage.
I made clear that I thought this was more a case for the fire forensic team, because I couldn’t imagine that this disaster could have anything to do with Cyber Security. I was wrong: the preliminary report indicated that the safety controls had failed during the filling of a ladle, and there was no technical explanation for that failure. The problem was that there was no ladle in sight during the accident and, according to the responsible supervisor, production hadn’t even started yet and nobody was near the furnace when the accident happened. All this was confirmed by the video recordings of the surveillance cameras. So what did happen?
First you need to understand how the filling process works. To fill a ladle with liquid aluminum, this type of furnace is tilted. It might be hard to imagine when you have never seen this before, but this is how many large-capacity melting furnaces operate. Massive hydraulic systems make a gigantic melting furnace tilt over so the liquid aluminum pours from the holding bath into the ladle. Once the ladle is filled, the furnace returns to its upright position and continues the melting process. This complicated process is managed by a set of PLCs and sensors, which also handle a strict set of safety procedures. For example, when the operator starts the filling process by pressing a button on the panel, the safety controls check whether a ladle is placed at the loading station. A simple and important check to avoid accidents. Other logic makes sure, for example, that the burners in the melting bath are switched off. All a combination of logic, checks and sensors.
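To make the idea of these interlocks concrete, here is a minimal sketch in Python. It is only an illustration: the real checks run in the PLC program, and the names `FillRequest`, `blocking_reasons` and `tilt_permitted` are my own assumptions, not taken from the furnace controls. It covers only the three conditions mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FillRequest:
    """Hypothetical snapshot of the inputs the safety logic looks at."""
    button_pressed: bool   # operator pressed the fill button on the panel
    ladle_present: bool    # sensor at the loading station detects a ladle
    burners_on: bool       # burners in the melting bath are still firing


def blocking_reasons(req: FillRequest) -> List[str]:
    """Return every reason why the furnace must not tilt; empty means go."""
    reasons = []
    if not req.button_pressed:
        reasons.append("no fill request from the operator panel")
    if not req.ladle_present:
        reasons.append("no ladle at the loading station")
    if req.burners_on:
        reasons.append("burners are still on")
    return reasons


def tilt_permitted(req: FillRequest) -> bool:
    """Tilting is allowed only when no check objects."""
    return not blocking_reasons(req)
```

Collecting all blocking reasons, instead of stopping at the first one, mirrors how an operator panel would show every active alarm at once.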
The original loading bay for the ladle had been replaced several years ago with a smart version which, besides the sensor to detect the ladle in the bay, also provided a scale to weigh the exact amount of material during each loading. An important upgrade that provided valuable data to control the process. With it, an additional safety check was implemented: the maximum weight indicator set an alarm on the furnace controls.
Judging by the logic of the safety controls, this accident could never have happened, but still it did. A few hours into the first shift after the weekend, so after the melting process had been restarted but still before the first load of material was taken from the furnace, the furnace had started to tilt and tilted all the way to the end position. No operator had pressed the button on the panel to initiate the process and there was no ladle at the loading bay; the recordings from the surveillance cameras confirmed that. Although there was no logical trigger and the safety circuits should have prevented this, the furnace tilted and kept tilting until it had emptied its entire dangerous content onto the shop floor.
What happened after that was a chain of dangerous consequences. The aggressive liquid aluminum set fire to the tires of a forklift and then to the forklift itself. The next stop in the path of destruction was a casting machine. First the platform, then the control cabinets and finally parts of the machine itself. What followed next appeared on video as if a bomb had exploded. The fire forensic experts told me that the liquid aluminum reached the mounting platform of the water cooling packs behind the casting machine. Of course they were placed on a platform so that the water would not touch the liquid aluminum in case of an accidental spill. That would be very dangerous!
This precaution, however, was sufficient only for smaller spills, not for a melting furnace emptying its entire content onto the shop floor… The heat weakened the platform foundation until it collapsed under the weight of the 10,000-liter water tank. When the tank sank down, the liquid aluminum caused it to burst. 2,000 kg of liquid aluminum and 10,000 liters of water are an explosive combination, and that is exactly what happened.
A couple of lucky factors explain why there were no serious casualties during this explosion. It was still early in the first shift after the weekend, so the melting furnace wasn’t completely filled up yet. During normal operation there would have been twice as much material in the holding bath. Production hadn’t started yet, so there were no operators at the casting machines. A few hours later there would have been. It could have been worse, much worse!
Root cause analysis
The plant had moved the service and maintenance from the original vendor of the furnace to a local company. Cost savings, local support, faster response times, the usual arguments. The cooperation had worked to the full satisfaction of the plant and there was no reason to complain. So far…
The IT manager of the plant, who was very cooperative at first, had investigated all log files to see which data was going where, whether the systems had worked according to specification during the accident, and so on. Much to his surprise, he discovered that an Administrator account had connected remotely to the IPC and PLCs of the furnace through the service connection 24 minutes before the accident and disconnected 8 minutes before the actual explosion. We could quickly trace the IP address to the service provider.
We discovered that the service provider should have done an onsite routine calibration of the burners in the melting furnace during the weekend. This should have been completed before 14:00 that Sunday, but apparently the responsible technician had been running late and decided to do it remotely during the night. The furnace was already operational when he logged in through the remote connection, and he should have seen that on his status overview. Unfortunately, he was looking at the wrong furnace because of a mix-up of two addresses in the local configuration of his maintenance tool.
In his status overview, the furnace was still in standby mode, and he worked his way through the diagnostics and calibration. After several attempts at re-calibration, to which the furnace wasn’t responding in his status overview, he made a critical mistake. Instead of looking at the log entries, which would have shown him the different addresses, he decided to initiate a reset routine without even understanding what that reset routine does. When he still didn’t see the furnace responding on his status overview, he simply disconnected, planning to try again the next day. The next day the accident was in the local news and, instead of speaking up about his actions, he decided to stay silent for fear of the consequences.
The reason he didn’t see the furnace responding was the mix-up of addresses in his local configuration. While his status overview was reading data from the second furnace, which was in standby mode, his maintenance program was sending commands and data to the furnace in operational mode.
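A simple consistency check in the maintenance tool could have caught this mix-up before the first command was sent. The following Python sketch shows the idea. Everything in it is hypothetical: I don't know how the technician's tool stores its configuration, the class and function names are mine, and the addresses are placeholders from the documentation range 192.0.2.0/24.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FurnaceProfile:
    """Hypothetical entry in the local configuration of a maintenance tool."""
    name: str
    status_address: str    # where the status overview reads from
    command_address: str   # where commands and data are sent to


class AddressMismatch(Exception):
    """Status and command channels point at different devices."""


def verify_profile(profile: FurnaceProfile) -> None:
    """Refuse to work with a profile whose two channels disagree."""
    if profile.status_address != profile.command_address:
        raise AddressMismatch(
            f"{profile.name}: status is read from {profile.status_address} "
            f"but commands go to {profile.command_address}"
        )


# Placeholder data that reproduces the kind of mix-up described above.
mixed_up = FurnaceProfile("furnace 1", "192.0.2.20", "192.0.2.10")
consistent = FurnaceProfile("furnace 2", "192.0.2.20", "192.0.2.20")
```

Calling `verify_profile(mixed_up)` raises `AddressMismatch`, while `verify_profile(consistent)` passes silently.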
The reset routine is a very special function, which calibrates all states of the furnace and its systems. The execution of this function should be supervised, and precautions have to be taken before it is executed. One of the precautions is to make sure that the holding and melting baths are empty, and there is a very good reason for that and for all the other precautions. Normally, this reset routine is only used onsite by qualified maintenance operators of the vendor during commissioning of a furnace or after significant repairs.
On the control panel of the furnace, a password needs to be entered, followed by three confirmations and explicit warnings, and then a key lock has to be set to the highest maintenance mode to start this function. None of that applies, however, when connected directly to the control program in manual debug mode. A simple command can execute the reset routine, and that is exactly what the technician did. In manual debug mode, which disables all safety logic, the furnace controls run through the preset processes to reset the entire furnace. One of these processes is to tilt the furnace once to the end position, calibrate the various sensors and the hydraulic pressure measurement, and bring the furnace back to its original vertical position to calibrate and verify that as well. The technician, however, thought that the reset routine would simply restart the control systems…
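The weakness here is that the precautions were tied to the control panel and not to the function itself. The Python sketch below is my own illustration of the alternative, not the vendor's design: the preconditions are checked at the reset routine, so every entry path, including a debug connection, hits the same guard. The names `ResetContext`, `reset_blockers` and `run_reset` are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ResetContext:
    """Hypothetical facts the controls know when a reset is requested."""
    baths_empty: bool               # holding and melting baths are empty
    key_lock_in_maintenance: bool   # key lock in the highest maintenance mode
    remote_session: bool            # request arrives over the service connection


def reset_blockers(ctx: ResetContext) -> List[str]:
    """List every precaution that has not been met."""
    blockers = []
    if not ctx.baths_empty:
        blockers.append("baths are not empty")
    if not ctx.key_lock_in_maintenance:
        blockers.append("key lock is not in the highest maintenance mode")
    if ctx.remote_session:
        blockers.append("reset requested remotely instead of onsite")
    return blockers


def run_reset(ctx: ResetContext, do_reset: Callable[[], str]) -> str:
    """Run the reset only if no precaution objects, whatever the entry path."""
    blockers = reset_blockers(ctx)
    if blockers:
        raise PermissionError("reset refused: " + "; ".join(blockers))
    return do_reset()
```

With the situation of that night, `ResetContext(baths_empty=False, key_lock_in_maintenance=False, remote_session=True)`, the call is refused with all three reasons listed.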
The smart loading bay was completely destroyed by the liquid aluminum and the explosion, but its Industrial Control Assets were still intact. We found that the alarm indicating that no ladle was present had been set properly. It must have been shortly before the complete meltdown of the scale that the alarms for exceeding the maximum time and the maximum weight were set. Shortly after that, the alarm was set that the connection to the loading bay was interrupted.
On the Industrial Control Assets of the furnace, the alarm was set that the key lock was not in the highest maintenance position. The alarm was set that remote access was active, followed by the alarm that manual debug mode was activated and normal operation was disrupted. The alarm was also set that the burners were still on although the furnace had left the locked vertical position when it started to execute the reset procedure. Even the alarms from the loading bay were duplicated and acknowledged. All systems, logic, checks and alarms did exactly what they were intended to do, without any effect. In manual debug mode, the safety responses they would normally trigger are all disabled!
There was a procedure in place for supervised remote access to the control systems. It described that remote connectivity had to be enabled only after approval and with one-time temporary accounts, and that both the connection and the temporary account had to be disabled immediately after completion of the work. Because the IT-Department wasn’t able to provide 24/7 support, the management of this process was handed over to the Maintenance Department. It must have been too much fuss, because the connection was always available and the account used by the external service partners was actually the Administrator account that had been created for the Maintenance Department. The statement by the responsible Maintenance Manager explained a lot: “It is very annoying that the IT-Department deletes the temporary accounts and it works much faster with administrator rights. We don’t have time to do all the paperwork and steps to turn the connection on and off, IT should do that. It is their network, I am not responsible”…
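The procedure itself was sound, and it is not hard to enforce in software. As an illustration only, here is a Python sketch of a one-time, time-limited access grant. The plant's actual remote access system is not known to me; the class, the function and the account names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RemoteAccessGrant:
    """Hypothetical one-time approval for a single remote session."""
    account: str
    approved_by: str
    valid_from: datetime
    valid_for: timedelta
    used: bool = False

    def is_valid(self, now: datetime) -> bool:
        """Valid only inside the approved window and only if never used."""
        return (not self.used
                and self.valid_from <= now < self.valid_from + self.valid_for)


def open_session(grant: RemoteAccessGrant, now: datetime) -> str:
    """Open a remote session, or refuse with a PermissionError."""
    if grant.account.lower() == "administrator":
        raise PermissionError("Administrator accounts may not log in remotely")
    if not grant.is_valid(now):
        raise PermissionError("grant is expired, not yet valid, or already used")
    grant.used = True  # one-time account: a second login is refused
    return f"session opened for {grant.account}"
```

The point of the design is that the safe state is the default: nobody has to remember to switch the connection off, because the grant simply stops working once it has been used or its time window has passed.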
The insurance companies of the plant and the service provider initiated negotiations about a shared responsibility for the damages, which were estimated to exceed €4,000,000. The costs of the lost production capacity are estimated to reach an additional €3,000,000. Several people lost their jobs and face the risk of damage claims, including the responsible technician, the Maintenance Manager and members of the IT-Department.
The technician obviously had no clue what he was doing and should never have had the opportunity to cause the damage that he caused. All service activities and modifications for Industrial Control Assets should be restricted to qualified personnel and always supervised. Proper execution of access control would have prevented this. At no point in time should an account with Administrator rights be allowed to log in remotely. Proper monitoring of the IT-Infrastructure would have shown the unauthorized access to the network and the failure to restrict access through the service connection. Standard Cyber Security hygiene could have and should have prevented this! The lack of proper access control allowed a totally unqualified person to cause significant damage.
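Monitoring for this does not need to be sophisticated. The Python sketch below flags every remote login by an account with Administrator rights. It assumes that login events are already available as simple records; the record fields, the account names and the sample data are hypothetical and not taken from the plant's log files.

```python
from typing import Dict, Iterable, List, Set


def remote_admin_logins(events: Iterable[Dict[str, str]],
                        admin_accounts: Set[str]) -> List[Dict[str, str]]:
    """Return all login events that are remote and use an admin account."""
    return [
        event for event in events
        if event["origin"] == "remote" and event["account"] in admin_accounts
    ]


# Hypothetical sample records, for illustration only.
sample_events = [
    {"account": "operator1", "origin": "local", "target": "furnace IPC"},
    {"account": "Administrator", "origin": "remote", "target": "furnace IPC"},
    {"account": "Administrator", "origin": "local", "target": "furnace IPC"},
]

alerts = remote_admin_logins(sample_events, {"Administrator"})
```

In this sample only the second record ends up in `alerts`. In practice each alert should reach a person who can cut the connection, not just another log file.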
Cyber Security is not just about keeping the bad guys out. Cyber Security is also about preventing internal unauthorized and in this case unqualified access!