What’s behind global IT outage?
A major IT outage has hit businesses across the world, grounding planes as well as affecting banks and the healthcare sector. George Kurtz, CEO of IT security firm CrowdStrike, said it had traced the issue to a “defect found in a single content update” for the security software it provides for the Microsoft Windows operating system on computers. Microsoft said the issue was caused by an “update from a third-party software platform” and that the “underlying cause” had now been fixed. The Conversation spoke to Professor Alan Woodward, an expert in cybersecurity at the University of Surrey, about what went wrong and how the problem could be resolved.
Can you explain what’s happened here?
I think there are two things. First, Microsoft seems to have had a problem with its Azure cloud computing platform. The details are a bit unclear, but that service was degraded from the evening of 18 July, although it didn’t fail altogether. By far the bigger problem, though, seems to be an update that appears to have been rolled out late in the evening of 18 July for [IT security company] CrowdStrike’s Falcon product – a computer threat checker. Falcon works by having some “agent” software deeply embedded in the operating system of every PC running Windows, which monitors that computer and “calls home” if there’s a problem. It also receives updates on which threats to look out for. It’s used a lot by large organisations throughout the world, which have a huge number of PCs to police.
I’m sure CrowdStrike is urgently investigating what happened. This piece of software is designed to protect people from ransomware attacks and the like. From the latest information I’ve seen, it looks like a system file in the update was somehow released in an incorrect format.
The Windows operating system gets to this update and it doesn’t know how to cope, so it crashes. That’s why people have been getting the “blue screen of death” [a computer screen with an error message indicating a system crash].
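As a rough illustration of the general idea of checking an update file before acting on it, here is a minimal sketch in Python. It is not CrowdStrike's code or file format: the `UPD1` signature, the header layout and the function names are all assumptions made up for this example. It shows a loader that rejects a malformed file and keeps the rules it already has, rather than failing.

```python
import struct
import zlib

MAGIC = b"UPD1"                  # hypothetical 4-byte file signature
HEADER = struct.Struct("<4sII")  # signature, payload length, CRC32 of payload


def pack_update(payload: bytes) -> bytes:
    """Build a well-formed update file (used only for this demonstration)."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def parse_update(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the file is malformed."""
    if len(blob) < HEADER.size:
        raise ValueError("file shorter than header")
    magic, length, checksum = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC:
        raise ValueError("bad signature")
    if length != len(payload):
        raise ValueError("length mismatch")
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch")
    return payload


def load_rules(new_blob: bytes, current_rules: bytes) -> bytes:
    """Use the new rules only if they validate; otherwise keep the old ones."""
    try:
        return parse_update(new_blob)
    except ValueError:
        return current_rules
```

In this sketch, a truncated or zero-filled file is simply ignored and the previous rules stay in force. Real security agents run inside the operating system kernel, where the equivalent checks are far more involved.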
And the big problem is, you can’t fix this issue remotely. You have to go into every machine separately and put it into “safe” or “recovery” mode to isolate the software. From there, you should be able to reboot the machine and get it up and running again. But if you’re a big global company with a large distributed IT estate, that’s going to take a long time.
Why has this outage had such wide-ranging effects?
CrowdStrike has been a great success – its security software is used by thousands of major clients around the world. So airlines, airports, railways, hospitals, stock exchanges … they’re all going down. It started in Australia when they got up for business on Friday. The update had clearly been sent out last night UK time, and it has just rippled around the world.
Deliberate ransomware attacks typically take out one or two targets at a time. But in this case, it’s happened to thousands of organisations at once. We’ve not had anything like this before. How CrowdStrike will fix the software is yet to be determined. As I’ve explained, it’s clear how companies can work around the issue. But for some very large organisations, this could affect their critical infrastructure and business for a long time yet – it’s going to take them days to physically work round all those machines.
Can security companies ensure this doesn’t happen again?
Security software is very intertwined with a computer’s operating system – it’s buried deep in there. There has to be a way to ensure that, if something is found to be corrupted, it doesn’t just keep crashing the system – this may have to be done in cooperation with Microsoft, which owns the Windows operating system.
There’s got to be some way of backing out of it, and there is. However, most people trying to log into their blank PCs don’t know how to put their PCs into safe mode and revert to a previous state.
At the moment, it looks like it’s one corrupted file that’s producing a global problem. Computers download updates all the time, so how Microsoft prevents that from happening with this update, I don’t know.
It’s not immediately obvious. And the million dollar question is: how did this corrupted file get released in the first place?
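One general way of "backing out" automatically, sketched below in Python, is to count consecutive failed start-ups and revert to the last version that is known to have worked. This is only an illustration under stated assumptions: the threshold, the `UpdateState` record and both function names are hypothetical, the state would need to be stored on disk in practice, and nothing here describes how Windows or Falcon actually behaves.

```python
from dataclasses import dataclass

MAX_FAILED_BOOTS = 3  # hypothetical threshold before reverting


@dataclass
class UpdateState:
    active: str            # update version loaded at start-up
    last_known_good: str   # last version that completed a start-up
    failed_boots: int = 0  # consecutive start-ups that did not complete


def on_boot_start(state: UpdateState) -> str:
    """Called early in start-up; returns the update version to load."""
    too_many_failures = state.failed_boots >= MAX_FAILED_BOOTS
    if too_many_failures and state.active != state.last_known_good:
        # Back out of the suspect update and start counting afresh.
        state.active = state.last_known_good
        state.failed_boots = 0
    # Assume this start-up fails until on_boot_success says otherwise.
    state.failed_boots += 1
    return state.active


def on_boot_success(state: UpdateState) -> None:
    """Called once the machine is up; marks the active version as good."""
    state.last_known_good = state.active
    state.failed_boots = 0
```

With these assumptions, a machine that crashes three times in a row on a new update would load the previous version on its fourth start-up, without anyone having to visit it.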
How long before this problem is fully resolved?
It’s certainly going to take days, if not weeks. It’s like those hospitals in London that got attacked with ransomware. They’re still suffering – there’s a very long tail on these things. And in this case, it’s not just a long tail but a very broad swathe of global organisations in transport, health and everywhere else. I don’t think we’ve seen anything like this before.
(Writer is Professor, Department of Computer Science, University of Surrey; https://theconversation.com/)