By Annika Burgess for the ABC
CrowdStrike would be feeling "very embarrassed" after issuing its Root Cause Analysis (RCA) of the faulty software update that led to potentially the largest global IT outage in history, experts say.
It came down to a mistake first-year programming students are taught how to avoid.
On 19 July, the fateful Blue Screen of Death (BSOD) Friday, about 8.5 million Windows systems around the world went into meltdown when an update for CrowdStrike's Falcon sensor product went very wrong.
The US cybersecurity company released a preliminary report days after the incident.
Now a more in-depth, 12-page analysis has confirmed the root cause: a single undetected sensor.
Falcon's privileged access
CrowdStrike offers ransomware, malware and internet security products almost exclusively to businesses and large organisations.
The widespread outage has been linked to its Falcon sensor software, which is installed to look for threats and help lock them down.
Sigi Goode, a professor of information systems at the Australian National University, said Falcon had very privileged access.
It sits at what is called the kernel level of Windows.
"It's sitting as close to the engine that powers the operating system as possible," Professor Goode said.
"Kernel mode is constantly watching what you're doing and listening to requests from the applications you're using, and servicing them in a way that appears seamless to you."
He described kernel mode as the traffic police that Falcon sits alongside, saying, "I don't like the look of that vehicle, we should take a look at it".
The sensor 21 culprit
CrowdStrike is constantly updating Falcon.
On 19 July, the company sent out a Rapid Response Content update to certain Windows hosts.
In the RCA, CrowdStrike called it the "Channel 291 Incident", in which a new capability was introduced into Falcon's sensors.
Sensors are like "a pathway for evidence" that tells Falcon what sort of suspicious activity to look for, Professor Goode said.
"Falcon is looking at a range of sensors - a range of indicators - to see if something is wrong," he said.
When an update is sent, it changes the location or the number of sensors Falcon checks for a potential attack.
In this instance, Falcon expected the update to have 20 input fields, but it had 21 input fields.
This "count mismatch" is what caused the global crash, CrowdStrike said.
"The Content Interpreter expected only 20 values," the RCA report states.
"Therefore, the attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash."
Because Falcon is so tightly integrated into the core of Windows, when it crashed it brought down the entire system, causing the BSOD.
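The count mismatch the RCA describes can be illustrated with a short sketch. This is not CrowdStrike's code: the function name, the wildcard marker and the sample values are all hypothetical, and it assumes, based on the RCA's wording, that a rule only needs to read a value when its criterion for that field is not a wildcard. Python itself would raise an error rather than read stray memory, so the point here is only the bounds check that refuses to look for a value that is not there.

```python
from typing import Optional, Sequence

WILDCARD = "*"  # hypothetical marker meaning "match anything"


def matches(criteria: Sequence[str], inputs: Sequence[str]) -> Optional[bool]:
    """Compare each criterion with the input value at the same position.

    Returns None, instead of reading past the end of `inputs`, when a
    non-wildcard criterion refers to a position the inputs do not have.
    """
    for index, criterion in enumerate(criteria):
        if criterion == WILDCARD:
            continue  # a wildcard never needs to look at the input value
        if index >= len(inputs):
            return None  # bounds check: the value is not there, so refuse
        if inputs[index] != criterion:
            return False
    return True


# Assumed sample data: the software supplies 20 values.
inputs = [f"value{i}" for i in range(20)]
old_rule = [WILDCARD] * 21                   # 21st field is a wildcard
new_rule = [WILDCARD] * 20 + ["suspicious"]  # 21st field must be compared

print(matches(old_rule, inputs))  # True
print(matches(new_rule, inputs))  # None
```

In this sketch the older rule never touches the 21st position, which is one way such a mismatch could sit unnoticed until a rule actually needed that value.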
Professor Goode said one of the most common ways to compromise a system was to flood its memory.
Essentially, you tell the computer to look for something "out of bounds".
"It was looking for something that wasn't there," he said.
"But Falcon had to look in that 21st location, because that's what it was told to do by the new template it was given."
How can this happen?
CrowdStrike has apologised for the failure, which has led to its chief executive, George Kurtz, being called to testify before the US Congress to explain what happened.
"We are using the lessons learned from this incident to better serve our customers," Kurtz said in a statement this week.
"To this end, we have already taken decisive steps to help prevent this situation from repeating, and to help ensure that we - and you - become even more resilient."
CrowdStrike's quality assurance (QA) processes have come into question.
The company has said that its updates "go through an extensive QA process, which includes automated testing, manual testing, validation and rollout steps".
But Rapid Response Content, which was used in this instance, goes through a different process.
In the report, CrowdStrike admits that "lack of a specific test for non-wildcard matching criteria in the 21st field" contributed to "the confluence of these issues that resulted in a system crash".
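The kind of test the report says was missing might look something like the sketch below. It is an illustration only: the function and constant names are hypothetical, and the two counts are taken from the figures quoted from the RCA. The idea is to put a non-wildcard criterion in every field in turn and flag any field whose value the software would not actually supply.

```python
from typing import List

FIELDS_IN_TEMPLATE = 21  # fields the new template referred to (figure from the RCA)
VALUES_SUPPLIED = 20     # values the Content Interpreter expected (figure from the RCA)


def fields_that_would_overrun(template_fields: int, supplied_values: int) -> List[int]:
    """Return the 1-based field numbers that a non-wildcard rule could not read safely."""
    failures = []
    for field in range(1, template_fields + 1):
        # A non-wildcard criterion in `field` forces a read of that value,
        # so the field is unsafe if no value exists at that position.
        if field > supplied_values:
            failures.append(field)
    return failures


print(fields_that_would_overrun(FIELDS_IN_TEMPLATE, VALUES_SUPPLIED))  # [21]
```

A check like this, run before release, would fail on the 21st field and pass once the two counts agree.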
Toby Murray, associate professor at the University of Melbourne's School of Computing and Information Systems, said the "dodgy data file update" was "embarrassing".
He said even basic checks by a human developer would have found the problem.
"That is an incredibly basic and fundamental mismatch that was always going to lead to catastrophic problems, sooner or later," he told the ABC.
"The fact that the CrowdStrike developers were able to have this obvious inconsistency between the data file format and the software code means that the most basic forms of quality review and assurance were not being correctly carried out."
Professor Goode said this kind of mistake shouldn't be happening.
He said the update should have been released through a staged deployment.
"When they wrote this report, they must have been feeling very embarrassed," he said.
"First-year programming students are taught about the 'stack', the series of instructions that need to be executed in a CPU (central processing unit)."
CrowdStrike announced it had engaged with two independent software security vendors to conduct a further review of the Falcon sensor code for both security and quality assurance.
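The staged deployment Professor Goode refers to can be sketched in a few lines. This is a simplified illustration, not a description of CrowdStrike's rollout system: the ring names, host names and health check are all assumptions. An update goes to a small first group, and the rollout stops as soon as any host in a group fails its health check, so later groups are never touched.

```python
from typing import Callable, List, Optional, Tuple


def staged_rollout(
    rings: List[List[str]],
    is_healthy_after_update: Callable[[str], bool],
) -> Tuple[Optional[int], List[str]]:
    """Update one ring of hosts at a time, halting at the first unhealthy ring.

    Returns (ring number where the rollout halted, or None if it finished,
    list of hosts in the rings that were reached).
    """
    reached: List[str] = []
    for ring_number, ring in enumerate(rings, start=1):
        reached.extend(ring)
        if not all(is_healthy_after_update(host) for host in ring):
            return ring_number, reached  # stop before the next ring
    return None, reached


# Hypothetical rings: a small test group first, the wider fleet last.
rings = [
    ["canary-1", "canary-2"],
    ["early-1", "early-2", "early-3"],
    ["fleet-1", "fleet-2", "fleet-3", "fleet-4"],
]

# A faulty update that crashes every host it reaches.
halted_at, reached = staged_rollout(rings, lambda host: False)
print(halted_at, reached)  # 1 ['canary-1', 'canary-2']
```

In this example the faulty update reaches only the two test hosts, rather than every host at once.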
Calls for accountability
In the wake of the outage, regulators and businesses have been considering legal implications.
The incident sent airports into chaos, supermarket check-outs stopped working, and media outlets struggled to bring you the news.
In Australia alone, the impact on businesses has been estimated at more than $A1 billion ($NZ1.08b).
Australian Industry Group chief executive Innes Willox told ABC's The Business he expected the damage bill from the glitch to run into the billions of dollars.
But he said it was still unclear whether affected businesses would be able to seek compensation from CrowdStrike for any losses incurred from the outages.
America's Delta Air Lines last week said the outage had cost the company $US500 million ($NZ834m) and that it planned to take legal action to get compensation from the cybersecurity firm.
CrowdStrike has rejected the claim, saying in a letter from an external lawyer that it is "highly disappointed by Delta's suggestion that CrowdStrike acted inappropriately and strongly rejects any allegation that it was grossly negligent or committed misconduct".
Delta cancelled more than 6000 flights over a six-day period, impacting more than 500,000 passengers.
It faces a US Transportation Department investigation into why it took so much longer for it to recover from the outage than other airlines.
This story was first published by the ABC.