Friday, 23 October 2015

The World’s Most Expensive Software Errors

Coming on the back of some recent revelations regarding software in cars, Mark Pickavance looks at the errant software that cost companies millions

The power of software is that it’s a flexible component, able to be adapted quickly to meet new requirements and challenges. But as much as that malleability is a strength, it’s also a weakness, because changes can happen rapidly and be deployed without sufficient testing. Sometimes this causes mere inconvenience, but on occasion it can be horribly expensive for the company involved.


Here, for your delectation, are some of the worst software blunders ever and their estimated financial impacts.

EDS Child Support


The public sector’s track record in the procurement and implementation of IT systems is astoundingly bad, but even by those remarkably low standards, the system that EDS created for the Child Support Agency (CSA) became the poster boy for bad government software development.

A modicum of blame must go to the Department for Work and Pensions, which decided to entirely restructure its operations at the same time as EDS rolled out its new CSA system.

It appeared that no one had considered that the interactions between the two organisations would no longer work once their systems became entirely incompatible. The CSA system was worst affected, at one point carrying some 500 documented bugs and many more undocumented ones. It ended up allowing the CSA to overpay 1.9 million people and underpay a further 700,000, while more than £5 billion in child support payments went uncollected.

During this period, the CSA attracted the highest number of public complaints annually by some considerable margin, as it failed to help those in need of support or to effectively pursue those who were obliged to pay. It was eventually admitted that the CSA spent £1 for every £1.85 it managed to collect in child support, while running a backlog that at one point exceeded 240,000 cases.

A huge amount of money was never recovered, and the CSA system cost more than £1 billion of taxpayers’ money, yet it couldn’t cope with the simplest changes of circumstance that people commonly encounter.

A divorced friend once recounted to me something he was told by a CSA case worker on the phone. She told him categorically not to tell her if he patched up his marriage, because placing that information in the system would mean his records would be entirely inaccessible, possibly forever.

Eventually, the role of the CSA was taken over by a new body, the Child Maintenance and Enforcement Commission, and a new system was commissioned to do the work that the EDS one so gloriously failed to do.

The ‘Child Maintenance Service’ is on track to be delivered in 2017/18, though the National Audit Office has already voiced concern about cost overruns.

Initially budgeted at £149m in 2011, the figure rose to £275m in 2012 and again to £352m in 2014. A total of £950m is now predicted, though this might well become the second billion spent on this problem by the time it’s complete.

Scandinavian Airlines Flight 751


In late December 1991, 129 passengers and crew boarded a Scandinavian Airlines McDonnell Douglas MD-81 in Stockholm. Their destination was Warsaw via Copenhagen, and while it was a typically cold day at the airport, there was nothing especially unusual about the conditions or flight.

In command was Danish captain Stefan G Rasmussen, aided by Swedish first officer Ulf Cedermark, a very experienced team of aviators. But after just 25 seconds of flight, they realised their aircraft was in serious trouble, as both engines began to make loud banging noises.

What was later determined was that as Flight 751 had sat outside overnight, the very low temperatures had chilled the fuel in the wing tanks, causing a substantial layer of clear ice to form on the wings – ice that two de-icing cycles had failed to shift. When the aircraft rotated at lift-off, the wings flexed and cracked the ice, which was then ingested by the MD-81’s engines, mounted as they are on the rear of the fuselage.

The noise coming from the engines was indicative of compressor stall, where air flows in the wrong direction through the engine, creating repeated bangs very much like an old car backfiring.

Instinctively, and as per his training, Captain Rasmussen closed the throttles to reduce the amount of air and fuel (and ice) that the engines were attempting to ingest, in the hope that the problem would clear. Under normal circumstances this would have worked, as the engines are designed to handle ice and rain being sucked into them during normal operations.

However, what he didn’t know was that McDonnell Douglas had made a software modification to the MD-81, to address a perceived problem with pilots throttling back on their climb away from the runway to reduce the environmental impact.

The software system, called ATR (Automatic Thrust Restoration), promptly advanced the throttles back to where they’d originally been set, overriding the captain’s input and causing the engines to destroy themselves completely before Flight 751 had been airborne for 80 seconds.
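
In essence, the conflict looked something like the sketch below. To be clear, this is a loose illustration in Python: the names, numbers and logic are invented for the purpose, not McDonnell Douglas’s actual implementation.

```python
# Illustrative sketch only: names, values and logic are invented,
# not McDonnell Douglas's actual ATR code.

def atr_commanded_throttle(pilot_throttle: float, climb_throttle: float) -> float:
    """If thrust is reduced during the initial climb, quietly restore it."""
    if pilot_throttle < climb_throttle:
        return climb_throttle  # silently overrides the pilot's reduction
    return pilot_throttle

# Flight 751: the captain closed the throttles to clear the surging engines,
# but ATR, still active during the climb-out, advanced them again.
print(atr_commanded_throttle(pilot_throttle=0.30, climb_throttle=0.92))  # 0.92
```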

At this time, the plane had only achieved a height of 3,220ft – insufficient to consider turning back to the runway.

With no power and no runway, it was only the amazing skill of Rasmussen and Cedermark that brought their stricken aircraft down through a pine forest and into a frozen clearing without any loss of life. The aircraft was totally destroyed, and 100 passengers and crew suffered mostly minor injuries during the crash, which broke the fuselage into three pieces.

A later investigation cited the failure to remove the ice, and a software modification that the airline, SAS, appeared not to have been aware of – and which it therefore hadn’t trained its pilots to consider during an emergency.

No price has ever been put on this incident, but once the cost of replacing the aircraft and the various personal claims were assessed, the number was substantial.

Sadly, Captain Rasmussen felt unable to return to duty due to the psychological impact of these events and has not piloted an aircraft since.

Pilots are now trained to realise that ATR might countermand their throttle settings on this aircraft if they’re forced to deviate from their normal departure profile.

Mars Climate Orbiter


The Americans love imperial measurements, even though the nation that devised them moved to metric a considerable time ago. Because of this, they often make horrendous mistakes when converting from one standard to the other – and they often need to, since almost everyone else uses metric, precisely because it’s easier not to mess up.

A very good example of this was the Mars Climate Orbiter. Launched in December 1998 on a Delta II 7425 launch vehicle, it arrived at Mars approximately ten months later, after a journey of 669 million kilometres.

At this point, it was meant to execute a final big engine burn to insert itself into an initial orbit, ahead of surveying the red planet’s unique weather.

Unfortunately, Nasa got a hint that everything wasn’t perfect when the spacecraft passed behind Mars 49 seconds earlier than predicted, losing contact – and then it never emerged on the other side.

A later investigation determined that the software written to control the orbiter’s thrusters had been programmed with pound-force seconds (lbf·s) instead of the specified metric unit of newton-seconds (N·s). This resulted in the spacecraft being significantly closer to Mars than it should have been, and instead of being caught by the planet’s gravity, it ploughed straight into the upper atmosphere and was promptly vaporised.

The cost of developing the probe, an associated lander and manning the mission was $327.6 million, with no appreciable scientific return on that investment other than a footnote in the annals of poor software development control.

Lockheed Martin was responsible for the module that produced results in pound-force seconds, contrary to the software interface specification, and those numbers were handed to a Nasa-written routine that was following the correct specification.
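
As a rough reconstruction of how such a mismatch plays out – the function names here are hypothetical, though the conversion factor is real – consider two modules that disagree about units:

```python
# Hypothetical sketch of the Mars Climate Orbiter interface mismatch.
LBF_S_TO_N_S = 4.448222  # one pound-force second expressed in newton-seconds

def thruster_impulse_lbf_s() -> float:
    """Stand-in for the Lockheed Martin module: reports impulse in lbf·s,
    contrary to the interface specification."""
    return 100.0

def model_trajectory(impulse_n_s: float) -> None:
    """Stand-in for the Nasa routine: expects N·s, as the spec requires."""
    print(f"modelling {impulse_n_s:.1f} N·s of thruster impulse")

model_trajectory(thruster_impulse_lbf_s())                 # the bug: ~4.45x too small
model_trajectory(thruster_impulse_lbf_s() * LBF_S_TO_N_S)  # what the spec intended
```

Every thruster firing was therefore modelled with a fraction of its true effect, and the errors accumulated quietly over the entire ten-month cruise.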

What annoyed many on this disastrous project was that two members of the navigation team, who had noticed that the spacecraft wasn’t tracking as it theoretically should, had raised concerns – which were then ignored.

Ariane 5 Flight 501


You’d think that the European Space Agency might have learned a few lessons from Nasa’s Mars experience, but apparently it didn’t. In 1996, on the brink of offering a large-payload space transportation service, ESA fired its biggest space transporter from its launch facility in French Guiana. The huge Ariane 5 lifted off from the pad majestically, and just 37 seconds later the rocket exploded, taking a £500 million satellite payload with it.

Failure with launch vehicles isn’t exactly unknown, and with the energies involved, the results of a minor fault can often be catastrophic. Initially, engineers looked for a manufacturing fault or a component failure that might account for such complete failure. In the end, it was determined that the flight software was the culprit: it had tried to squeeze a 64-bit floating-point value into a 16-bit signed integer, and the resulting overflow crashed the guidance module. Luckily for the launcher, a backup system was available and took over. Regrettably, it was running exactly the same code, and crashed in exactly the same way.

With the overflow unhandled, the flight computer interpreted diagnostic data as real flight data and commanded a violent course correction, converting Ariane 5 Flight 501 into a billion tiny pieces of scrap metal floating down over the Atlantic Ocean.

Curiously, this code was carried over unchanged from the previous Ariane 4 design, the relevant difference being the performance of the newer, more powerful rocket. On its more modest trajectory, the Ariane 4 never produced a value too large to fit in 16 bits, and therefore never crashed the module.
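
A minimal sketch of the failure mode, written in Python rather than the original Ada (the helper name is invented), shows how an unguarded narrowing conversion behaves:

```python
# Sketch of the Ariane 501 fault: a 64-bit float narrowed to a 16-bit
# signed integer with no range check.

def to_int16_unchecked(value: float) -> int:
    """Mimic a C-style cast to signed 16-bit, wrapping on overflow.
    (The actual Ada code raised an unhandled Operand Error instead.)"""
    raw = int(value) & 0xFFFF              # keep only the low 16 bits
    return raw - 0x10000 if raw >= 0x8000 else raw

print(to_int16_unchecked(30_000.0))  # an Ariane 4-scale value: fits, prints 30000
print(to_int16_unchecked(64_000.0))  # an Ariane 5-scale value: wraps to -1536
```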

The cost of developing the vehicle was $8bn, and due to this error and other problems, it took three attempts to get one that didn’t just explode like the world’s most expensive firework.

Intel Pentium FDIV


Given the complexity of chip design, it isn’t that surprising that chips aren’t perfect from the outset. That’s why they go through various engineering revisions before being released to the public. However, Intel has had a number of major mistakes make it out of its labs, and the biggest came in 1993, when it launched its flagship replacement for the highly successful i486: the Pentium.

On previous designs, Intel had segmented the floating point engine into a separate chip, but on the Pentium the hardware was all on a single die. Somewhere in the process of combining the two components at chip level, something went awry: a handful of entries were missing from the lookup table used by the new division hardware, and the floating point engine lost the ability to divide certain numbers correctly.

Some of the first people to notice this were those who used spreadsheets, who discovered that accounts that previously balanced no longer did.
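
The check that did the rounds at the time is easy to reproduce. On a correct FPU, the expression below evaluates to zero; a flawed Pentium famously returned roughly 256, because it computed 4195835/3145727 as 1.33373… rather than 1.33382….

```python
# The classic FDIV test: zero on correct hardware, ~256 on a flawed Pentium.
x, y = 4_195_835.0, 3_145_727.0
print(x - (x / y) * y)  # prints 0.0 here; the buggy FPU's answer was off by ~256
```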

At the time, the two applications PCs probably used the most were Microsoft Excel and Lotus 1-2-3, so it was paramount that these machines could calculate accurately.

Unhappy at having bought shiny new Pentium machines only to be forced to dig out their old i486s, affected customers soon let Intel know of their displeasure. Eventually, unable to handle the widespread public backlash, Intel agreed to replace any defective chip with one that could perform division to customers’ satisfaction.

The cost to the company was $475m at the time, which translates to $783m in 2015 dollars. That said, in 1993 only 152 million personal computers were sold, but just ten years later that number was north of 600 million, the majority of them using Intel chips or its licensed technology. A bug of this type today would therefore have a far bigger impact on Intel, which has been roundly trumped in the tablet and phone markets by other chip makers.

BlackBerry Blackout


It’s easy to forget that when President Obama came to office, he was probably the most famous BlackBerry user on the planet. Where Apple found success with consumers, BlackBerry was seen as the more appropriate tool for those in business.

It was a combination of two features that made Research In Motion (RIM, now known simply as BlackBerry) desirable for the business community: the full keyboard and the ‘push’ communications system. That second aspect was critical for BlackBerry users, because getting critical information long before others can be the difference between succeeding and failing in many markets. And with ‘push’, BlackBerry people always got the good (or bad) news first.

But RIM’s greatest advantage also became its undoing in 2011, when its worldwide push communications network suffered total failure and didn’t return for four whole days. What became very apparent during this prolonged outage was that RIM didn’t understand what the problem was or how to fix it. This undermined its credibility severely, and by the time the issue was resolved its reputation was all but trashed in the business world.

The outage originated in RIM’s datacentre in Slough, but soon spread to the Middle East and Africa, and progressed domino-like to Latin America, the US and Canada. Eventually, three-quarters of its 70 million users couldn’t communicate using the BlackBerry messaging service or even receive email.

After days of near-total silence, RIM eventually offered the excuse of a ‘core switch failure’, language seemingly chosen to calm the anger of its most valued customers. For whatever reason, it decided never to elaborate on that, and in many respects this was as much a communications failure as it was a software-based fiasco.

By way of compensation, BlackBerry customers were offered $100 of free applications for their phones, but the damage done to the brand couldn’t be fixed by token gestures.

At the time this happened, RIM was already coming under pressure from both Apple and Android, and the blackout only accelerated its market share decline. That it’s now on the verge of launching its own Android phone is a demonstration of how its fall from grace is almost complete.


Knight Capital Group


Trading on the stock exchange can be a risky business, because a breaking news story can wipe value off a stock very rapidly. But a distant event wasn’t the problem Knight Capital Group had to contend with when, in 2012, its automated trading software ran entirely out of control.

The trading day in New York started like any other, until at 9:30am EDT the software began to behave contrary to its usual logic. It decided that buying high and selling low was its new business model, and started pushing transactions of this sort through around 150 different stocks.

As a result, KCG managed to blow $440m in the next 30 minutes before it pulled the plug – four times the net income for the business in the previous year and nearly twice the market capitalisation of the company.

It’s been estimated that the glitch generated four million extra trades, involving 550 million shares, that would never normally have existed.

The software issue here was a routine called ‘Power Peg’, which had been designed to test the software’s effectiveness at tracking a stock’s movement by artificially moving it up or down in a test environment. The flag that activated it was repurposed by one of the coders for new functionality, forgetting that old Power Peg routines still present on the servers would read it and act accordingly.
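
A heavily simplified sketch of that failure pattern is shown below – all the names are illustrative, not Knight’s actual code – where a repurposed flag wakes up dead code on a server that missed the deployment:

```python
# Illustrative only: a flag with one meaning in old code and another in new.
FLAG = 0x01  # old meaning: activate Power Peg; new meaning: enable new routing

def handle_order_stale_server(flags: int) -> str:
    """A server that never received the updated deployment."""
    if flags & FLAG:
        return "Power Peg: keep firing child orders (test-only logic!)"
    return "normal routing"

def handle_order_updated_server(flags: int) -> str:
    """A server running the new code."""
    if flags & FLAG:
        return "new routing logic, as the coder intended"
    return "normal routing"

# The same flag on the same order means two different things in production.
print(handle_order_stale_server(FLAG))    # the old test routine springs to life
print(handle_order_updated_server(FLAG))  # the behaviour everyone expected
```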

The result was that Knight Capital Group’s own stock lost 75% of its value, and the company was acquired by Getco LLC, by mutual agreement, some months later.

Many in the stock trading capitals of the world are concerned that automated trading will eventually cause a complete collapse of the world’s markets. And what happened to the Knight Capital Group only fuels the suspicion that a simple coding error might cause commercial chaos if allowed to go unchecked.

Heathrow Terminal 5


Anyone who’s travelled through an airport will soon realise that they’re very prone to single-point-of-failure scenarios. Like it or not, they’re essentially factories, where the travelling public and their baggage are loaded in at one end, with happy destinations supposedly waiting at the other.

Obviously, new airports get to avoid the mistakes of the past – or so you’d reasonably think. That logic certainly didn’t apply to Heathrow Terminal 5 when it opened in March 2008 for the exclusive use of British Airways.

Not long after the Queen had officially opened it for business, it became apparent that the new terminal wasn’t working as planned, specifically in the area of baggage handling.

The software that had worked flawlessly in testing became unpredictable when confronted with typical scenarios, such as checked-in luggage needing to be retrieved prematurely because it contained something critical, like passports or medication.

It eventually collapsed completely, and over the following ten days of operation, approximately 42,000 bags failed to travel with their owners, and some 500 flights were cancelled entirely.

These problems delayed the transfer of British Airways’ long-haul flights to the terminal and forced the company to pay for an advertising campaign to reassure the travelling public that the flaws had all been addressed. Those who travel through Heathrow on a regular basis might still contest that claim, I suspect.

The total cost of the mostly software-based problems that beset Terminal 5 has never been made public, but it probably runs into the tens of millions at the very least.

VW Emissions Testing


By any strict definition, the software that VW deployed in its diesel cars wasn’t faulty, from what we know. And, as they say, that’s really the crux of the problem.

This story, as all the best ones do, has some legs that take it back a couple of years and involve a non-profit organisation called the International Council on Clean Transportation (ICCT). It set out to bang the automotive drum about ‘clean diesel’ cars (if that isn’t a contradiction in terms) and chose to do this by working with some emissions researchers at West Virginia University.

They took three popular vehicles – a VW Jetta, a VW Passat and a BMW – and tested their emissions both in the lab and on the road. Disturbingly, only the BMW passed, and there was a huge discrepancy (40-fold on occasion) between what the VW vehicles emitted on a rolling road and on a real highway.

As it’s legally bound to in the US, the ICCT reported these findings to the EPA and the California Air Resources Board (CARB), both of which approached VW for an explanation.

According to various sources, VW initially blamed the problem on a software bug and promised to fix it immediately with a voluntary recall.

Subsequent testing revealed that it wasn’t just these but other VW and Audi cars that exhibited this odd behaviour, and a year later they still did. Eventually, tired of VW’s prevarication, CARB declared that it wouldn’t certify the company’s 2016 models for sale, at which point the truth came out. VW had designed a ‘defeat device’ that could work out whether the vehicle was actually being driven or was on a rolling road, by checking whether the steering wheel was being moved.

If the vehicle sensed that the wheels were rotating but no steering input was being detected, it would enter a special testing mode, where it would run the catalytic scrubber up to full power. The downside of doing this is that it drastically affects both performance and fuel consumption, undermining the whole ethos of owning one of these ‘fuel efficient’ vehicles.
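
In spirit, the detection logic described above amounts to something like the following sketch. The thresholds and names are invented for illustration, and the real device reportedly weighed more signals than these two:

```python
# Illustrative-only sketch of a 'defeat device' style check: wheels turning
# with no steering input suggests a rolling road rather than real driving.

def looks_like_rolling_road(wheel_speed_kmh: float, steering_deg: float) -> bool:
    return wheel_speed_kmh > 10.0 and abs(steering_deg) < 0.5

def emissions_mode(wheel_speed_kmh: float, steering_deg: float) -> str:
    if looks_like_rolling_road(wheel_speed_kmh, steering_deg):
        return "test mode: run emissions controls at full strength"
    return "road mode: favour performance and fuel economy"

print(emissions_mode(50.0, 0.0))   # on the dyno -> full emissions controls
print(emissions_mode(50.0, 12.0))  # on the road -> the dirty mode
```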

Subsequently, VW admitted that nearly half a million cars in the US and 11 million worldwide had this system for cheating emission tests, and the consensus is that it isn’t the only car maker who’s been doing this.

What’s mildly hilarious about this story is that when it broke, the head of VW’s US business, Michael Horn, said, “we totally screwed up”. Er… no, Michael. Screwing up is when you go to the supermarket and come back without eggs, whereas in this instance you set out to defraud your customers and the emission control agencies, while intentionally damaging the environment. And you succeeded in this venture for about six years.

The Audi A3 and the VW Passat, Beetle, Golf and Jetta are all affected.

An insurmountable problem for VW in this instance is that if just one division of the company had been affected, it could have claimed a rogue team or designer; but as this tech turned up in the Audi division too, it was clearly something managed at a much higher level.

The financial implications of this software choice are quite mind-blowing and could potentially bankrupt the company – at this time the world’s largest car maker.

There are 482,000 vehicles in the US that carry the defeat mechanism, and under US statutes, VW could face a fine of up to $37,500 per vehicle, which would hit it in the wallet to the tune of $18bn. However, that doesn’t take into account fixing those vehicles, or any claims made against it for environmental damage and by the cars’ owners. And it also doesn’t include the approximately 10.5 million cars outside the US, or how those owners and countries will react to these disturbing revelations.

VW shares have dropped nearly 60%, as shareholders have tried to assess the impact of these additional costs, which might wipe out profitability for some years to come. When the final bill for this comes in, if VW doesn’t vaporise under the legal barrage in the meantime, it will probably turn out to be the most costly software mistake ever made.

That said, it’s entirely possible that another company has done something even more horrendous, which we just haven’t heard about yet…