Date:
29 August 2016
Author:
Salsa Digital

# Census Fail

Our first response to what quickly became known as #CensusFail was to draft a blog straight away and get our opinion out on what went wrong on Census night. However, we took a breath and decided it would be more prudent to wait until more details emerged. Interestingly, not much has surfaced since the first few days after 9 August. Maybe that’s got something to do with our theories on what actually happened.

Digital transformation

First off, we want to take our hats off to the ABS and everyone involved in the decision to put the Census online this year. The monumental problems (and the resulting chaos) don’t change the fact that an online Census is a great example of digital transformation in government. If we go back to our first ever blog in this series, our definition of digital transformation was “substantive (or maybe even complete) change that uses computerised technologies including the Internet.” An online Census undoubtedly delivers on this front. (Although there’s also an argument about whether we even need a Census, with so much real-time data out there now!)

Before we focus on the problems, it’s also fitting to quote The Mandarin’s Tom Burton: “If at the first hurdle we resort back to an old fashioned lynching mob, then you can say goodbye to any real public sector innovation out of Canberra for another decade.” (We highly recommend his blog on five lessons from #CensusFail.)

However, as much as we applaud the intentions of the government, it’s also apt that Salsa Digital, as an experienced technology service provider with many government clients, provides an analysis of the site and what we think went wrong.

DDoS attack? Mmm…

The morning after the Census site crashed (and was subsequently pulled down), it was no surprise to see a variety of politicians providing accounts of what went wrong. SBS published a great timeline that morning. Some key points are:

10.08am (AEST) – traffic spike for 11 minutes, which causes a five-minute system outage.

11.46am – another spike “consistent with a second denial of service”.

11.50am – denial of service mitigation plan activated and international traffic blocked.

4.58pm – “modest” increase in traffic.

6.15pm – small-scale DoS attempted but stopped.

7.30pm – “significant” DoS detected (coinciding with increased logons by the Australian public).

7.45pm – ABS shuts down the online forms.

This is the official ‘line’ and our research hasn’t unearthed any contrary reports from the government or the service providers involved in the Census site.

However, our two in-house tech gurus have got a different story to tell.

Our view

One of our engineers, Ivan Grynenko, took a step-by-step approach to how the job should have been handled in the first place and what actually happened to the site on 9 August. But before we look at the best way to run a project of this magnitude, let’s dissect the timeline the government provided.

Ivan found some of the timeframes too coincidental. For example, the first ‘attack’ occurred around 10am, a time when you could reasonably expect some sections of the population to log in (e.g. shift workers, retirees, non-working mums after the school run, and others who don’t work). The second ‘attack’ was around lunchtime, the third around the time many people would be finishing work, and the fourth and final ‘attack’ began when the majority of Australia’s east coast population would have been at home (perhaps just before dinner for those without kids and just after dinner for those with kids). It sounds a little too coincidental that all the so-called ‘attacks’ occurred at exactly the times you’d expect a surge in traffic. When did you submit (or try to submit) your Census? Chances are it was during one of these peak times.

“And if it was a DoS or distributed denial of service (DDoS) attack, these types of attacks have a very specific pattern and can be blocked without affecting legitimate traffic,” said Ivan.

Our research for this blog found others also questioning the DDoS theory. An article in the Sydney Morning Herald stated: “Some online have speculated that the claims of a DDoS attack are merely cover for the fact that the ABS was unprepared for so many Australians to use the service at once.”

Certainly the system capacity was something that rang alarm bells for Ivan. “The system capacity was quoted as being able to handle only 260 forms per second. Let’s assume 15 million forms in total need to be submitted. With the rate of 260 per second (ongoing, no pauses) it will take 16 hours to submit 15 million forms. Taking timing into account and the fact it was a working day, it’s safe to assume the majority of the traffic would come around 9am-10am, then 12pm-2pm and the rest (biggest spike) around 6pm-9pm. So most of the population would log on during these six hours. If we squeeze 16 hours into our projected six hours, this comes to a required system capacity of 693 forms per second. However, given it’s impossible to predict exact times for traffic spikes, I’d expect the system to handle double (ideally triple) this number (around at least 1,400 forms per second) plus a safety margin (let’s say 20%), which would give us a required system capacity of 16,800 forms per second — which I’d round up to 17,000.” (See below for more details and projections based on this 17,000 figure.)
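To make the back-of-the-envelope arithmetic easier to follow, here’s a minimal sketch that reproduces the figures quoted above. The totals, the six-hour peak window and the 17,000 forms-per-second target are the article’s assumptions (Ivan’s estimate), not official ABS numbers.

```python
# Reproducing the back-of-envelope figures quoted above.
# All inputs are the article's assumptions, not official ABS numbers.

TOTAL_FORMS = 15_000_000      # assumed total online submissions
QUOTED_CAPACITY = 260         # forms per second the system was reportedly built for
PEAK_WINDOW_HOURS = 6         # assumed window in which most households would submit
ESTIMATED_TARGET = 17_000     # the article's suggested capacity, including headroom

# How long it takes to accept every form at the quoted rate
hours_at_quoted_rate = TOTAL_FORMS / QUOTED_CAPACITY / 3600
print(f"At {QUOTED_CAPACITY} forms/s: ~{hours_at_quoted_rate:.0f} hours to collect them all")   # ~16

# Rate required if most traffic lands inside the peak window
required_rate = TOTAL_FORMS / (PEAK_WINDOW_HOURS * 3600)
print(f"Squeezed into {PEAK_WINDOW_HOURS} hours: ~{required_rate:.0f} forms/s")                 # ~694

# Quoted capacity as a share of the article's headroom target
print(f"Quoted capacity is ~{QUOTED_CAPACITY / ESTIMATED_TARGET:.1%} of the 17,000/s target")   # ~1.5%
```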

“When systems melt down, increasing traffic of up to 17,000 form submissions per second creates a hardware resource shortage and the network traffic might look like a DDoS attack (but isn’t one).”

Salsa Digital Senior Developer Kurt Foster also feels system capacity was one of the main issues. “The load testing was below what they should’ve reasonably expected as the peak load, and well below what they should’ve tested to include a buffer on top of that.” Although it’s not a real-world comparison, you might want to read about the scalable system two students built for the ABS site (the article explains an approach to this load issue with a bit more thinking behind it).

The best way to run a project of this size

Without knowing the full details of exactly who was engaged to deliver this site (in terms of the technical experts behind the companies), a site of this magnitude requires performance specialists who understand performance planning, performance SLAs, load tests indicative of peak loads, stress testing to identify system breaking points, instrumentation to measure system utilisation levels and user response times, and so on. Only then could a conclusion be confidently drawn on the all-important question: how well will the site perform? On the expertise, Ivan suggested the site needed an expert who had worked on extremely high-traffic sites: “We’re talking an engineer from a site like Google or Facebook.” That said, performance testing and validation at this ‘enterprise level’ is both an art and a science, and not to be underestimated.
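To illustrate the kind of tooling involved (and this is definitely not the test plan that was actually used), here’s a minimal load-test sketch in Python. The endpoint URL, payload and worker counts are placeholders we’ve made up; a genuine Census-scale exercise would use dedicated, distributed load generators and far richer instrumentation.

```python
"""Minimal load-test sketch: fire concurrent POSTs at a test endpoint and
record response times. URL, payload and concurrency are placeholders."""
import time
import concurrent.futures
import urllib.request

TEST_URL = "https://staging.example.com/submit"   # hypothetical test endpoint, not the ABS site
CONCURRENCY = 200                                 # simultaneous workers (tune towards the target rate)
REQUESTS_PER_WORKER = 50

def submit_once(_):
    """Send one dummy form submission and time the round trip."""
    payload = b"census_code=TEST0001&answers=..."  # placeholder body
    req = urllib.request.Request(TEST_URL, data=payload, method="POST")
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            resp.read()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start

def run():
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        for result in pool.map(submit_once, range(CONCURRENCY * REQUESTS_PER_WORKER)):
            results.append(result)
    times = sorted(t for ok, t in results if ok)
    errors = sum(1 for ok, _ in results if not ok)
    if times:
        p95 = times[int(len(times) * 0.95)]
        print(f"{len(times)} successful, {errors} failed, p95 response time {p95:.2f}s")
    else:
        print(f"all {errors} requests failed")

if __name__ == "__main__":
    run()
```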

The media has also touched on the government’s procurement practices, and IBM as a service provider. (It’s certainly interesting to read this Sydney Morning Herald article from April this year in the context of Census night.) Salsa Digital Senior Developer Kurt Foster also questioned: “...the perception that it's better to hire a big global company rather than looking for that expertise within Aussie companies or even recruiting a very high level player to run the project.”

Another option would have been to alter the way the Census was delivered, perhaps by spreading out the submissions. This would have been a logical move, given the size of our population and the likelihood that most people would log on sometime before or after dinner. The ABS should have looked at running the Census on a state-by-state basis, or at giving people a week to log in and submit their forms online. Either would certainly have been a preferable outcome to building a system only capable of handling roughly 1.5% of what Salsa Digital estimates the capacity should have been.
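As a rough illustration of why spreading submissions out matters (these are our own assumptions, not ABS modelling), the sketch below compares the average rate needed if everyone submitted in a single three-hour evening peak with the same traffic spread across seven evening peaks.

```python
# Rough illustration only; both scenarios assume the same 15 million forms.
TOTAL_FORMS = 15_000_000

one_evening = TOTAL_FORMS / (3 * 3600)         # everyone submits in a single 3-hour evening peak
seven_evenings = TOTAL_FORMS / (7 * 3 * 3600)  # the same traffic spread over a week of evening peaks

print(f"Single evening peak: ~{one_evening:.0f} forms/s")       # ~1,389 forms/s
print(f"Week of evening peaks: ~{seven_evenings:.0f} forms/s")  # ~198 forms/s
```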

The impact going forward

Unfortunately, #CensusFail is probably going to have massive negative repercussions for digital transformation in government. Australian National University politics lecturer Andrew Hughes told The New Daily that the Census debacle has set online voting back at least another 10 years. It’s certainly eroded the public’s trust in the government’s ability to handle large-scale digital projects.

However, it’s also possible for the government and the public to turn this around and use the experience to improve our digital solutions.

Key learnings

The Mandarin’s Tom Burton came up with five key lessons:

  1. Learn well

  2. Risk management

  3. Accountability in a digital era

  4. Census in a world full of data

  5. Digital arrogance

Certainly the government should take these lessons on board and ideally the public should also embrace the ways in which we can learn from this experience.

James Riley, in an article for InnovationAus.com, compares #CensusFail to the launch of Barack Obama’s healthcare.gov website in 2013 (on its first day only six people managed to sign up online). Riley discusses how the US Government learned from that mistake and suggests Australia learn from the Census in the same way. He singles out government technology procurement as an area that needs to change, and also says: “...there must be changes to how government attracts private sector skills into the public service on projects that matter. The Census 2016 was a painful and expensive experience. But the reality is that it arrived just in time. It should provide momentum for change.”

Will we ever know what really happened?

There are currently two investigations underway into what happened on Census night. One is by the chair of the Productivity Commission, Peter Harris (who interestingly told The Mandarin that “it’s a digital revolution, and all revolutions involve risk…”). The second is being run by the Prime Minister’s Cyber Security Adviser, Alastair MacGibbon. There was some speculation that initial findings would be released last week (one of the reasons we wanted to wait before weighing in with our analysis), but so far it’s all quiet. We wait, with interest, for the findings. Will the government continue to claim DDoS attacks? Or will the investigations find that the system simply wasn’t built to handle the fact that most of the population would log on between 6pm and 9pm?

Salsa Digital’s take

We applaud digital innovation and support Peter Harris’s comment that digital revolution involves risk. We’re also hoping that the government will move towards transparency and learn from its mistakes…and that the public will understand that in today’s digital age mistakes will be made, but we can move forward if we learn from them.

Footnote: The numbers and DDoS attacks (for the tech heads!)

The 17,000 submissions per second refers to POST requests (form submissions require a POST request, which can’t be cached), and each submission requires a valid Census code. It would be nearly impossible for an attacker to obtain that many valid Census codes and generate this level of legitimate-looking traffic in the first place.

DoS (and distributed DoS) attacks use specially crafted requests that never complete, so each one permanently ties up a small amount of server resources. It is not correct to refer to legitimate traffic as DDoS attack traffic. These attack requests can be readily identified and dropped while legitimate traffic is retained. Most modern firewalls protect websites against DDoS attacks not only at layers three and four, but also at layer seven (the application layer), dropping malicious requests and blacklisting the offending IP addresses.
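For the curious, here’s a toy sketch of the application-layer idea described above: count requests per source IP over a short window, drop traffic from sources that exceed a threshold and blacklist them. Real firewalls and DDoS appliances are far more sophisticated, and the thresholds here are purely illustrative.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 100     # illustrative per-IP threshold, not a real-world figure

recent = defaultdict(deque)       # source IP -> timestamps of its recent requests
blacklist = set()                 # IPs we have decided to block outright

def allow_request(ip, now=None):
    """Return True if the request should be passed through to the application."""
    now = time.time() if now is None else now
    if ip in blacklist:
        return False              # previously identified offender: drop immediately
    window = recent[ip]
    window.append(now)
    # Discard timestamps that have fallen out of the sliding window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_REQUESTS_PER_WINDOW:
        blacklist.add(ip)         # firewall-style blacklisting of the offending address
        return False
    return True

# A household submitting its one form passes...
print(allow_request("203.0.113.7", now=0.0))           # True

# ...while a source flooding 200 requests in two seconds ends up blacklisted.
for i in range(200):
    allow_request("198.51.100.9", now=i * 0.01)
print("198.51.100.9" in blacklist)                     # True
```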

Let’s attempt to translate the estimated 17,000 legitimate, authenticated POST requests per second into the number of specially crafted requests required for a successful DDoS attack. The reason for using GET requests in the attack is simple: to initiate a successful attack via POST requests, valid Census IDs are required. Assuming one POST request is equal to 100 GET requests in terms of hardware resource requirements, the number grows to a whopping 1.7 million requests per second (approximately). These 1.7 million requests per second would have to be initiated from compromised computers and servers around the world, and a successful attack would require several times more machines than that to allow for the ABS firewalls blocking the offending IP addresses. We assume again that the lack of DDoS protection measures was not publicly known, and that a malicious party would have expected some sort of effective DDoS protection on an Australian government website.

Let’s settle on a factor of 10 (a successful DDoS attack would likely require much more, but let’s stay with x10). That would mean roughly 17 million compromised computers and servers around the world to mount a more or less successful DDoS attack of this size against a system tuned to handle 17,000 POST requests per second. In addition, running a DDoS attack (or any attack) means exposing the malicious IP addresses. These exposed addresses quickly end up on public blacklists, so the attacking systems cannot be reused. Financially, it’s simply not feasible to run a DDoS attack of this size when there’s no obvious financial or political profit in it.
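The projection above reduces to a couple of multiplications. The sketch below simply restates it; the 1:100 POST-to-GET cost ratio, the x10 allowance for blocked sources and the implicit rate of roughly one request per second per machine are all the footnote’s assumptions.

```python
# Restating the footnote's rough projection. Every input is an assumption from the text.
LEGIT_POST_RATE = 17_000        # estimated legitimate form submissions per second
GETS_PER_POST = 100             # assumed: one POST costs roughly as much as 100 crafted GETs
BLOCKING_ALLOWANCE = 10         # assumed spare capacity needed to survive IP blacklisting

equivalent_get_rate = LEGIT_POST_RATE * GETS_PER_POST        # ~1.7 million crafted requests/s
machines_needed = equivalent_get_rate * BLOCKING_ALLOWANCE   # ~17 million machines,
                                                             # assuming ~1 request/s per machine

print(f"Equivalent crafted requests: ~{equivalent_get_rate:,} per second")
print(f"Compromised machines required: ~{machines_needed:,}")
```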

It’s much more likely these numbers represent the Australian population submitting their online forms using their Census access codes for authentication.


Subscribe to DTIG

Subscribe to our Digital Transformation in Government series to keep up with how technology is transforming government. 
