“Data science has delivered a change and it’s making a real impact on what we do. The ability for us to use it further is amazing. Data science is the answer,” said ONS chief technology officer Simon Sandford-Taylor. This was the summation of his presentation of how implementing data science has transformed the way the national statistics institute of the UK operates.
The ONS is a non-ministerial department that reports directly to UK Parliament, informing the decisions that Parliament makes. “We produce, we collect, we analyse, we disseminate, we publicise the data across a wide range of economic, social and demographic information,” said Sandford-Taylor, who added that the data comes from in-person surveys and interviews.
The office produces 650 releases every year including GDP, population and immigration statistics, price statistics - including consumer price inflation based on the basket of goods - social statistics on neighbourhoods and crime, labour market statistics and vital event statistics such as births, deaths and marriages.
Despite all the detailed statistical analysis carried out on those figures, the British public is mainly interested in one thing: baby names. For the record, the most popular names in 2016 were Oliver and Olivia. Sandford-Taylor added that he and his team even created a presentation on the number of Game of Thrones-inspired names appearing in the top 100.
However, the ONS has more serious challenges. There are almost 650 systems designed around the surveys, statistics and outputs, so the data is largely held in silos. The volume of data has also grown: in the past, the team dealt with 100,000 or 200,000 pieces of survey-based data, but that has now risen to the millions. “If we use our traditional methods to do the statistics, it could take weeks or maybe months,” said Sandford-Taylor.
The ONS is also subject to regular reviews of how it operates; a recent one was the 260-page Independent Review of UK Economic Statistics, led by Professor Sir Charles Bean and published in 2016. Its conclusion was that the ONS could be doing more with data.
“There’s a lot of data out there in government. Business transacts with government, citizens transact with government quite a lot and all of this data is available to us through the Statistics Act,” said Sandford-Taylor. As a result, the ONS decided to start implementing data science.
The ONS had already established the Data Science Campus in March 2017 to train people in data science, and had also set up a university course and drawn up an enterprise architecture diagram. The diagram visualises the data going into the data management layer. It then undergoes processing and analysis and is finally pushed out for dissemination on the website.
Sandford-Taylor said he knows there are lots of good open source technologies for the above processes, such as Hadoop and MapReduce. However, his staff did not understand all of those technologies and did not have the time to work out which versions worked together.
Cloudera turned out to be the answer and is now where ONS data is stored and managed, and where the processes are run. “We really needed to find someone who could package that all up for us and deliver it. Their offering gave us a pre-configured environment that we could use and work around our entire data platform.”
The data sources are surveys as well as public data websites, API pools, private clouds and other government departments. The data arrives in several ways - ADI web transfers, file transfers, emails with attachments, as well as physical devices. The outputs are data flow management, file storage, data files and manifest metafiles.
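To make the idea of a manifest metafile concrete, here is a minimal Python sketch of how an incoming file might be paired with a manifest recording where it came from and a checksum to verify it later. The article does not describe the ONS's actual manifest format, so the function names (`build_manifest`, `verify`) and every field name below are assumptions made purely for illustration.

```python
import hashlib
from datetime import datetime, timezone


def build_manifest(file_name: str, payload: bytes, source: str, channel: str) -> dict:
    """Describe one incoming data file (all field names are hypothetical)."""
    return {
        "file_name": file_name,
        "source": source,            # e.g. "survey" or "other government department"
        "channel": channel,          # e.g. "file transfer" or "email attachment"
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "received_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(payload: bytes, manifest: dict) -> bool:
    """Check that a stored file still matches the size and checksum in its manifest."""
    return (
        len(payload) == manifest["size_bytes"]
        and hashlib.sha256(payload).hexdigest() == manifest["sha256"]
    )


if __name__ == "__main__":
    data = b"id,value\n1,10\n"
    manifest = build_manifest("survey.csv", data, "survey", "file transfer")
    print(manifest["file_name"], manifest["size_bytes"], verify(data, manifest))
```

Recording a checksum alongside the source and channel is one common way to let later processing stages confirm that a file was not altered or truncated in transit, whichever route it arrived by.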
The ONS uses a range of tools to process and explore the data, including Python, Spark, R and SAS. Sandford-Taylor and his team are also taking part in the beta programme for the Cloudera Data Science Workbench.
With this new methodology in place, Sandford-Taylor and his team were able to follow the recommendation of Sir Charles Bean and “do more with data.” One way in which they did this was to include VAT turnover data from HMRC in the production of GDP. “None of this would have been possible with previous methods,” said Sandford-Taylor. “What effectively we’ve done is really radically shown what can be done with bigger data and how it can affect our output and what it does to the whole process.”
With the automation of a number of the processes, some of the 450 people in the data collection unit who check whether the values are correct can be redeployed. The ONS can use that resource to explore new data sets and see how the data can give more insight so that they can provide better information for better decisions.
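As an illustration of the kind of value checking that can be automated, the Python sketch below flags survey returns that are negative or that differ sharply from the previous period, so that only those are passed to a person for review. The article does not say which rules the ONS applies; the function name `flag_suspect_values`, the threshold `max_ratio` and the sample figures are all hypothetical.

```python
def flag_suspect_values(current: dict, previous: dict, max_ratio: float = 3.0) -> list:
    """Return (respondent, reason) pairs for values that need a manual look.

    Hypothetical rules: a value is suspect if it is negative, or if it is more
    than max_ratio times larger or smaller than the previous period's value.
    Respondents with no previous value are not compared.
    """
    suspects = []
    for respondent, value in current.items():
        prior = previous.get(respondent)
        if value < 0:
            suspects.append((respondent, "negative value"))
        elif prior and (value > prior * max_ratio or value < prior / max_ratio):
            suspects.append((respondent, "large change from previous period"))
    return suspects


if __name__ == "__main__":
    # Made-up figures for illustration only.
    this_period = {"A": 100.0, "B": 950.0, "C": -5.0, "D": 40.0}
    last_period = {"A": 90.0, "B": 100.0, "C": 10.0}
    for respondent, reason in flag_suspect_values(this_period, last_period):
        print(respondent, reason)
```

With rules like these, most values pass straight through and staff time is spent only on the flagged exceptions.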
By doing so, the ONS can be, in the words of Sandford-Taylor, less like a factory and more like a consultant.
Simon Sandford-Taylor was speaking at Cloudera Sessions London.