“It’s no use making decisions if you’re using data that just isn’t relevant. You’ve got to consider the right data that is valuable to the organisation and to the decision being made.” Combined with data sharing and data integration, the issue of data value is one of three key focus areas for Dr Philip Woodall, a senior research scientist specialising in industrial data management at the Institute for Manufacturing, part of the Department of Engineering at the University of Cambridge.
An increasing number of organisations have been approaching the Institute for Manufacturing wanting to know how to solve their data management problems. As a result, managing industrial data has moved to the centre of Woodall’s practice.
“Many of us know the problems where you have data silos and the data is not connected between systems. If you connect it, you get all sorts of more valuable data and perhaps you’d solve some of the data quality issues,” he said.
Woodall has looked at research on data quality problems and found that many issues arise due to the way data is used. He draws a distinction between data reuse and data repurposing. “If we look at reuse of data, that’s using it for the same task you originally had in mind when you collected the data,” he said. On the other hand, data repurposing is using that data for some other purpose.
“Data quality is fitness for use. If you change the use, it may not be so fit anymore.”
He said that often data analysts and data scientists are using repurposed data, getting data from other people and using it to make decisions without knowing the assumptions that were made when that data was being collected. “We’re saying that data quality is defined as fitness for use, so if you start changing the use, that fitness may not be so fit anymore,” Woodall explained.
He and his colleagues did a survey of different manufacturing organisations and found that many were taking the data they collected to manage their processes and were repurposing it to improve those processes. They were taking the data from ERP systems, manufacturing execution systems, as well as inventory management and document management systems.
One example of repurposed data leading to poor data quality involved a colleague of Woodall’s, a data analyst working with a transportation company. He was asked to analyse a spreadsheet, but was puzzled to see that the expected delivery dates and actual delivery dates were identical.
“Other people can actually put problems in for you along the way.”
When he asked what had happened, staff at the company said: “We just copied and pasted the data over to that column for you.” When he asked why, the response was, “we thought you wanted more data.”
Said Woodall: “Never mind the inherent problems in the data. We ourselves and other people can actually put some nicer problems in for you along the way.”
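The copy-paste anecdote suggests a simple sanity check that could be run before any analysis. The sketch below is illustrative, not from Woodall’s work: it flags pairs of columns whose values are identical in every row, a pattern that may indicate duplicated rather than genuinely measured data. The function name and the sample records are assumptions.

```python
# Hypothetical data-quality check: flag column pairs whose values match in
# every row, which can indicate copy-pasted data rather than a real
# measurement (as with the identical expected/actual delivery dates).
from itertools import combinations

def identical_column_pairs(rows):
    """Return pairs of column names whose values are identical in all rows.

    `rows` is a list of dicts sharing the same keys, e.g. parsed from a
    spreadsheet export.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    suspect = []
    for a, b in combinations(columns, 2):
        if all(row[a] == row[b] for row in rows):
            suspect.append((a, b))
    return suspect

deliveries = [
    {"order": 1, "expected_delivery": "2024-03-01", "actual_delivery": "2024-03-01"},
    {"order": 2, "expected_delivery": "2024-03-05", "actual_delivery": "2024-03-05"},
]
print(identical_column_pairs(deliveries))  # [('expected_delivery', 'actual_delivery')]
```

A check like this would not prove the data is wrong, but it would have prompted the question the analyst eventually had to ask in person.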
Problems with data quality can also arise when aggregating or disaggregating data. If a warehouse takes delivery of a box containing 100 items, one data inputter may count it as one, while another may count it as 100.
Woodall and his fellow researchers created a framework that grouped the causes of data quality issues into three different channels. The first is the "no data" channel, where you can’t get the data because you don’t know it exists. The second is the data warehouse channel, where many issues arise from the transformation of data during the ETL process. The third is where third parties, such as external analysts or consultants, are making decisions based on the data, unaware of how it was collected.
“Astronomical figures can be wasted through something trivial and that’s what data quality is about.”
The industrial data specialist said that poor data quality can also stem from items or units being defined in different ways depending on who input the data. If sports shoes are described as "trainers" on one website but as "sneakers" on another, it can be very difficult to reconcile the two and recognise that both parties are talking about the same thing.
The most memorable example of this is NASA’s Mars Climate Orbiter Mission of 1999. A disaster investigation board found that the $125 million orbiter burned up upon entering the atmosphere of Mars because engineers had not converted measurements from imperial to metric.
“It’s astronomical figures that can be spent and wasted just through something trivial like that and that’s what data quality is about,” said Woodall. “It’s about a lot of these seemingly trivial things creating an enormous business impact and huge consequences.”
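The orbiter was lost because one system produced a value in imperial units while another consumed it as metric. A common defence is to tag every value with its unit and convert explicitly, rejecting anything unrecognised. The sketch below uses the standard conversion 1 lbf·s = 4.44822 N·s; the function and unit labels are assumptions for illustration.

```python
# Illustrative sketch of the Mars Climate Orbiter failure mode: an impulse
# reading in pound-force seconds consumed as if it were newton-seconds.
# Tagging each value with its unit and converting explicitly makes the
# mismatch a loud error instead of a silent one.
LBF_S_TO_N_S = 4.44822  # standard conversion factor

def to_newton_seconds(value, unit):
    """Convert an impulse reading to newton-seconds, rejecting unknown units."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"unknown impulse unit: {unit!r}")

reading = (10.0, "lbf*s")
print(to_newton_seconds(*reading))  # ~44.48 N*s, not the 10 a naive reader would assume
```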
“80% of the time in a project is spent wrangling the data.”
Data quality is of paramount importance because when it is poor, data scientists and data analysts can spend an inordinate amount of time housekeeping before getting to the “fun bit” of putting the data into an analysis tool or doing machine learning on it to get the insight.
“You’re messing around getting the data in the right form and they quote figures of 80% of the time in a project is spent wrangling the data,” warned Woodall.
Dr Philip Woodall was speaking at BCS – The Chartered Institute for IT.