Bad data is poorly structured, mis-formatted and plain ugly. It's a problem any analyst will be familiar with and one that never seems to get fixed for good. But could a new generation of tools, including those being developed around open source, not-for-profit models, hold the answer?
Open Knowledge International, a not-for-profit organisation, believes it has created just such a solution. According to Vitor Batista, a lead engineer there, a data set should go through four levels of validation before analysis. Basic validation makes sure the file can be loaded or opened and has not been corrupted. Structural validation checks that it is a proper table: that all rows have the same number of columns and there are no blank rows or columns. Content validation tests the content itself, checking that values sit within constraints, for example that numbers are greater than a certain figure. Advanced checks allow users to create custom checks for their own data sets.
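The structural level Batista describes is simple enough to sketch in a few lines. The function and sample data below are illustrative only, not Good Tables' own code:

```python
import csv
import io

def validate_structure(rows):
    """Check that every row has the same number of columns
    and that no row is completely blank."""
    problems = []
    if not rows:
        return ["file is empty"]
    width = len(rows[0])  # header row sets the expected column count
    for i, row in enumerate(rows, start=1):
        if len(row) != width:
            problems.append(f"row {i}: expected {width} columns, got {len(row)}")
        if all(cell.strip() == "" for cell in row):
            problems.append(f"row {i}: blank row")
    return problems

sample = "name,age\nalice,34\nbob\n"  # last row is missing a column
rows = list(csv.reader(io.StringIO(sample)))
print(validate_structure(rows))  # reports the short third row
```

Content validation would add a second pass over the same rows, testing each typed value against its constraints.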
One tool that can carry out these checks is Good Tables, a free website that allows users to validate tabular data and check that it is good to go. Batista explained that it is intended for tabular data and can validate formats such as CSV and Excel files.
Good Tables can also make sure that data fits a specific schema. But what is that exactly? According to the IBM Dictionary of Computing, it is a centralised repository of information about data, such as its meaning, relationships to other data, origin, use and format.
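In practice, a tabular schema lists each field with its type and any constraints its values must satisfy. The sketch below follows the general shape of the Frictionless Table Schema format, but the field names and the hand-rolled constraint check are illustrative assumptions, not the Good Tables API:

```python
# An illustrative schema descriptor: each field has a name, a type,
# and optional value constraints (field names are made up).
schema = {
    "fields": [
        {"name": "station_id", "type": "integer",
         "constraints": {"minimum": 1}},
        {"name": "temperature", "type": "number",
         "constraints": {"minimum": -90, "maximum": 60}},
    ]
}

def check_row(row, schema):
    """Return constraint violations for one row of already-typed values."""
    errors = []
    for field, value in zip(schema["fields"], row):
        c = field.get("constraints", {})
        if "minimum" in c and value < c["minimum"]:
            errors.append(f"{field['name']}: {value} < {c['minimum']}")
        if "maximum" in c and value > c["maximum"]:
            errors.append(f"{field['name']}: {value} > {c['maximum']}")
    return errors

print(check_row([7, 112.0], schema))  # temperature exceeds its maximum
```

A validator that ships the schema alongside the data can run these checks automatically on every new file it receives.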
“Data validation is super-useful if you are a producer or consumer.”
Batista said: “Data validation is super-useful if you are a data producer or data consumer, which are usually the same people.” He placed Good Tables in the context of a project that Open Knowledge International has been working on called Frictionless Data.
Batista explained that this concept is about the workflow of getting data from the source to the starting point of analysis. “We think there is a lot of friction considering the data quality is there. The data quality itself is not the whole story. I need metadata, I need to know the data dictionary, the license of the data, where it comes from, what’s the source, who’s the author. This kind of information needs to be together with the data so you can understand it,” he said.
“Friction stops us getting insight and solving important problems.”
Rufus Pollock, the president and founder of Open Knowledge International, explained in a video that a lot of time is spent collecting and preparing data, leaving very little time to turn it into insight. He said: “This friction stops us getting insight and solving important problems.” The aim is to eliminate the friction of getting data from tool A to tool B.
Pollock said they can put data into data packages, something akin to shipping containers in the physical world. They take data from a spreadsheet and put it inside a virtual package, making it more efficient to load the data into the tools that users already have. With the data inside a data package, users can validate the data automatically, store and search it in standard ways, import it to their specific tool or export it from the tool automatically.
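The container idea boils down to a descriptor file that travels with the data. The snippet below zips a hypothetical datapackage.json next to its CSV, mimicking the Data Package layout for illustration rather than using the official library; the names and values are invented:

```python
import io
import json
import zipfile

# A minimal descriptor carrying the metadata Batista mentions: licence,
# source and a schema for the data file (all values here are examples).
descriptor = {
    "name": "rainfall-2016",
    "licenses": [{"name": "CC-BY-4.0"}],
    "sources": [{"title": "Example weather agency"}],
    "resources": [{"path": "data.csv",
                   "schema": {"fields": [{"name": "month", "type": "string"},
                                         {"name": "mm", "type": "number"}]}}],
}
csv_data = "month,mm\njan,84.2\nfeb,61.0\n"

# Bundle the metadata and the data into one portable archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("datapackage.json", json.dumps(descriptor, indent=2))
    z.writestr("data.csv", csv_data)

# Any consumer can open the package and read data and metadata together.
with zipfile.ZipFile(buf) as z:
    meta = json.loads(z.read("datapackage.json"))
print(meta["resources"][0]["path"])  # data.csv
```

Because the descriptor names the files, their schemas and their provenance, tools on either side of a transfer can load, validate and export the data without side-channel documentation.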
“We can start moving data around between the tools without losing information.”
Batista concluded by saying: “When I have this common language, the tools communicate between each other so we can start moving data around between the tools without losing information from the data, without losing this metadata.” He explained his long-term vision for Good Tables by saying: “The idea is to build materials for people working in data who are technical, but are not developers.”
Vitor Batista was speaking at the offices of the Open Data Institute. The ODI has made a recording of the presentation available.