Data Quality: The bigger they are, the harder they fall

Big Data could mean big data quality problems, unless you understand what you are doing. Nigel Turner, vice-president, information management strategy at Harte-Hanks Trillium Software, explains how to follow the four Vs that will lead to success.

In the UK, 2012 will be remembered as the year of sport. Highlights have included the London Olympics, Bradley Wiggins winning the Tour de France (despite his wind-resistant sideburns), Murraymania at Wimbledon and England again blowing their Euro 2012 hopes in a penalty shoot-out. Another key 2012 trend is Big Data, so a sporting aphorism seems appropriate.

“The bigger they are, the harder they fall” was coined by the British boxer Bob Fitzsimmons. He was a world heavyweight champion, but was light enough to also hold the middleweight world championship. He said the phrase in 1900 before a fight when faced with a much bigger opponent. Fitzsimmons emerged victorious, proving that size doesn’t always count.

And so it is with Big Data. Its collection, collation, analysis and exploitation can lead to corporate success. If implemented badly, it can just as likely become an unfulfilled promise, with companies drowning in a muddy morass of disparate data.

It’s worth highlighting some basic facts about Big Data, as it’s a term that can cause confusion. In some respects it is both new and disruptive, but it is not a total revolution. It’s an evolution of existing trends in IT, in data growth and in data management. Big Data is hard to define precisely because it is an umbrella term for a set of concepts, practices and technologies used to create, collect, store, process and exploit data.

What is undoubtedly new is the way data is viewed. Traditional business intelligence (BI) selects certain data elements of particular value from the vast quantities of data produced, aggregates them and produces selected reports and insights. Big Data, in contrast, sees all data as being an asset, from what I ate for breakfast this morning to how many people turned off their heating in Bristol last Sunday when a rare summer day blossomed. So at the heart of Big Data is the idea that all data has potential for exploitation. The trick is how to make it happen.

It is easier to explain what this means in practice by looking at the characteristics of Big Data. The most common description involves the three Vs - Volume, Velocity and Variety - which are also sometimes labelled the three Ss - Scale, Speed and Scope:

Volume (or Scale) - whereas on average IT budgets may grow at best at 5 per cent a year, data volumes are growing by around 50 per cent a year. Big Data is in part a response to this conundrum. How can this data be managed and exploited more cost-effectively?
Velocity (or Speed) - Much of the data generated is in real or near-real time. This requires new techniques for collecting, sorting, storing, analysing and reporting on the data. To be of value, this needs to happen very quickly. So a key capability is to be able to track millions of events per second, analyse them and derive actionable information.
Variety (or Scope) - 80 per cent of the data generated today is semi-structured or unstructured. Traditional BI/data warehousing environments, built predominantly for structured data, are ill-equipped to deal with this. New approaches are needed.
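
To make the Volume conundrum concrete, the short calculation below compounds the two growth rates quoted above. It is only an illustrative sketch: the five-year horizon, the starting index of 1.0 and the function names are assumptions made for this example, not figures or methods from any cited source.

```python
def compound(rate: float, years: int, start: float = 1.0) -> float:
    """Return `start` after growing by `rate` per year for `years` years."""
    return start * (1.0 + rate) ** years


BUDGET_GROWTH = 0.05  # "at best 5 per cent a year"
DATA_GROWTH = 0.50    # "around 50 per cent a year"


def gap_after(years: int) -> float:
    """Ratio of data growth to budget growth after `years` years."""
    return compound(DATA_GROWTH, years) / compound(BUDGET_GROWTH, years)


if __name__ == "__main__":
    # Assumed horizon of five years, chosen purely for illustration.
    for year in range(1, 6):
        print(
            f"year {year}: budget x{compound(BUDGET_GROWTH, year):.2f}, "
            f"data x{compound(DATA_GROWTH, year):.2f}, "
            f"gap x{gap_after(year):.2f}"
        )
```

On these assumptions, data volumes grow about 7.6 times over five years while budgets grow about 1.3 times, so each pound of IT budget has to handle roughly six times as much data.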

Where is all this Big Data coming from? The mass of data derived from clickstream, cookies and so on is seen as providing insight into both the effectiveness of websites and the individual preferences and behaviour of users. Downloading and purchasing music online tells vendors a great deal about their customers’ likes and dislikes, so helping them to customise individual marketing. One rapidly growing source of Big Data is mobile devices - it is estimated that 4.3 billion are now in use and each generates a wealth of potential information on its user’s activities.

Another major growth engine has been social networking. Facebook had 1 million members in 2004 - today it has 901 million active members globally. These members generate over 15 terabytes of data daily. On Twitter, 200 million tweets are made each day, amounting to 12 terabytes of new data every 24 hours. If exploited appropriately, this data can be of great value to organisations.

There is also a third area of Big Data. This centres on the vast amounts of data which are increasingly being created automatically. Point of Sale terminals generate transaction-based information on customers and products. Walmart, for example, collects over 1 million customer transactions every hour. This data is streamed into massive data stores currently containing over 2.5 petabytes of data - that’s 160 times the data held in all the books in the US Library of Congress.

More data is being generated by the 30 billion or so RFID/QF tags now in use globally. And many machines are also generating vast quantities of data - the A380 Airbus is viewed as the world’s smartest plane as it is said to run on over one billion lines of code. It contains a large number of sensors and microprocessors, each monitoring and reporting on the health of its systems. For instance, each of its four engines generates 20 terabytes of data on its performance in flight every hour. This data is used to aid diagnostics and enable proactive engine maintenance. In the Utilities sector, the introduction of smart meters, which constantly monitor and report on usage of gas, water and electricity, is a further example of increasingly automated data generation.

Accessing and exploiting social network data is seen as a potential goldmine for marketers and others who want to better understand the customer in Retail and FMCG. In Insurance and Banking, traditional in-house data is being combined with social networking data as an increasingly powerful weapon against fraud.

Manufacturers are also increasingly using Big Data. Every Volvo has hundreds of microprocessors and sensors. The data generated is used by the vehicle itself, but is also captured for analysis by Volvo and its network of dealers. This data is loaded onto a central analysis hub and integrated with the company’s own CRM, dealership and product data stores. This enables Volvo (among other things) to spot design and construction flaws early, enable proactive correction of faults, and see how the vehicles respond in accidents.

This data explosion requires new approaches to handle it. At the core is the development of new technologies specifically designed to process the volume, velocity and variety of Big Data. These flexible and scalable IT platforms are based on architectures of highly-distributed clusters of platforms and data stores with multiple servers working in parallel, each with its own processing capability. Sitting on these platforms is the software needed to split, sort, analyse and merge the data, most commonly Hadoop and MapReduce. These platforms are also complemented by new types of databases (usually classed as NoSQL databases) which offer lower cost non-relational database structures and more flexible and distributed schemas.
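
The split, sort, analyse and merge pattern can be sketched in a few lines of plain Python. This is a single-machine illustration of the MapReduce idea only. It does not use Hadoop or any of its APIs, and the function names and sample records are invented for the example. In a real cluster each split would be processed by a different server in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(record: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle and sort: group all emitted values by key, ordered by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: merge all values for one key into a single result."""
    return key, sum(values)


def map_reduce(splits: List[List[str]]) -> Dict[str, int]:
    """Run map, shuffle and reduce over a list of input splits."""
    emitted: List[Tuple[str, int]] = []
    # Each split stands in for the share of data held by one server.
    for split in splits:
        for record in split:
            emitted.extend(map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    # Hypothetical input, already divided into two splits.
    splits = [["big data big value"], ["data quality", "big quality"]]
    print(map_reduce(splits))
```

The point of the design is that the map and reduce steps need no shared state, which is what allows the work to be spread across many servers.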

Big Data is also different as it requires novel approaches to data analysis. To meet this, the new specialism of “data science” is emerging - the comprehensive understanding of where data comes from, what data represents, and how to turn data into actionable information. This encompasses statistics, hypothesis testing, and predictive modelling. Many of the key tools and approaches of data science are not new, such as data mining and predictive analytics. But it is the ability to do this over large and diverse data sets at high speed that makes it different.

Finally, the distributed nature of Big Data requires that the insight generated be made available quickly to all who need it. It needs to become both a by-product and an integral part of every operational business process. Big Data means the democratisation of data.

Should your organisation invest in these Big Data capabilities? Evidence is beginning to emerge that early adopters of Big Data are reaping rewards. In a recent survey by Avanade of over 500 senior business and IT executives in 18 countries, 84 per cent claimed that their early Big Data initiatives had already helped them to make faster and better decisions, while 73 per cent said it had already boosted their revenues.

It is even being said that Big Data will boost economies as a whole. A report by the UK-based Centre for Economics and Business Research claims that Big Data will contribute an additional £216 billion to the UK economy over the next five years. It is also forecast to create 58,000 new jobs in the UK, not just through data science, but more widely through the new revenues it will generate.

So if it is worth investing in Big Data, how do you ensure that the investment you make reaps rewards for your organisation? There are four foundation stones that need to be put in place:

Identifying the right data to solve the business problem or opportunity. As with any data management challenge, it is vital to be clear about what the business wants to achieve - what is the business problem or opportunity that Big Data can help with? Start to seek out the best and most appropriate data sources to meet that need. Do not collect data simply because you can.
The ability to integrate and match varied data from multiple data sources. Much of the potential value of Big Data will rely on your ability to associate external data sources with your existing internal data sources. Here data quality will become a critical enabler - the higher the quality of your internal data the easier the challenge becomes.
Building the right IT infrastructure to support Big Data applications. You will need to invest in new tools and platforms to make Big Data work. But also remember to invest in the data itself - garbage in, garbage out is as true today as it ever was. Many external sources will need to be scrubbed and enriched before they can be integrated and exploited.
Having the right capabilities and skills to exploit the data. Ensure that you have the right people with the right skills in data science and general data disciplines to make Big Data happen. If not, start upskilling soon and look to supplement with expertise from third party organisations.
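
The second foundation stone, matching external data to internal data, is where data quality bites first. The sketch below is a deliberately simple illustration: it assumes both sources carry an email address, and the field names (`customer_id`, `email`, `handle`) and the normalisation rule are hypothetical choices for this example. Real matching usually needs fuzzier techniques than an exact key comparison.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]


def normalise_email(raw: object) -> Optional[str]:
    """Return a trimmed, lower-cased email, or None if it is unusable."""
    if not isinstance(raw, str):
        return None
    cleaned = raw.strip().lower()
    return cleaned if "@" in cleaned else None


def match_records(
    internal: List[Record], external: List[Record]
) -> Tuple[List[Record], List[Record]]:
    """Join external records to internal ones on the normalised email."""
    index: Dict[str, Record] = {}
    for rec in internal:
        key = normalise_email(rec.get("email"))
        if key is not None:
            index[key] = rec

    matched: List[Record] = []
    unmatched: List[Record] = []
    for rec in external:
        key = normalise_email(rec.get("email"))
        if key is not None and key in index:
            # Enrich the internal record with the external attribute.
            matched.append({**index[key], "handle": rec.get("handle")})
        else:
            unmatched.append(rec)
    return matched, unmatched


if __name__ == "__main__":
    # Hypothetical CRM and social network extracts.
    crm = [{"customer_id": 1, "email": "Ann@Example.com "}]
    social = [
        {"handle": "@ann", "email": "ann@example.com"},
        {"handle": "@carl", "email": None},
    ]
    print(match_records(crm, social))
```

Without the normalisation step the first record would fail to match because of a capital letter and a trailing space, which is the kind of small defect that makes poor internal data so costly to integrate.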

In the high volume, high velocity, high variety Big Data environment, managing the data will not be a trivial task, especially as data will be highly distributed and will change frequently. If this data management challenge is not met, problems will inevitably arise.

Note also how many of these enablers refer back to the old challenges of data integration and data quality. These will become increasingly important in the Big Data world. It’s therefore critical to:

Profile data sources to gain an understanding of their format and content. This is also vital to assess their fitness for purpose.
Have a clear policy and strategy as to how data will be managed within your Big Data environment. This should cover security, legal and regulatory adherence, retention and so on. Back it up with the definition, implementation and enforcement of the business rules applied to that data. In a widely-distributed environment, it’s also critical to communicate and share those rules and associated metadata.
Recognise that none of this will happen without clear ownership and leadership. Also, these data management and data governance responsibilities will extend outside the enterprise to a much greater extent than is often true today.

Big Data actually needs to be about four, not three Vs - Validity must be recognised as the fourth dimension of Big Data. Investing in Big Data without a concomitant focus on managing data quality and data governance is doomed to fail. The good news is that many of the data quality capabilities needed are available now.
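
Profiling a source against business rules, as recommended above, can start very simply. The sketch below measures completeness and validity per field. The smart-meter style records, the field names and both rules (a simplified UK postcode pattern and a non-negative reading) are hypothetical, and this is not how any particular data quality product works.

```python
import re
from typing import Callable, Dict, List

Record = Dict[str, object]
Rule = Callable[[object], bool]

# Simplified postcode pattern, assumed for illustration only.
POSTCODE = re.compile(r"[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}")


def is_postcode(value: object) -> bool:
    """Business rule: value looks like an upper-case UK postcode."""
    return isinstance(value, str) and POSTCODE.fullmatch(value) is not None


def is_non_negative_number(value: object) -> bool:
    """Business rule: value parses as a number that is zero or more."""
    try:
        return float(value) >= 0  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return False


def profile(records: List[Record], rules: Dict[str, Rule]) -> Dict[str, Dict[str, float]]:
    """Report completeness, validity and distinct count for each field."""
    total = len(records)
    report: Dict[str, Dict[str, float]] = {}
    for field, is_valid in rules.items():
        present = [r.get(field) for r in records if r.get(field) not in (None, "")]
        valid = [v for v in present if is_valid(v)]
        report[field] = {
            "completeness": len(present) / total if total else 0.0,
            "validity": len(valid) / len(present) if present else 0.0,
            "distinct": float(len(set(present))),
        }
    return report


if __name__ == "__main__":
    readings = [
        {"postcode": "BS1 4DJ", "kwh": "12.5"},
        {"postcode": "", "kwh": "-3"},
        {"postcode": "bristol", "kwh": "7.0"},
    ]
    print(profile(readings, {"postcode": is_postcode, "kwh": is_non_negative_number}))
```

Keeping the rules as named functions separate from the profiling loop makes them easy to share and enforce across a distributed environment, in line with the policy point above.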

As Bear Bryant, a US football coach, once said: “It’s not the will to win but the will to prepare to win that makes the difference.” Prepare for Big Data success by investing in data quality. In this 2012 year of sporting success, the winners will always be the best prepared.
(This article is based on a Trillium Software webinar delivered by the author in June 2012.)

DataIQ is a trading name of IQ Data Group Limited
10 York Road, London, SE1 7ND
