Big data could mean big data quality problems unless you understand what you are doing. Nigel Turner, vice-president of information management strategy at Harte-Hanks Trillium Software, explains how to follow the three Vs that will lead to success.
In the UK, 2012 will be remembered as the year of sport. Highlights have included the London Olympics, Bradley Wiggins winning the Tour de France (despite his wind-resistant sideburns), Murraymania at Wimbledon and England again blowing their Euro 2012 hopes in a penalty shoot-out. Another key 2012 trend is Big Data, so a sporting aphorism seems appropriate.
“The bigger they are, the harder they fall” was coined by the British boxer Bob Fitzsimmons. He was a world heavyweight champion, but was light enough to also hold the middleweight world championship. He said the phrase in 1900 before a fight when faced with a much bigger opponent. Fitzsimmons emerged victorious, proving that size doesn’t always count.
And so it is with Big Data. Its collection, collation, analysis and exploitation can lead to corporate success. Implemented badly, however, it can just as easily become an unfulfilled promise, with companies drowning in a muddy morass of disparate data.
It’s worth highlighting some basic facts about Big Data, as it’s a term that can cause confusion. In some respects it is both new and disruptive - but it is not a total revolution. It is an evolution of existing trends in IT, in data growth and in data management. Big Data is hard to define precisely because it is an umbrella term for a set of concepts, practices and technologies used to create, collect, store, process and exploit data on this new scale.
What is undoubtedly new is the way data is viewed. Traditional business intelligence (BI) selects certain data elements of particular value from the vast quantities of data produced, aggregates them and produces selected reports and insights. Big Data, in contrast, sees all data as being an asset, from what I ate for breakfast this morning to how many people turned off their heating in Bristol last Sunday when a rare summer day blossomed. So at the heart of Big Data is the idea that all data has potential for exploitation. The trick is how to make it happen.
It is easier to explain what this means in practice by looking at the characteristics of Big Data. The most common description involves the three Vs - Volume, Velocity and Variety - though it is also sometimes classified as the three Ss - Scale, Speed and Scope:

- Volume (Scale): the sheer quantity of data now being created, measured in terabytes and petabytes rather than gigabytes.
- Velocity (Speed): the rate at which data is generated and must be processed, increasingly in near real time.
- Variety (Scope): the diversity of data types, from structured transaction records to unstructured text, images, audio and sensor feeds.
Where is all this Big Data coming from? The mass of data derived from clickstreams, cookies and so on is seen as providing insight into both the effectiveness of websites and the individual preferences and behaviour of their users. Downloading and purchasing music online tells vendors a great deal about their customers’ likes and dislikes, helping them to customise individual marketing. One rapidly growing source of Big Data is mobile devices - an estimated 4.3 billion are now in use, each generating a wealth of potential information on its user’s activities.
Another major growth engine has been social networking. Facebook had 1 million members in 2004 - today it has 901 million active members globally. These members generate over 15 terabytes of data daily. On Twitter, 200 million tweets are made each day, amounting to 12 terabytes of new data every 24 hours. If exploited appropriately, this data can be of great value to organisations.
There is also a third area of Big Data, centring on the vast amounts of data increasingly being created automatically. Point-of-sale terminals generate transaction-based information on customers and products. Walmart, for example, collects over 1 million customer transactions every hour. This data is streamed into massive data stores currently containing over 2.5 petabytes of data - that’s 160 times the data held in all the books in the US Library of Congress.
More data is being generated by the 30 billion or so RFID tags now in use globally. And many machines are also generating vast quantities of data - the Airbus A380 is viewed as the world’s smartest plane, as it is said to run on over one billion lines of code. It contains a large number of sensors and microprocessors, each monitoring and reporting on the health of its systems. For instance, each of its four engines generates 20 terabytes of data on its in-flight performance every hour. This data is used to aid diagnostics and enable proactive engine maintenance. In the Utilities sector, the introduction of smart meters, which constantly monitor and report on usage of gas, water and electricity, is a further example of increasingly automated data generation.
Accessing and exploiting social network data is seen as a potential goldmine for marketers and others in Retail and FMCG who want to understand the customer better. In Insurance and Banking, traditional in-house data is being combined with social networking data as an increasingly powerful weapon against fraud.
Manufacturers are also increasingly using Big Data. Every Volvo carries hundreds of microprocessors and sensors. The data they generate is used by the vehicle itself, but is also captured for analysis by Volvo and its network of dealers. It is loaded onto a central analysis hub and integrated with the company’s own CRM, dealership and product data stores. This enables Volvo (among other things) to spot design and construction flaws early, correct faults proactively, and see how its vehicles respond in accidents.
This data explosion requires new approaches to handle it. At the core is the development of new technologies specifically designed to process the volume, velocity and variety of Big Data. These flexible, scalable IT platforms are based on architectures of highly distributed clusters of servers and data stores working in parallel, each node with its own processing capability. Sitting on these platforms is the software needed to split, sort, analyse and merge the data, most commonly Hadoop and its MapReduce programming model. These platforms are also complemented by new types of databases (usually classed as NoSQL databases), which offer lower-cost, non-relational database structures and more flexible, distributed schemas.
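To make the split-sort-merge idea concrete, the sketch below simulates the MapReduce pattern in plain Python. It runs in a single process purely for illustration - on a real Hadoop cluster the map and reduce functions would be distributed across many nodes - and the word-count task is the customary minimal example, not anything specific to the products named above.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, value) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle/sort: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: merge each key's values into a single aggregate per key."""
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    # Tiny stand-in for a large, distributed collection of text records
    records = [
        "big data means big opportunity",
        "big data needs good data quality",
    ]
    counts = reduce_phase(shuffle_phase(map_phase(records)))
    print(counts)  # {'big': 3, 'data': 3, 'means': 1, ...}
```

Because the map and reduce steps share no state, each can be farmed out to hundreds of nodes working in parallel, which is what allows such platforms to scale to the volumes described earlier.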
Big Data is also different in that it requires novel approaches to data analysis. To meet this need, the new specialism of “data science” is emerging - the comprehensive understanding of where data comes from, what it represents, and how to turn it into actionable information. This encompasses statistics, hypothesis testing and predictive modelling. Many of the key tools and approaches of data science are not new, such as data mining and predictive analytics; it is the ability to apply them to large and diverse data sets at high speed that makes the difference.
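As a flavour of the predictive modelling mentioned above, here is a hedged sketch that trains a simple classifier to predict customer churn. Everything in it is illustrative: the data is synthetic, the two feature names are invented, and the scikit-learn library is assumed to be available - one common toolkit for this kind of work, not one prescribed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features: [monthly_spend, days_since_last_purchase]
# (both names invented purely for illustration)
X = rng.normal(loc=[50.0, 30.0], scale=[20.0, 15.0], size=(1000, 2))

# Synthetic label: customers idle the longest are likeliest to churn
y = (X[:, 1] + rng.normal(0.0, 5.0, 1000) > 40.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```

The technique itself (logistic regression) long predates Big Data; what changes at Big Data scale is the volume and variety of inputs such a model can draw on, and the speed at which it must be retrained and scored.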
Finally, the distributed nature of Big Data means that the insight generated must be made available quickly to all who need it. It needs to become both a by-product of and an integral part of every operational business process. Big Data means the democratisation of data.
Should your organisation invest in these Big Data capabilities? Evidence is beginning to emerge that early adopters of Big Data are reaping rewards. In a recent survey by Avanade of over 500 senior business and IT executives in 18 countries, 84 per cent claimed that their early Big Data initiatives had already helped them to make faster and better decisions, while 73 per cent said it had already boosted their revenues.
It is even being said that Big Data will boost economies as a whole. A report by the UK-based Centre for Economics and Business Research claims that Big Data will contribute an additional £216 billion to the UK economy over the next five years. It is also forecast to create 58,000 new jobs in the UK, not just through data science, but more widely through the new revenues it will generate.
So if it is worth investing in Big Data, how do you ensure that the investment reaps rewards for your organisation? There are four foundation stones that need to be put in place:
In the high-volume, high-velocity, high-variety Big Data environment, managing the data will not be a trivial task, especially as it will be highly distributed and will change frequently. If this data management challenge is not met, many problems will inevitably arise.
Note also how many of these enablers refer back to the old challenges of data integration and data quality. These will become increasingly important in the Big Data world, so it is critical to address them from the outset.
As Bear Bryant, a US football coach, once said: “It’s not the will to win but the will to prepare to win that makes the difference.” Prepare for Big Data success by investing in data quality. In this 2012 year of sporting success, the winners will always be the best prepared.
(This article is based on a Trillium Software webinar delivered by the author in June 2012.)