Big data is already in your organisation; you just may not realise it yet. If you have noticed the elephant in the room, you might want to see whether it can fly. That is the ambition of the big data movement, but be wary of what you are handling, writes David Reed.
Big data has been big news recently. From The Economist to McKinsey, it has attracted heavyweight interest of the sort that usually suggests a major trend is underway. It even has its own video - enter the term in YouTube when you have a spare minute. You’ll know when you have found it - it features an elephant on a trampoline. Although too long and slightly overdone, it does serve as a useful visual metaphor of what the big data movement is all about - getting something large and unwieldy to become agile and interesting.
So is big data the next elephant in the room that any organisation involved in data management will soon be talking about? Or is it a unicorn, appearing to have magical properties but actually more of an illusion than a reality? For anybody who lived through the hype cycle around CRM, much of the excitement about big data will feel familiar. It will also set off alarm bells. As when dealing with any wild animal, caution is advisable.
To understand why big data is different from conventional data management, look no further than PayPal. With 94 million active accounts and daily transactions worth £50 million, it has significant issues of scale to deal with already. Spotting potentially fraudulent payments while ensuring smooth checkout procedures is a major challenge. Yet the company has been handling all of this through the enterprise data warehouse it built in 2004. This is actually a shared platform, with eBay using the other half of its processing power.
From an original 3 terabytes, transactional data volumes have grown to 250tb. What the company now wants to do is drill into customer behaviour and behavioural analytics, areas that it sees as a challenge within the EDW. “We have all this information available and want to understand and use it in a closed feedback loop to deliver more value to the customer. That is part of a power shift,” says a source within the business.
The question PayPal is asking itself is how best to do that. With data flowing in from clickstreams and social sources, “the question is whether to manage that in a big data solution or our existing EDW.” Existing business users of the data warehouse need to be sure their service levels are met without disruption from the new data mining activity.
“Behavioural analytics is not a passing fad - it is core to the business. The current level of support won’t be nearly enough,” says the source. PayPal has already adopted an open analytics platform approach to allow its 300-strong team of analysts to explore these new data sets freely. “The business may not be able to tell you what it will do with those insights, but as soon as it figures out how to deliver it, they will be using it.”
One outcome might be the introduction of a new analytical environment based around the core big data tool Hadoop. First developed ten years ago within Yahoo!, it was handed to the Apache Software Foundation in 2009, which has been co-ordinating the developer community around this tool. Alternatively, PayPal may extend its existing Teradata data warehouse environment using new architecture tools like Aster Data.
Decisions like this will have a big impact on the way any business which relies on data is able to operate over the next ten years. And it is not something that can just be ignored. As the PayPal source says: “Behavioural analytics is where financial analytics was five years ago.”
Yet it is not easy to choose how to deal with the big data question, as PayPal’s deliberations make clear. Part of the reason is Hadoop’s open source nature - that reduces the commercial imperative for major data management providers to provide solutions. Most of the action is around appliances - pre-packaged combinations of the Hadoop framework with analytical software and hardware capable of handling high-volume processing. Just last October Oracle introduced its Big Data Appliance, for example.
Just as you wouldn’t bring a real elephant into the boardroom, so it might not be appropriate to bring the big data elephant into the existing data warehouse. That the elephant exists, however, is undeniable. “There is an explosion of data from less conventional sources,” points out Nick Millman, senior director, Accenture Information Management Services UK & Ireland. “There are questions about how organisations are going to deal with the increased complexity and number of sources, many of them outside of the boundaries of the organisation.”
This is a defining feature of big data which makes conventional data management approaches less capable. Behavioural analytics is undoubtedly the big money play out of big data, but it demands incorporating data flows from third parties, especially via APIs with social networks. That data may be incomplete or partial in coverage, which challenges classic processes.
“Organisations are going to need a hybrid of approaches. The traditional data warehouse still has a role to play. As you start to look at sources like blogs and social networks, they don’t lend themselves to relational database storage and analytical approaches,” says Millman. The power of applications specifically architected for big data around Hadoop or NoSQL is that they can still return valuable insights from much sketchier data inputs.
The return on investment may not be obvious, since data mining in this space is all about looking for patterns that may have an importance for the business, rather than testing hypotheses. Millman notes that, “it will be a challenge to build the business case. You don’t know what you don’t know.”
Some of those unknowns can at least be assumed. In financial services, for example, fraud is a constant problem. More data means better fraud detection. “The value of an insight is when you see that customer A is more likely to make a fraudulent claim than customer B,” says Millman.
This is where the behavioural analytics approach could really score. The Association of British Insurers recently revealed that fraudulent claims on pet insurance rose to £2 million in 2010 from £460,000 in 2009, in some cases involving claims on pets that did not even exist. Social network data analysis may yet reveal who is - or is not - a pet owner during claims handling by checking for blog references to an animal. That is a small-scale example for an approach with a much wider application.
On the other side of the business case, Millman says that cost-savings will need to be identified through better information lifecycle management. “If you just keep storing data, you need a huge infrastructure, so you need an ILM strategy to retire records,” he says. Add this to the list of things about big data that organisations are likely to struggle with because, as Millman points out, “ILM has not been something organisations are good with.”
If you are looking for a convincing business example that is driven by big data, albeit of the high-volume, conventionally structured variety, then Catalina Marketing in the US is a prime candidate. It is the market-leading “at till” couponing provider, driving out promotional offers via the till receipt for major FMCG brands and large grocery chains.
At the heart of its model is market basket analysis. “It takes millions of shopping baskets every day and analyses what people bought and in what combinations, looking to identify patterns,” says Dai Clegg, business analytics product marketing leader, EMEA, IBM.
Using a Netezza appliance to process and score these till returns, Catalina is able to automate triggers through its system that are executed at point of sale. In the past, this used to mean offering a promotion based on a propensity score. Redemption rates were typically around 2 per cent.
“Now it can spot what the most likely thing is that somebody hasn’t bought, so instead of just getting an offer, shoppers get a recommendation based on clever mathematical algorithms using large volumes of data about shopping patterns. Redemption rates have soared to 25 per cent,” says Clegg. “The only reason it works is because they have got an effective solution for market basket analysis.”
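To make the idea concrete, here is a minimal sketch of recommending "the most likely thing somebody hasn't bought" from basket co-occurrence counts. It is an illustration only, not Catalina's or IBM's method: the function names, the toy purchase history and the simple pair-counting score are all assumptions made for the example, and a production system would work at vastly greater scale with more sophisticated models.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def recommend(basket, baskets):
    """Return the product missing from `basket` that most often co-occurs
    with its contents in past baskets, or None if nothing co-occurs."""
    counts = pair_counts(baskets)
    have = set(basket)
    scores = Counter()
    for (a, b), n in counts.items():
        if a in have and b not in have:
            scores[b] += n
        elif b in have and a not in have:
            scores[a] += n
    if not scores:
        return None
    # Highest score wins; ties are broken alphabetically for a stable result.
    return min(scores, key=lambda p: (-scores[p], p))


# Hypothetical toy purchase history (illustrative data only)
history = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["bread", "butter", "milk"],
    ["milk", "cereal"],
]

print(recommend(["bread"], history))
print(recommend(["bread", "butter"], history))
```

Counting pairs is the simplest possible scoring rule; it is used here only because it shows the shift from "offer a promotion" to "recommend the missing item" in a few lines.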
To big data purists, structured data around customers, products and sales does not qualify. But the scale of the activity carried out by Catalina Marketing - it tracks 75 per cent of all US shopping and runs three-quarters of a petabyte through its Netezza appliance - makes it hard to argue that its data challenges are not big.
Indeed, it was as a result of performance issues in its data warehouse that Catalina first deployed Netezza. Scaling of data may be one reason why so many companies are suddenly getting interested in the subject of big data. As they do so, vendors are realising the need to support this new environment with more out-of-the-box functionality. IBM has now pre-integrated Netezza appliances with SPSS so that analytical models built in that application can be more easily written back onto the data sets needed to execute them, for example.
New business opportunities are another driver of the big data movement. Pay-as-you-go car insurance is one example. Despite an abandoned trial by Aviva several years ago, deploying a tracker into customers’ cars has been picked up by the likes of Insurethebox and Coverbox. Using telematics, personalised insurance premiums are offered based around a 6,000 mile a year allowance (which can be topped up) and drivers get a dashboard showing the information, including their driving style profile. For younger drivers especially, this is proving to be the solution to sky-high insurance costs driven by simplistic actuarial models.
Richard Kellet, director of marketing for SAS UK, believes that such new business models should be a reminder to all organisations of why they need to consider big data. “You need to change the business,” he says. Just investing in scaled up processing and analytics is not enough - the internal process for deploying data in decision making has to change. “Otherwise you will spend a lot of money on a lot of kit and get little value.”
“Big data has to be about looking to exploit different types of data. If you just assemble more data in your database, it is just a bigger database,” he says. That may mean adopting a breakout model which accepts that things have changed.
“I have been talking to a lot of CIOs over the last six months. They have been delivering a data strategy for their organisation based on having a single source and fountain of all knowledge. To enable this new world of big data, they have to throw that model away and integrate new sources they don’t own, that have different levels of quality and perceived value. That is very discomforting,” says Kellet.
To add to their unease, he underlines that “big data in its own right is pointless without achieving high impact outcomes.” Changes to data management have to be accompanied by business changes, too. Some of those are already underway, such as the move towards evidence-based decision making.
Another issue will be the willingness of a business to adopt a more “test and learn” approach. The nature of big data sources means data scientists just have to go and look, then try out a model, often repeating the process multiple times before emerging with something of significant value. Like Clegg, Kellet points to the way Catalina Marketing now works, applying SAS to its model building and driving out new models in 60 seconds, rather than five hours. “Its modellers have gone from delivering 40 to 50 operational models per year to around 600,” he notes.
Kognitio has been driving similar performance improvements as “business users are seeing that they need to do more with less,” says Sean Jackson, vice-president of marketing. “They want their IT colleagues to help them do whatever they need, whether that is a marketing programme, revenue assurance or pricing.”
Each of those lines of business is sitting on sizeable data volumes that have typically not been usable in the past, or which were only analysed using sample extracts. “They know the answers are in there, somewhere,” says Jackson. One feature of the big data movement that he highlights is a tendency to look for support outside of the conventional data management environment, which may also mean outside of the business itself via a hosted solution.
At the same time, Jackson insists that, “technology doesn’t let you easily get to what you want to find out.” Just like an elephant, big data is a lot to move around and has underlying characteristics that mean it cannot be made to go in certain directions. Part of his company’s solution involves the support of analysts to help get query chains moving and turned around, rather than just leaving it to the end-user. The appliance can also be dropped in behind the firewall (accompanied by analysts if required) if the business does not want its data to leave the enterprise.
Among vendors coming at big data from a conventional data management perspective, there is a strong desire to normalise the subject. Big data is not so very different nor so challenging, they argue. Get among big data natives and the feeling is very different - almost religious, as one observer points out. They fervently want to believe that an elephant can fly. But if it is your business putting money into big data, it is best to check first what type of animal is taking to the trampoline.
Big data definitions
“Data sets whose size is beyond the ability of commonly-used software tools to capture, manage, and process the data within a tolerable elapsed time.” - Wikipedia
“Big data is characterised by four dimensions: volume, variety, variability and velocity.” - Nick Millman, senior director, Accenture Information Management Services UK & Ireland
“Big data is not just about scale - it is more about understanding how new technologies and approaches are coming together to allow you to do things with data in a different way.” - Richard Kellet, director of marketing, SAS UK
“The number of customers you have may not be growing, but the amount of data you hold on them is. The more you know about them, the better you can target them.” - Dai Clegg, business analytics product marketing leader, EMEA, IBM
“Big data is an expression created to build excitement around products that were already there. Everybody has had tons of data, they have just not understood what they can do to mine and analyse it.” - Sean Jackson, vice-president of marketing, Kognitio
Big data practitioners
Getting the right box to handle big data is only the start of the solution. “There are big implications for the talent that organisations need around. They will need a different profile to the traditional roles of data administrators, data engineers and data scientists,” says Nick Millman, senior director, Accenture Information Management Services UK & Ireland. “That is a challenge because there are not lots of those people around in the employment market.”
Indeed, McKinsey has estimated that in the US alone, between 140,000 and 190,000 data scientists and 1.5 million data-savvy managers are required to exploit the potential of big data. Given the already limited supply of data-literate graduates into the commercial sector, this human resources gap could become a major issue.
Recruiting data scientists is not cheap, even where it is possible. Added to the scale of the processing and analytical resource required, it makes big data a proposition mainly for organisations with big pockets.
But not everybody is over-faced by the human resources challenge. Dai Clegg, business analytics product marketing leader, EMEA, IBM, points to a historical prediction about the growth of the telephone. “Somebody said that by halfway through the twentieth century, if telephone usage continued to expand at that rate, one third of the population would need to be switchboard operators.”
Clegg points out: “That didn’t happen because technology arose to automate their job. The other side of the big data challenge is about improving the quality of the tools and toolkits.”