Big data is like teenage sex: everyone talks about it, very few really know how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...
I realised recently I’d been doing it full-on for over 20 years according to one definition, and patchily for 10 years according to another, all the while learning from my mistakes and coming to understand what makes it really great. It’s probably worth pointing out that the analogy sadly finished in the last paragraph.
So, what are the competing definitions? Let’s start with the two main strands. Firstly, the technical, which is generally recognised as collecting, storing, processing and analysing vast amounts of data on lots of cheap hardware, for example Hadoop running on multiple Linux PCs. Size is key here, as this volume of data will not fit in a conventional relational database system. The second strand is the marketing/press aspect. This is, in essence, data analysis on any size of data. Statisticians, operational researchers, data miners, predictive analysts, data scientists and others have been doing this for years; it just now has a trendy new banner.
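To make the technical strand a little more concrete: the programming model behind Hadoop is MapReduce, where work is split into a map step that can run in parallel across many cheap machines and a reduce step that combines the results. Here’s a minimal single-machine sketch of that model in plain Python, counting words; it’s an illustration of the idea, not real Hadoop code.

```python
# Minimal sketch of the MapReduce model behind Hadoop (hypothetical,
# single-machine illustration; real Hadoop shards this across a cluster).
from collections import defaultdict

def map_phase(lines):
    """Map: emit (key, value) pairs -- here, (word, 1) for each word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: combine all values sharing a key -- here, sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return counts

if __name__ == "__main__":
    corpus = ["big data is big", "data beats opinions"]
    print(dict(reduce_phase(map_phase(corpus))))
    # {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinions': 1}
```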
These different definitions often cause confusion when they get mixed up. The good news is that, as long as you’re aware of both, you won’t go far wrong. The even better news is that the keys to successful big data are the same under either definition. So, I hear you ask with bated breath, what are the keys?
Number one, have an actual business problem to solve. For instance, if you want to reduce churn in banking, make sure churn is clearly defined (eg, a balance under £100 and no activity for more than 90 days; the sketch below makes this concrete). There are always business problems to solve; the key is identifying the right one. Secondly, make sure there are the ways and means to enact the solution. If you’re detecting unprofitable customers, is there a desire to address the root cause of the unprofitability?
This can be tough, as it will require a business change and quite possibly a cultural shift for your organisation. It’s not unknown for this stage to be ignored or left until the end, but without proper sight of the business change early on, it’s easy to go in the wrong direction and end up with an accurate but useless piece of analysis.
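Returning to the churn definition above, here’s a minimal pandas sketch of how such a rule might be encoded as a flag. The column names and figures are hypothetical, purely for illustration, not a real banking schema.

```python
# Hypothetical sketch: flag churned customers using the definition above
# (balance under £100 and inactive for more than 90 days).
# Column names and data are assumptions for illustration only.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "balance_gbp": [50.0, 2500.0, 20.0],
    "days_inactive": [120, 5, 200],
})

customers["churned"] = (
    (customers["balance_gbp"] < 100) & (customers["days_inactive"] > 90)
)
print(customers)
#    customer_id  balance_gbp  days_inactive  churned
# 0          101         50.0            120     True
# 1          102       2500.0              5    False
# 2          103         20.0            200     True
```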
Thirdly, ensure the results of any solution are measured. One of the main benefits of this approach is that it’s easy to determine which solutions are working well and which aren’t, enabling a test-and-learn methodology. The other benefit is that it becomes far easier to demonstrate the success and return on investment of big data to key stakeholders. This is relatively easy to achieve if it’s planned from the outset; retro-fitting measurement rarely works properly.
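As a sketch of what planned-in measurement can look like: hold back a control group when the solution is rolled out, then compare outcomes. The figures below are invented for illustration, and a real test would also check statistical significance.

```python
# Hypothetical test-and-learn measurement: compare churn between a control
# group and a treatment group that received the retention intervention.
# All figures are invented for illustration.
control = {"customers": 10_000, "churned": 820}
treatment = {"customers": 10_000, "churned": 710}

control_rate = control["churned"] / control["customers"]
treatment_rate = treatment["churned"] / treatment["customers"]

print(f"Control churn:   {control_rate:.1%}")                       # 8.2%
print(f"Treatment churn: {treatment_rate:.1%}")                     # 7.1%
print(f"Absolute reduction: {control_rate - treatment_rate:.1%}")   # 1.1%
```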
The final piece is the data. This is the foundation and building material for any big data analytics project, and it’s almost always the toughest part. That’s partly down to perception, as most people’s experience of collecting data is in Excel. And that’s easy, right? The reality is the complete opposite, for a number of reasons, chief among them the complexity of the systems the data is sourced from. In essence, the more complex your IT estate, the more costly your big data solution will be.
The other big factor is the quality of the data (think of call centre reason codes) and what it actually means. Quality is paramount, not quantity. I have built a highly accurate model of the corrosiveness of acids on human skin using just 28 records, the data having been created by a highly skilled research chemist. I’ve also seen rubbish come out of millions of records, mainly because the data was patchy and not properly understood.
My experience has shown me that, in essence, you need to understand your goals, plan for the outcome, measure your results and go for quality, not quantity. If only I’d known then what I know now…