Anyone in business knows that the accuracy and timeliness of the data driving their decisions are critical, and that trust in the process presenting that data is essential. Yet most data specialists, from analysts and data scientists to data quality managers, are not confident they can bring trusted data at speed to their organisation.
A new study by Talend shows that the primary concern is the inability to combine speed with integrity. Delivering intelligent and trusted data is crucial for business success in today’s fast-changing consumer landscape, yet just 11% of data specialists believe their business has both. This poses quite a problem for organisations that are increasingly turning to data and predictive modelling to aid business decision-making.
So what do you need to understand in order to get the best performance from data, and to be confident that the data you work with meets the standards required?
It all depends on the purpose for which the data is going to be used. Part of the problem comes from the volume of data that organisations are dealing with and the fact that this data has been acquired and collected in different places, at different times, with varying degrees of accuracy.
For instance, data which is very fleeting, like clicks or social media interactions, is very valuable for making an instant decision online, but is less predictive in the long term when merged with more stable structured or personal data held by the organisation. It is tempting to give undue significance to online data, as the volume of data points can be so much higher than for more persistent data, such as income or where someone lives. However, the half-life of a click is much shorter, and it represents only a single snapshot decision rather than a major lifestyle descriptor.
As growing numbers of disparate data sets are combined, confusion as to how much weight should be given to the different sources grows and the overall level of confidence diminishes. It gets even worse once modelled data is thrown into the mix.
Predictive models can be immensely useful, often making very accurate predictions or guiding knotty optimisation choices. But if confidence in the underlying data is low, the likelihood of benefiting from a decision based on it is also low.
When used in the right way, big data and predictive models can help overcome bias, make great predictions or guide difficult optimisations. The best course of action is to be confident of the data you are using in the first place, and to reduce the time your teams spend defending or re-validating it.
A confidence score is a simple figure indicating how much a piece of data can be trusted, i.e. how accurate it is likely to be. Traditionally, confidence scores described the provenance of the data: whether it is known where the data came from and whether it has been fully verified. As data has become more complex and unstructured, confidence scores have also become more sophisticated, incorporating several factors that together help to establish the reliability of the data.
These factors help to paint a more detailed picture of the data subject, and many more can be applied. Additionally, each factor can be weighted according to its relative importance to the overall score.
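As a sketch, such a score can be computed as a weighted average of per-factor scores. The factor names and weights below are illustrative assumptions, not a standard scheme:

```python
def confidence_score(factors, weights):
    """Combine per-factor scores (each 0-1) into a single weighted score."""
    total_weight = sum(weights.values())
    return sum(factors[name] * weights[name] for name in weights) / total_weight

# Hypothetical record: verified provenance, but stale and partly incomplete.
factors = {"provenance": 1.0, "recency": 0.4, "completeness": 0.8}
weights = {"provenance": 0.5, "recency": 0.3, "completeness": 0.2}
print(round(confidence_score(factors, weights), 2))  # 0.5*1.0 + 0.3*0.4 + 0.2*0.8 = 0.78
```

Tuning the weights is where domain judgement comes in: a marketing use case might weight recency heavily, while a compliance use case would weight provenance.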
The problem with modelled data is that, by its very nature, it is modelled. Errors and inaccuracies can creep in, making it at best useless and, at worst, a dangerous tool in business decision-making. That is why confidence scores are crucial for today's modelled data attributes.
There are two main pitfalls when building model accuracy and confidence scores, and both stem from the training data. The first is having too little training data. The second is having an abundance of training data that is skewed, or not representative of the population to be predicted.
In the latter case, there is a significant risk that the model will overfit and produce high confidence scores for inaccurate predictions, because the scoring population is inconsistent with the training population. It's like training a model to identify oranges and using it to predict apples.
To mitigate the risk of small training data volumes, sound statistical knowledge is key: the right approach sets upper and lower confidence levels that widen to reflect the volatility of small samples. The solution to the second issue, however, is more complex than it might first seem.
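One standard approach of this kind is the Wilson score interval for a binomial proportion, whose bounds widen naturally when the sample is small. A minimal sketch:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion.
    Wider bounds at small n reflect lower confidence in the estimate."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - margin, centre + margin)

# The same observed 80% accuracy gives very different bounds at different volumes.
print(wilson_interval(8, 10))     # roughly (0.49, 0.94)
print(wilson_interval(800, 1000)) # roughly (0.77, 0.82)
```

The same 80% hit rate warrants far less confidence from ten observations than from a thousand, and the interval makes that explicit.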
It is crucial to create a process that ensures the test data is representative of the training data and vice versa. In recent times, the flood of data has reduced the need to be strict with confidence scores and boundaries; when modelling on skewed data, however, this discipline remains imperative. Training and test data must be calibrated to remove bias.
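One simple check is to compare the distribution of a key attribute between the training population and the population to be scored, and flag segments where the proportions diverge. A minimal sketch, with invented segment names:

```python
from collections import Counter

def proportion_drift(train_values, score_values):
    """Absolute per-category gap between training and scoring distributions."""
    train_dist = Counter(train_values)
    score_dist = Counter(score_values)
    n_train, n_score = len(train_values), len(score_values)
    return {c: abs(train_dist[c] / n_train - score_dist[c] / n_score)
            for c in set(train_dist) | set(score_dist)}

# Hypothetical example: training data skewed towards one segment.
train = ["urban"] * 80 + ["rural"] * 20
score = ["urban"] * 50 + ["rural"] * 50
for segment, gap in sorted(proportion_drift(train, score).items()):
    print(segment, round(gap, 2))  # both segments drift by 0.3
```

A gap above an agreed threshold (say 0.1) would trigger re-sampling or re-weighting before the model's confidence scores are trusted.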
Having a clear overview of how data is being used and what value each attribute is delivering helps people across the organisation to measure the likely impact of the data to drive outcomes and make more informed choices.
For many businesses, trust is only half the battle. Explainability is also key. If brands are to make important decisions around pricing, qualification and risk using data science, they have to understand how models arrived at the scores they produce and how accurate the models themselves are.
Under GDPR, companies are now required to know the provenance of the data they hold and process. Furthermore, consumers have the right to know exactly what their personal information is being used for and why decisions have been made as a result. Given that customer data is a primary source of fuel for the algorithms constructed using machine learning, organisations have a legal responsibility to understand these models.
Explainability is also a case of plain and simple ethics. Understanding a model well enough to anticipate unintended consequences and potential bias that could impact vulnerable customers (or indeed any customer) is morally the right thing to do. Many organisations have come under fire for biased models, such as Amazon's AI hiring algorithm, which was found to heavily favour men for technical positions. As models become more complex, with larger numbers of features and increased feature engineering, explainability becomes more of an issue, which is why confidence scores are business-critical.
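Permutation importance is one common way to probe which features a model actually relies on: shuffle one feature's values at a time and measure how much accuracy drops. A minimal sketch on an invented toy model:

```python
import random

random.seed(0)  # deterministic illustration

def model(row):
    # Toy rule: predict 1 when income is high; clicks are ignored entirely.
    return 1 if row["income"] > 50 else 0

# Invented dataset with two features.
data = [{"income": random.randint(10, 100), "clicks": random.randint(0, 20)}
        for _ in range(200)]
labels = [model(row) for row in data]

def accuracy(rows):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(feature):
    """Accuracy drop when one feature's values are shuffled across rows."""
    values = [r[feature] for r in data]
    random.shuffle(values)
    shuffled = [dict(r, **{feature: v}) for r, v in zip(data, values)]
    return accuracy(data) - accuracy(shuffled)

print(round(permutation_importance("income"), 2))  # large drop: income drives the model
print(round(permutation_importance("clicks"), 2))  # 0.0: the model never uses clicks
```

A feature whose shuffling barely moves accuracy carries little weight in the model's decisions; a feature whose shuffling collapses accuracy is doing the real work, and deserves scrutiny for bias.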
Clearly, the creation of confidence scores can often be just as complex as building a predictive model.
When done well, combining vast amounts of data from multiple sources to create increasingly sophisticated algorithms will improve corporate performance. The key is to make sure that the methodology and approach taken within your organisation is robust.
This will all involve greater rigour and investment. But in the long term it will have an important impact on your business.
Caroline Worboys, COO, Outra
Caroline is a seasoned data practitioner. She is listed in DataIQ’s 100 most influential people in data and is also vice-chair of the DMA. She has previously led start-ups, large corporates and has founded and sold several successful data businesses.