Confusion reigns over what the job of a data scientist actually involves. As you will rapidly discover if you look into it, most of what gets done under this term is in fact conventional data analytics. Taking a set of data, testing a hypothesis, then building a model or segmentation to apply to a new data set - these are well-established practices.
So it is handy that the Home Secretary has just provided a perfect description of what true data science is really about.
The fact that she did so in the context of defending the right of the security services to access communications data in order to track terrorists - and claimed ordinary citizens suffer no invasion of their privacy as a result - is a subject for another day.
Here’s how she defined the work of GCHQ: “Signals intelligence relies on automated and remote access to data on the internet and other communications systems. Computers search for only the communications relating to a small number of suspects under investigation. Once the content of these communications has been identified, and only then, is it examined by a trained analyst.”
Now think about how that approach might apply to the commercial realm. The business sets out its targets, such as understanding how brand reputation is being influenced by followers (“suspects”) or how TV ads are driving searches that lead to website visits. Big data is pulled in for those areas of activity and machine learning systems are used to grind through it. When the black box spits out something, the data scientist who set it up will review the result for significance, reliability and potential value.
Crucially, these practitioners are not assuming that there will be a meaningful answer or that the answer will lie within pre-defined parameters. Instead, they are looking for correlations that indicate where an impact has taken place. Often these are proxy measures, such as gaining a follow or like, rather than direct ones, such as a download or registration. In the same way, GCHQ is searching huge data sets and building a picture of the links between people and how content is being disseminated.
That is where the risks to ordinary citizens arise, of course, since an innocent worshipper may visit the same mosque as a radicalised fundamentalist without knowing it and so be identified as within their sphere of influence. Once placed within such a network, chances are your communications data will end up in front of one of those analysts.
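To make the point concrete, here is a minimal sketch of how linking people by shared places sweeps in bystanders. It is an illustration only, not a description of how GCHQ or any real system works: the visit log, the names and the `min_shared` threshold are all invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical visit log of (person, place) pairs; every name is invented.
visits = [
    ("suspect", "venue_a"),
    ("suspect", "venue_b"),
    ("bystander", "venue_a"),
    ("associate", "venue_a"),
    ("associate", "venue_b"),
    ("stranger", "venue_c"),
]


def co_presence_links(visits):
    """Count, for each pair of people, how many places they have in common."""
    people_at = defaultdict(set)
    for person, place in visits:
        people_at[place].add(person)
    shared = defaultdict(int)
    for people in people_at.values():
        # Every pair seen at the same place gains one shared-place count.
        for a, b in combinations(sorted(people), 2):
            shared[(a, b)] += 1
    return dict(shared)


def linked_to(target, links, min_shared=1):
    """Return everyone sharing at least `min_shared` places with `target`."""
    found = set()
    for (a, b), count in links.items():
        if count >= min_shared and target in (a, b):
            found.add(b if a == target else a)
    return found


links = co_presence_links(visits)
loose = linked_to("suspect", links, min_shared=1)   # one shared place is enough
strict = linked_to("suspect", links, min_shared=2)  # require repeated overlap
print(sorted(loose), sorted(strict))
```

With a single shared place as the test, the bystander lands in the suspect's network alongside the associate. Raising the threshold removes the bystander here, but only because this toy data was built that way. Co-presence remains a correlation, and no threshold turns it into evidence of influence.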
Commercial data scientists can also fall into the same trap. The general assumption is that the stars within a social network exert a causative influence over their followers - that if Cara Delevingne posts a photo of a new outfit, millions of her fans will go and buy something similar.
And this is where data science is defaulting to a model from data analytics, precisely by inferring causation from correlation. While some of Cara’s followers will want to own what she owns, many will take it more as a tip and translate it into their own look. Social network influence models have yet to work out how to handle this “mutation”. But that is precisely what the true data scientists could - and should - be working on.
It’s not easy - even the highly resourced spooks miss a lot of these triggers, as the recent example of young men from Cardiff travelling to fight in Syria has shown. What that proves, however, is that it is not about assembling ever more data to improve the model. It is about building human intelligence into the machine.