Olivier Thereaux, head of technology at the Open Data Institute, knows a lot about synthetic data. He and his team have been working on a UK government-funded research and development project on risk and data, looking closely at how access to data can be increased while managing risk and maintaining trust. That work led his team to the twin topics of anonymisation and synthetic data, which they have now been studying for the past six months.
Toni Sekinah: What is synthetic data?
Olivier Thereaux: It is the idea that you can create data that resembles the real thing. By the real thing, I mean data gathered by observing the real world or taking measurements of it. Synthetic data asks: what if you could generate or create data, through technical or non-technical means, in such a way that it resembles the real thing? You look at what characteristics there are in the real data, and you make sure that the data you create has similar characteristics.
TS: Can you give an example?
OT: Say you’ve got a list of people with their height and weight. It is sensitive data, so you don’t want to share it with others, but you do want to give them a rough idea of what that group’s typical measurements look like. You can create another data set whose entries are made up, but where the average height and the distribution of weight are very similar to the real thing. That is what you would call synthetic data.
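The height-and-weight example can be sketched in a few lines of Python. This is a minimal illustration of the idea, not any particular tool's method: the sample data is invented, and fitting a normal distribution to each column is one simple way to match the mean and spread of the original.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend these are the real, sensitive measurements (cm, kg).
real_height = np.array([162.0, 171.5, 168.0, 180.2, 175.4, 158.9, 183.1, 169.7])
real_weight = np.array([61.0, 74.5, 68.2, 85.0, 79.3, 55.8, 90.4, 66.1])

def synthesise(column: np.ndarray, n: int) -> np.ndarray:
    """Draw made-up values whose mean and spread match the real column."""
    return rng.normal(loc=column.mean(), scale=column.std(), size=n)

synth_height = synthesise(real_height, 1000)
synth_weight = synthesise(real_weight, 1000)

# Every synthetic row is fictional, but the summary statistics
# track the originals closely.
print(round(real_height.mean(), 1), round(synth_height.mean(), 1))
```

None of the generated rows corresponds to a real person, yet the averages and spread are close enough to give a collaborator a fair picture of the group.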
TS: What’s the difference between synthetic data and dummy data?
OT: Data scientists have been creating dummy data for a long, long time. Dummy data is just made-up numbers, produced so that you have a lot of data to work with. Synthetic data is a family of techniques that try to keep some of the characteristics of the real data, and that is where the difference lies.
TS: What is a use case for synthetic data?
OT: There are two big use cases for synthetic data. The first is when you have gathered, or are processing, data that is too sensitive to share, but you still want to share something so that people can create software with it. Ideally the synthetic version balances the risk: it stays similar enough to the real thing to be useful, but not so similar that you create a risk of re-identifying people or sharing secrets. The second, which for the longest time was the typical use case, is what we called dummy data: you have data but not enough, and you need much, much more to be able to test a model or test the scalability of software.
TS: How is synthetic data created?
OT: There is quite a lot of software that can create synthetic data, and a number of libraries that allow you to do it. They vary in complexity. One is called DataSynthesizer; I believe it is an open-source Python library, and it is quite mature.
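What such libraries automate can be approximated by hand. The sketch below is a simplified stand-in for that kind of tool, not DataSynthesizer's actual API: it independently resamples each column of a table, drawing categorical columns from their observed frequencies and numeric columns from a fitted normal distribution. The example table is invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# A tiny stand-in for a real, sensitive table.
real = pd.DataFrame({
    "city": ["Leeds", "Leeds", "Bristol", "Bristol", "Bristol", "York"],
    "height_cm": [170.2, 165.8, 181.0, 174.3, 169.9, 177.5],
})

def synthesise_table(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Independently resample each column from its observed distribution."""
    out = {}
    for col in df.columns:
        if df[col].dtype == object:
            # Categorical: draw values with the observed frequencies.
            freqs = df[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index.to_numpy(), size=n, p=freqs.to_numpy())
        else:
            # Numeric: fit a normal to the column and sample from it.
            out[col] = rng.normal(df[col].mean(), df[col].std(), size=n)
    return pd.DataFrame(out)

synth = synthesise_table(real, 1000)
print(synth["city"].value_counts(normalize=True).round(2))
```

Real libraries go further, for example by modelling dependencies between columns and adding privacy guarantees, which is where the complexity Thereaux mentions comes in.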
TS: What is the difference between synthetic and anonymised data?
OT: You can use the creation of synthetic data as a way to anonymise data, but it is only one of many techniques. So there is an overlap, in that synthesis can be a technique for anonymisation, but there are many other techniques for anonymisation, and there are also many other use cases for synthetic data.
TS: What else would you like people to know about synthetic data?
OT: There is a temptation, when you are given synthetic data about something, to try to derive insight from it. That would be a terrible, terrible mistake, and even more so to create policy based on it, because you are working from synthetic data and not the real thing. You have no way of knowing whether an insight drawn from synthetic data is actually true of the real data, so any such insight could be quite mistaken. The whole point of synthetic data is to create models and software; you should not be deriving insights directly from it.
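Why insights from synthetic data can mislead is easy to demonstrate. In this invented example, a naive synthesis matches each column's distribution separately, so the strong height–weight relationship in the "real" data simply vanishes; anyone studying the synthetic table would wrongly conclude the two are unrelated.

```python
import numpy as np

rng = np.random.default_rng(7)

# "Real" data: height and weight are strongly correlated.
height = rng.normal(170.0, 8.0, size=2000)
weight = 0.9 * (height - 170.0) + rng.normal(72.0, 4.0, size=2000)

# Naive synthetic data: each column matches the real marginals,
# but the columns are sampled independently of one another.
synth_height = rng.normal(height.mean(), height.std(), size=2000)
synth_weight = rng.normal(weight.mean(), weight.std(), size=2000)

real_corr = np.corrcoef(height, weight)[0, 1]
synth_corr = np.corrcoef(synth_height, synth_weight)[0, 1]

# The real correlation is strong; the synthetic one is near zero.
print(round(real_corr, 2), round(synth_corr, 2))
```

More sophisticated generators preserve more structure, but the lesson stands: whatever the synthetic data does or does not preserve, conclusions about the real world have to be checked against the real data.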