Konrad Sippel, head of the content lab at Deutsche Boerse, has created an analytics lab that fosters collaboration between data scientists and business experts. His team at the content lab also uses data to test ideas for projects that can create value for the German company. He explained how the team of 15 - ten data scientists and five data engineers - does it.
DataIQ: How did the content lab come about?
Konrad Sippel (KS): "Two years ago, I started building up a data lab for Deutsche Boerse, getting data scientists and data engineers on board to create value from advanced analytics, machine learning and all these cool things."
DataIQ: What technology tools and platforms do you use in the data lab?
KS: "We didn't just build a team of people, we also said we needed a completely new technology stack to put this on top of. We felt that we don't just want to be explorative in terms of data scientists, but we also want to be spearheading technology development through this new team. We decided to go for a cloud set-up, we're in AWS, we use Cloudera as our cluster and we use Trifacta as our data importing, cleansing, wrangling default solution to bring data in.""
DataIQ: Can you give an example of a project you have worked on in the data lab?
KS: "Fraud detection. Identifying SWIFT messages that our customers are sending us that lead to a cash-out from the bank or the clearing house and checking for anomalies. Our risk department asked us to look into building a checker here on finding odd transactions. It's not as trivial as it sound."
"Credit card companies use AI a lot to identify fraud. Their data scientists have the benefit of having a lot of transactions and a lot of fraud, so they know what fraud looks like, and they can label their data sets nicely."
"In our case, we've been looking at seven million historical transactions and there wasn't a single case of fraud. We're trying to protect ourselves for the future and we have no idea what fraud will look like once it hits us because it’s most likely not going to look like the Bank of Bangladesh type of fraud."
"The idea is to build something that has a high chance of catching a black swan once it comes along. But, you don't know whether it is a swan or a bird or black or yellow or green, so you have to define it as odd."
"In the data science team we hired, we didn't hire financial experts. We hired data experts and they come from all sort of walks of academia. Most come from university, having worked either as professors or post-docs at some of the leading unis worldwide."
"One guy worked at Caltech and he was doing research on planets that might hold extraterrestrial life. We don't really know what these planets look like, but we do know what they don’t look like. So, he with his team at Caltech developed a machine learning algorithm about what is a weird planet - we got him to adapt that to develop an algorithm that finds what's a weird transaction."
"The key here is this goes beyond just statistical anomalies. It's the idea of combining multiple dimensions of data. So, looking not just at the amount [of the transaction], but also the time at which it is taken, the currency in which it is done, where it is going, what type of instruction it is. In total, we're looking at 17 different features in the original data set and then we combine that to a very large number of combined multidimensional features and that’s what the machine learning in the background does."
DataIQ: How did Trifacta help you with this project?
KS: "The original data set, the seven million transactions, had to come into our system somehow and it had to be formatted correctly and prepared. As with any data set that comes into our data science workbench, we run that through Trifacta and we use the front end that it provides to actually collaborate with the person who is the domain expert on the data to prepare it."
"In this case, we can get the SWIFT expert to log into the project and we can collaborate. We can really sit there and say ‘OK, what does that column mean?’"
"They might say, ‘you really need to split those two fields out in order to get that feature separated.’ We can do that right away, so we don't need IT. We just need the IT people to dump us the raw data as it is in the database. Then we load that into Trifacta, we do all the pre-science modifications. By the time the data scientist looks at it, it is already nicely labelled, clean, ready to do some good science on it."
Sippel and Jeremy Perlman, vice president EMEA at Trifacta, spoke to DataIQ at the Strata Data Conference.