Data scientists and data engineers are essential for gaining insights from big data. But what exactly do these two roles involve and how much overlap, if any, is there? Fundamentally, how can and do they work together and what does the future hold for further merging or separating of these two massively important data jobs?
These were questions posed to data scientist Dr John Sandiford and data engineer Peter Hanlon, two data professionals with a wealth of experience. The forum was a panel discussion hosted by Kubrick Group, chaired by Zahid Qayyum, R&D principal engineer, innovation analytics at P&G.
Sandiford works for Bayes Server as the chief data scientist/CTO and also gets involved in business development. “We build AI software covering a wide variety of aspects from predictive analytics to prescriptive analytics, diagnostics and anomaly detection,” he explained.
Hanlon is the director of monetisation at the Telefónica group, looking after all the B2B and B2C customer-facing products. He said: “One of my recent things has been building chatbots for Telefónica. Think of it as our version of Siri.”
Despite their disparate backgrounds, both Sandiford and Hanlon were in agreement that data scientists and data engineers should be working in partnership rather than against each other.
When Hanlon was asked how the distinction between scientists and engineers within his team was made, he responded that he runs a completely mixed team with everyone “just working together.” However, he did point out that the scientists “maybe” do more of the algorithm development and the engineers more of the “lifting and shifting.” He added that he sees great success when data scientists and data engineers are paired together.
Hanlon gave an overview of how his team works side-by-side. He said that once they have spoken to the client and worked out exactly where the value lies and how they need to shape the proposition, the data scientists, data engineers and data architects all work on it together.
He said that the data scientists will suggest a particular type of modelling or clustering algorithm, then the data engineers will say where they will get the data from and how they are going to process it. Finally, the data architects will identify the client’s specific needs and how the data needs to be packaged in a way that is most useful to them.
Sandiford seconded that statement, saying that there is crossover between the disciplines: most data scientists will know a bit of data engineering and vice versa, and the two should work very closely together. He also said that it is “generally a bad idea” to off-shore data engineering.
Referring to the development of the disciplines, Sandiford explained that data engineering emerged from data science as big data became more complex. He said that, historically, data scientists would have read directly from systems of record, and a lot of transformation would take place to get the data into a format the machine learning or AI could use. He said: “The old school Dr Ralph Kimball data warehousing-type stuff wasn’t very effective for data science. It put everything in exactly the wrong format for a data scientist.”
Sandiford went on to say that the discipline had become so big that it had to fragment into multiple disciplines. “It is natural when things get complex that they form into multiple disciplines but, nonetheless, there is a huge crossover.”
Both Sandiford and Hanlon agreed that data scientists are vitally important at the innovation phase or “upstream”. Sandiford said that data scientists are the ones who help the business convert data into value. Therefore, it is a mistake not to involve them in the conception of the data platforms that big corporate companies spend heavily to build. Hanlon described his R&D department as fairly “data science-heavy”, with a couple of data scientists taking an idea through to a very early stage proof of concept. Following on from that, both experts stated that data engineers are essential for turning what the data scientists conceive into reality.
Hanlon gave a clear example of why data scientists need to work with data engineers. He said: “If they do the data engineering themselves, they will run into problems as they won’t have thought about how it needs to run, if it’s ever going to be a live product.”
Sandiford underlined this point by saying that data scientists need to understand what the data engineers are doing and their plans for the product to work. He said: “If they just think ‘we’re going to do a proof of concept, or get a dump from some SQL database or Hadoop’, they’ll build a model without thinking about all the other things that need to happen to put a model into production.” As such, pair working is a method that is used at Telefónica. “We tend to have greater success when we pair them together,” said Hanlon.
Sandiford said that an advantage of having scientists and engineers working together is that there is “some real structured discipline around security and integrity,” and that it can potentially free up data science time from data ingestion activity, “so perhaps the 80:20 rule will migrate to 50:50.”
However, he also highlighted a problem that can occur when people from the two disciplines are working together. Sandiford said that communication can be a challenge - he usually finds it is “a huge problem.” As a result, data engineers often don’t understand the needs of data scientists and vice versa, which can lead to them being at loggerheads. He also said that, if they work in silos, the data scientist may end up getting aggregate data when they wanted raw data.
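A small illustration of why that distinction matters, using made-up numbers rather than anything presented by the panel: two hypothetical customers can look identical once their data is averaged, even though the raw readings are very different. The customer names and values below are assumptions chosen purely for illustration.

```python
from statistics import mean

# Hypothetical raw readings for two customers (made-up numbers).
raw = {
    "customer_a": [10, 10, 10, 10],
    "customer_b": [0, 0, 0, 40],
}

# What a data scientist receives if only aggregates are handed over.
aggregated = {name: mean(values) for name, values in raw.items()}

# What can only be computed from the raw data: the largest single reading.
spikes = {name: max(values) for name, values in raw.items()}

print(aggregated)  # both customers have the same average
print(spikes)      # but only one of them has a large spike
```

With only the averages, the spike for the second customer is invisible, which is the kind of detail an anomaly detection model would need.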
Hanlon said that because both disciplines use a lot of the same technologies, the key difference between them, for him, is the mindset and the focus. He said: “Data scientists need to know all of the tools [like Python] as well. It’s just their focus will be on building models, whereas data engineering is more about building systems.”
Hanlon spelt out the tasks of a data engineer - sourcing, staging, cleansing, transforming, ETL and getting data ready to be consumed by somebody else. He added that the tools and formats they would need to work with are Python, Bash, JSON, CSV and various serialisation formats.
With reference to the background of his team, Hanlon said that the majority of his data scientists come from academia, with 90 per cent having PhDs, while the data engineers come from a range of backgrounds, “some from ETL and some from Java.”
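As a rough sketch of what those tasks can look like in practice, the Python below sources rows from CSV, cleanses them and serialises the result as JSON lines. It is not Telefónica's pipeline: the column names, sample data and function names are all assumptions made up for illustration, and only the Python standard library is used.

```python
import csv
import io
import json

# Hypothetical source data; in a real pipeline this would come from a file or system.
RAW_CSV = """customer_id,country,monthly_spend
1, uk ,20.50
2,ES,
3,es,35.00
"""

def extract(text):
    """Source: read rows from CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Cleanse and transform: drop rows with no spend, tidy whitespace, case and types."""
    cleaned = []
    for row in rows:
        spend = row["monthly_spend"].strip()
        if not spend:
            continue  # skip incomplete records
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "country": row["country"].strip().upper(),
            "monthly_spend": float(spend),
        })
    return cleaned

def load(rows):
    """Load: serialise to JSON lines, ready to be consumed by somebody else."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)

records = cleanse(extract(RAW_CSV))
output = load(records)
print(output)
```

Splitting the work into separate extract, cleanse and load steps is one common way to keep each stage testable on its own; a production system would also need the scheduling, monitoring and security that the panel placed in the data engineer's remit.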
Asked whether companies should outsource their data expertise or grow it internally, Hanlon said he is of the view that internal is the way to go. He understands the importance of targeted help, but for him, there is no substitute for business experience. He said: “Ultimately, the people that you’ve had in the business are the people that have some ownership and they’re the people that you need to bring it forward.”
According to Sandiford, certain skills belong in separate camps, while others are used by both. In his view, decision automation, probabilistic programming, as well as machine learning, reinforcement learning and deep learning all come under the data science umbrella. Testing, software engineering, alerting and monitoring, and data transformation fall into both camps. Data architecture, security, backup and streaming, and batch processing are in the domain of data engineering.
When the panel was asked where data architects fit into the equation, both experts replied that developments in software have created solutions to problems posed by data architecture. In the last two years, the architectural issues have become a lot easier for data scientists. Said Sandiford: “Data scientists needed to understand things like how they will deploy their model or are they going to have web APIs so people can make live predictions, hitting their models or the ability to put derivative data back into the data master.” He said that the likes of Azure and other services in the Microsoft stack are taking care of the architectural issues.
Hanlon responded that he considers all of his data scientists and data engineers to be part data modellers, and that data modelling and data architecture are still important these days. Telefónica uses Amazon Web Services which automatically solves a lot of data architecture problems. “There’s a template out there and it all plugs together nicely,” he said.
Hanlon said that for data engineers the principal software packages four or five years ago would have been Hadoop and Java MapReduce. But there has been a transition to “much more efficient tools”, such as Spark, Scala or Python, driven by vendors like Hortonworks and Cloudera.
He also said there will be increased democratisation which will lead to data-science-as-a-service and data management platforms that will be important for small to medium-sized businesses. “You won’t need to build your own. There will be tools and services and software out there that you can buy,” said Hanlon.
When asked about data democratisation, Sandiford said that it is a challenge that needs to be addressed. He said: “It needs to be thought out earlier, rather than later because you can’t open up a system and expect 10,000 people to be able to execute complex SQL queries on your data. You’ve got to think ahead.”
He pointed out possible issues such as security and data breaches and called for a middle ground. “You don’t want to lock it all down. People want to do stuff, they want to self-serve and you want to give people the power, but you need to think carefully about how you do that,” said Sandiford.
In fact, Hanlon said that lack of access to data is the number one barrier in his area. “We try to build globally repeatable solutions that we can drop into every Telefónica operating business. Just getting access to the data is definitely the single biggest blocker,” he said.
Sandiford was asked for advice on how to become a data scientist or data engineer. Qayyum pointed out the pertinence of this question in the context of an acute skills shortage in the sector. Sandiford replied that anyone can look at jobs boards to see the skills that data-driven companies require. He added that, ultimately, if someone has an interest in working in data as either a scientist or an engineer, “where there’s a will, there’s a way.”