Phil Harvey has been a data wrangler since 2008, but only recently has he begun to call himself one. He explains that a data wrangler is similar to a data engineer, whose key role is to make data available for use. “That includes finding it, working out authentication and authorisation rights to that data, all the way to manipulation and storage of that data for use by an analyst or data scientist,” he says.
Harvey offers this definition of the difference between a wrangler and an engineer: “If you asked me to be technical, a data engineer is about systematic engineering of a data system to solve a set of data problems, like data access, ingest, manipulation, quality control, etc.” By contrast, a data wrangler would do a lot more one-off tasks.
Harvey has a bachelor’s degree in artificial intelligence but no formal training in data. His preparation for a data career began when he became a programmer in 2005. “I have worked with various systems handling different kinds of data from that point,” Harvey says. He started with architectural visualisation and managing render jobs and render files - the data being images and 3D models stored in filing systems and databases. Then he worked in advertising where he handled advertising-related data and gained experience with relational databases and ETL systems.
From there he went on to become CTO of DataShaka in early 2011 and describes himself as an idealistic entrepreneur, “the kind who thinks ideas matter and a good idea is worth pursuing.” He enjoyed exploring his ideas and the freedom that came from being able to make them part of the business. Unfortunately, he found that the difficulties of building a start-up sometimes forced those ideas to take a back seat. “It’s tough and it’s difficult, but it is immensely rewarding when you find ideas that work in the market as well as in theory,” says Harvey.
DataShaka was set up to solve the “variety problem” with data. In the early 2000s, a report defined big data as having three dimensions - volume, velocity and variety. The problem with variety is that data comes in different formats and different file types. “If an advertising agency wants to track the performance of a particular advert over time, it needs to gather the delivered impressions, the clicks and the types of conversions that ad has generated. This variety of data sources creates a problem,” Harvey explains.
An absence of standardisation means that numbers which relate to each other come from multiple sources in multiple formats. DataShaka was set up to bring all of that data together by improving the way time series data for analytics is handled.
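To make the variety problem concrete, here is a minimal Python sketch. It is not DataShaka's actual method, only an illustration under stated assumptions: three hypothetical sources (impressions as CSV, clicks as JSON, conversions as tuples), each with its own date format, are folded into a single time series keyed by ISO date. Every name, format and figure in it is invented for the example.

```python
import csv
import io
import json
from collections import defaultdict
from datetime import datetime

# Hypothetical sample data: three sources, three formats (all values invented).
IMPRESSIONS_CSV = "date,impressions\n2016-01-01,1000\n2016-01-02,1200\n"
CLICKS_JSON = '[{"day": "01/01/2016", "clicks": 50}, {"day": "02/01/2016", "clicks": 66}]'
CONVERSIONS = [("20160101", 5), ("20160102", 7)]


def unify(impressions_csv, clicks_json, conversions):
    """Merge the three sources into {iso_date: {metric: value}}."""
    series = defaultdict(dict)

    # Source 1: CSV text with ISO (year-month-day) dates.
    for row in csv.DictReader(io.StringIO(impressions_csv)):
        day = datetime.strptime(row["date"], "%Y-%m-%d").date().isoformat()
        series[day]["impressions"] = int(row["impressions"])

    # Source 2: JSON text with day/month/year dates.
    for item in json.loads(clicks_json):
        day = datetime.strptime(item["day"], "%d/%m/%Y").date().isoformat()
        series[day]["clicks"] = int(item["clicks"])

    # Source 3: in-memory tuples with compact yyyymmdd dates.
    for raw_day, count in conversions:
        day = datetime.strptime(raw_day, "%Y%m%d").date().isoformat()
        series[day]["conversions"] = int(count)

    # Sort by date so the result reads as a time series.
    return dict(sorted(series.items()))


unified = unify(IMPRESSIONS_CSV, CLICKS_JSON, CONVERSIONS)
for day, metrics in unified.items():
    print(day, metrics)
```

Even in this toy case, most of the code deals with reconciling formats rather than analysing anything, which is the kind of work the variety problem creates.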
At DataShaka, Harvey was doing the core engineering of building a new time series database, sorting out core data problems, particularly around access to and conversion of data. He was also interacting with the clients and helping them to work out solutions to their data problems using the company’s technology.
For Harvey, dealing with the human aspect is the biggest challenge of data work. “In general, most people don’t know what they are doing with data. The worst case is where they think they know, but they have a false idea of what is needed. Working through that in a kind and professional manner to help them get an understanding is tricky and difficult,” he says.
Technology was always the easy part for him. Once he worked out where he had to get data from, he could figure out how to get it. When he knew what the data needed to look like for an output, he could determine how to manipulate it into that output. And if he needed to test for quality, he could decipher the steps to do so. “The challenging part is working out what is ‘correct’ from the user perspective and the human perspective,” he says.
The former CTO is about to take on another challenge as he prepares for his new role as cloud solutions architect aligned to partners at Microsoft. The role will involve helping Microsoft Partners to use Microsoft’s data platform, analytics and internet of things (IoT) suite to the best advantage for their clients.
He says: “My hope for the role is that I will get to work with a lot of interesting data products and projects across a large set of domains, so I can be exposed to many more of the data wrangling problems that I find so interesting.” With so many interesting problems in the data space, Harvey sees his move to Microsoft as a great opportunity to work on them in a large organisation, which will give him valuable experience.
Harvey recognises that the question of whether or not there is a future in being a data engineer has arisen frequently as a result of the boom in AI and machine learning. And, while he admits there is a lot that machines can do, the human parts of the job will remain the fundamental parts. “With data wrangling, the part that involves understanding people - empathy, having new ideas about what is needed on particular data sets and thinking in a more philosophical way - will become the core part of the job and that value is very difficult to replace with a machine,” he says.
To someone looking to forge a career as a data wrangler, Harvey would say: “Do something else first - if you are going into it saying, ‘I just want to work with data,’ and you don’t have any experience of how people use it and how people interact in business, you are going to lack context.” He suggests coming into data at the second or third stage of one’s career.
With his understanding of AI from his degree, Harvey can look analytically at artificial intelligence in relation to data. “A lot of the algorithms and techniques that are presented as new and special in AI and machine learning have been around for a long time. It is just the case that now we have the compute capabilities and the data volumes to do something interesting with them at a realistic speed.”
Harvey is glad that his BA in AI trained him to think in a philosophical way before he started programming, but he is concerned that this perspective is not shared by everyone coming into data. “I worry that this approach will get ignored by people with a much more engineering or scientific driven perspective,” he says.
Harvey is, however, hoping to fly the flag for the empathetic view of data in his new position: “I hope I can find some way to help people with a more holistic approach and that is where I am going to pin my colours when I join Microsoft.”