DataIQ has partnered with Tech Nation UK to peek behind the curtain of a data project and find out exactly what takes place when interrogating a dataset for insights. Diana Akanho takes us, step by step, through the stages of a data project.
Here is what Diana Akanho, senior insights manager at Tech Nation UK, had to say.
The reason we are doing this project in the first place is that we see demand for this type of information from clients, sponsors and partners, and we seek to meet that demand.
The working title of this project is ‘Exploring the landscape of UK tech jobs and skills’. It aims to inform employers, government policymakers and investors, as well as employees who may be seeking career opportunities or would like to upskill and want to understand the skills required for digital/tech roles.
The data, in this case, is a dataset purchased from Adzuna, a search engine for job advertisements. We don’t have to make any special provisions for ethical use of the data or for privacy, because it was supplied to us in aggregate format and contains no sensitive information that could identify anyone.
First, have a research question you want to answer.
The first step from a statistical analysis point of view is to have a research question you want to answer, based on background research. Then create a hypothesis that will be either validated or not through the analysis.
A possible hypothesis could be ‘the number of digital/tech roles within the UK, particularly in London, has grown over the past 5 years’.
At the start of a data project, you are looking for data to answer a question you already have, but at the same time, you are looking at the data and thinking about what conclusions you can draw from it.
When digging into the data, we’ll find the hypothesis true or false.
However, when digging into the data, we will find the hypothesis to be true or false. This will then lead to further questions we’d like to answer from the data, to either support or challenge our findings.
Before cleaning, processing or formatting the data, there are a few preliminary steps. First, I checked the size of the data. Then I did a quick scan of the data to understand the types of variables I have. This tells me what kind of data I will be working with, so I can start thinking about the methods I will use to analyse it. The data could be continuous, text-based, categorical, and so on.
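These initial checks amount to counting rows and making a rough guess at each column's type. The sketch below illustrates the idea in Python (the author works in R; the column names and sample rows here are hypothetical stand-ins for the Adzuna extract):

```python
import csv
import io

# A tiny in-memory stand-in for the real job-ads extract
# (column names and values are hypothetical).
sample = io.StringIO(
    "title,category,salary,location\n"
    "Data Scientist,IT Jobs,52000,UK~London~Greater London~London\n"
    "DevOps Engineer,IT Jobs,60000,UK~South East England~Oxfordshire~Oxford\n"
)

reader = csv.DictReader(sample)
rows = list(reader)

def rough_type(values):
    """Guess whether a column is numeric (continuous) or text/categorical."""
    try:
        [float(v) for v in values]
        return "numeric"
    except ValueError:
        return "text"

# Size of the data and the type of each variable.
print(f"rows: {len(rows)}, columns: {reader.fieldnames}")
for col in reader.fieldnames:
    print(col, "->", rough_type([r[col] for r in rows]))
```

In R the same scan would typically be `dim()` and `str()` on the data frame; the point is simply to know what you are working with before choosing analysis methods.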
I also need to know what format each column of data is in. For example, several data points were included per row in the location column – the country, region, county and city (UK~South East England~Oxfordshire~Oxford). As the dataset is UK-only, I can remove ‘country’; it adds no relevant information, and removing it reduces the size of the dataset.
I replace the tilde with a comma as I prefer working with comma-separated values. It is more common and, from my experience, easier to process.
The Adzuna dataset is the main dataset I’ll be using. In addition, we may use findings from the Employer Skills Survey and the Employer Perspectives Survey to help provide context to the findings.
Rachel Keane from DataTech Analytics is helping by providing trends on the types of digital and technology roles that she’s been recruiting for over the years. To provide further context, we also have access to data from the Office for National Statistics, which we will review to see if there is any relevant information that can support our story.
In addition, we may look at further datasets if they can provide context and fill any gaps.
A data project is not a one-person job, so I’ll be supported by a team in this process. George Windsor, head of insights, is responsible for the initial scope of the project. At a later stage more people will be involved: Lucy Cousins, data design manager, will help with the visualisation of the insights, and Bien King, programme manager/scrum master, is responsible for making sure any obstacles are overcome so the project is delivered on time.
The length of the project depends on the size of the data and tools used.
The length of this project will depend on the size of the data and the tools to be used. For example, a 1GB dataset is quite small, so it would be much quicker to process the data when doing analysis. The larger the dataset, the more processing power you need and the longer it takes to run the analysis.
This dataset has about 70 million rows and is predominantly text-based; in particular, analysing employers’ demand for skills is very text-heavy. I have been using AWS, which is good because you can scale up whenever needed.
With tools, people tend to use what they are comfortable with. I use R because it is good for statistical analysis, there are a lot of libraries available and I like it! I write scripts in R to check, clean and format the data. Because of the size of the data I was working with, I also used the command line interface to append datasets.
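Appending extracts like this usually means concatenating files while keeping only one copy of the header row. To stay consistent with the earlier snippets, here is the idea sketched in Python rather than the author's actual R or command-line workflow; the in-memory "files" are hypothetical stand-ins:

```python
import csv
import io

# Hypothetical monthly extracts, held in memory for the sketch.
extracts = [
    "title,salary\nData Scientist,52000\n",
    "title,salary\nDevOps Engineer,60000\n",
]

merged = []
for i, chunk in enumerate(extracts):
    reader = csv.reader(io.StringIO(chunk))
    header = next(reader)        # every extract repeats the header...
    if i == 0:
        merged.append(header)    # ...but we keep only the first copy
    merged.extend(reader)

print(merged)
# -> [['title', 'salary'], ['Data Scientist', '52000'], ['DevOps Engineer', '60000']]
```

On the command line the equivalent is typically a `head -n 1` for the first header followed by `tail -n +2` on each file, which avoids loading the full 70 million rows into memory at once.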