Most data scientists are statistical thinkers. Fewer are geographical thinkers. But how many are comfortable combining both disciplines? When asked to visualise the results of their analysis, statistical thinkers will typically reach for a chart or dashboard, while geographical thinkers will show the results on a map. This article will discuss the challenges of working with geo-spatial data and explore two potential approaches for enabling data scientists to use location data more regularly in their analysis, helping them to deliver the best of both worlds.
When geo-spatial data is combined with other datasets, it can deliver added value and insights that are hard to spot in numbers or charts with no geographical context. For example, where data is provided against a set of addresses or postcodes, plotting it on a map can reveal patterns and trends that are otherwise difficult to see, giving a fuller picture to support the analysis.
Geo-spatial data is often split across different files that are awkward to work with. Specialist loading software is frequently needed to compile raw geo-spatial data into a useable form, data volumes can be large and unwieldy, and because geo-spatial data has often been captured for cartographic display rather than for analysis, it can be hard to analyse. Some geo-spatial datasets contain quirks that require workarounds and additional steps before use. As a result, the overall experience can be confusing and frustrating for the uninitiated.
Clearly the barriers are set high, but how can we make it easier for data scientists - and in particular, pure statistical thinkers - to become more geographically minded?
There are two potential approaches to help overcome the challenge:
The first is to equip data scientists with geo-spatial tool libraries, which are essential for anyone who wants to analyse geo-spatial data for themselves. These include Apache Sedona, Shapely and GeoPandas.
To demonstrate just how much more useable these tools have become, consider a recent example. To support a request from the Office for National Statistics (ONS) to understand how access to green space and outdoor gardens may affect Covid infections, Ordnance Survey’s data science team looked into the distribution of public green spaces in relation to postcodes.
Using a tool like Apache Sedona, it took just hours to process gigabytes of data covering over 33,000 green space sites across Great Britain and 2.3 million postcodes. Previously, such a process would have taken a week or more.
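As an illustrative sketch only - the file, table and column names here are hypothetical - a distributed distance query of that kind in Apache Sedona might look something like this:

```python
from sedona.spark import SedonaContext

# Spin up a Spark session with Sedona's spatial types and SQL functions
# registered (assumes PySpark and apache-sedona are installed).
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Hypothetical inputs: WKT geometry columns stored in Parquet.
sedona.read.parquet("postcodes.parquet").createOrReplaceTempView("postcodes")
sedona.read.parquet("greenspace.parquet").createOrReplaceTempView("greenspace")

# Minimum distance from each postcode centroid to any green space site.
# A naive cross join is shown for clarity; Sedona's spatial partitioning
# and indexing are what make queries like this tractable at national scale.
nearest = sedona.sql("""
    SELECT p.postcode,
           MIN(ST_Distance(ST_GeomFromWKT(p.wkt), ST_GeomFromWKT(g.wkt))) AS distance_m
    FROM postcodes p CROSS JOIN greenspace g
    GROUP BY p.postcode
""")
```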
In terms of geo-spatial data, there are great datasets out there for Great Britain. For example, the ONS creates nested statistical geographies that are easy to use. In addition, the OS Data Hub provides access to vector, raster and background mapping data, with 50,000 API hits for free. Lastly, shapefiles and geo-packages are great - libraries like GeoPandas (which builds on Shapely) will load these. The essential thing to remember is to check the co-ordinate reference system being used, as it’s the first thing that will trip you up.
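As a minimal sketch (the file and layer names are hypothetical), loading a geo-package and checking its co-ordinate reference system with GeoPandas takes only a few lines:

```python
import geopandas as gpd

# Load one layer from a GeoPackage (hypothetical file and layer names).
sites = gpd.read_file("greenspace.gpkg", layer="greenspace_site")

# Check the co-ordinate reference system before doing anything else: GB data
# is usually British National Grid (EPSG:27700), with distances in metres,
# while web mapping tools generally expect WGS 84 (EPSG:4326).
print(sites.crs)

# Reproject when combining with data held in a different CRS.
sites_wgs84 = sites.to_crs(epsg=4326)
```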
Geo-visualisation tools have also come on in leaps and bounds in recent years, and data scientists can let the libraries do the hard work. Tools like Folium, Plotly, Leaflet and ArcGIS Online are all worth exploring. Visualising a dataset with these tools adds a whole new dimension of insight.
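To give a flavour, here is a minimal sketch using Folium (which wraps Leaflet), continuing from the reprojected GeoDataFrame in the previous example:

```python
import folium

# Centre an interactive map roughly on Great Britain.
m = folium.Map(location=[54.5, -2.5], zoom_start=6)

# Overlay the WGS 84 polygons loaded earlier and save a standalone HTML map.
folium.GeoJson(sites_wgs84).add_to(m)
m.save("greenspace_map.html")
```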
Mastering these tools and techniques is complex and can take time. But there is another option if this first approach is out of reach.
The second approach involves transforming geo-spatial data into simpler, contextual metrics that work within existing data science toolsets. The fact is that a lot of information is effectively “hidden” within geo-spatial data. It can be seen when the data is examined on a map, but time is often needed to extract it into a useable data format - for instance, information about a property’s dimensions, its situation relative to other buildings or outside space, and details about the amenities located close by.
The good news for data scientists is that a huge amount of data capturing this information already exists, particularly at the property level. Where that data carries the right unique identifiers, it can be linked to any other dataset that uses the same identifiers - for example, the Unique Property Reference Number (UPRN) for Great Britain. Access to this data allows data scientists to perform calculations examining, for instance, how close an address is to different amenities in the local area.
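As a simple illustration (the file and column names are hypothetical), two property-level datasets that both carry a UPRN can be joined with an ordinary pandas merge, with no address matching or geocoding required:

```python
import pandas as pd

# Two property-level datasets sharing the UPRN identifier (hypothetical files).
properties = pd.read_csv("property_attributes.csv")   # includes a "uprn" column
amenities = pd.read_csv("distance_to_amenities.csv")  # includes a "uprn" column

# The shared identifier makes the join trivial and unambiguous.
linked = properties.merge(amenities, on="uprn", how="left")
```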
Once a proximity calculation has been created for one address, it’s easy to apply the same logic to other addresses and build a list - for example, calculating the distance from each property to its nearest open green space. As well as working at individual address level, it’s also possible to work at area level, by postcode, ward, county or region, allowing data scientists to map geographical data to Census data and build up a picture of the local demographics.
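Here is a hedged sketch of both steps with GeoPandas, assuming hypothetical input files in British National Grid (EPSG:27700) so that distances come out in metres, and an address layer that already carries a postcode column:

```python
import geopandas as gpd

# Hypothetical inputs, both in EPSG:27700 so distances are in metres.
addresses = gpd.read_file("addresses.gpkg")    # point per property, with "uprn" and "postcode"
greenspace = gpd.read_file("greenspace.gpkg")  # one polygon per green space site

# Attach each address to its nearest green space and record the distance.
nearest = gpd.sjoin_nearest(addresses, greenspace,
                            distance_col="dist_to_greenspace_m")

# The same result rolls up to area level, e.g. mean distance per postcode.
by_postcode = nearest.groupby("postcode")["dist_to_greenspace_m"].mean()
```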
There are pros and cons to each approach.
The first approach - encouraging statistical thinkers to think geo-spatially - offers several benefits. Chief among them is flexibility: there is virtually no limit to what can be calculated and achieved in a geo-spatial context.
Immersing yourself in new tools and techniques also builds domain expertise and a deeper understanding of where the results have come from. And, of course, visualising the data geo-spatially should deliver new insights.
Drawbacks of the approach include the fact that there’s a huge amount to learn, the data has lots of “gotchas” and quirks, and the datasets are generally large, onerous and potentially expensive to process. Whether it’s appropriate to invest the time and money required will depend on what you’re looking to achieve.
The second approach - condensing geo-spatial data into simple, contextual metrics that data scientists can use - also offers several benefits. It’s really easy to get started with: the data is provided in a tabular format that’s compatible with existing tools and datasets. This removes the geo-spatial barriers to entry, helping to accelerate geo-spatial thinking and visualisation.
In terms of downsides, data scientists will of course be limited by what data is already available in an easily consumable format and by what has already been aggregated. It’s also essential to choose trusted data sources, as it won’t be possible to unpick the underlying calculations later.
The best approach will depend on the problem that needs to be solved; there is no single right answer. It comes down to the approach you’re most comfortable with, and the level of detail and depth of understanding required to achieve your objectives.
Jonathan Simmons, head of data science, Ordnance Survey