Strava gets a PB with cloud-based data warehouse
Social networks are renowned for the amount of data they amass, and sports social networks like Strava are no different. Strava's data function was buckling under the weight of its own popularity, having recently passed the milestone of two billion uploaded activities. It had a 120 TB data warehouse, 13 trillion GPS data points and 15 million uploads per week. That volume of data became very difficult to handle and process.
"Trying to access and query the data caused gridlock."
There was a bottleneck and performance was suffering because there were limited connections into the database. “We had a growing set of users trying to access and query that data and it caused gridlock and poor performance on all sides. That was a real challenge for us to deal with,” said Cathy Tanimura, senior director of analytics and data science at Strava.
Tanimura manages the analysis and data science teams and saw that database performance was so poor that people who spent all day querying would leave for a coffee break while a query ran, or even run a query overnight if they had made a mistake. She and her team would also have to ‘trick’ the database into returning query results.
Cloud-based data warehouse company Snowflake appeared on Strava’s radar because of its “compelling approach to compute and storage”. The decision to move Strava’s data to Snowflake was made at the end of 2017, implementation began in March 2018 and data had been fully transferred by June.
The transition took place with support, consultation and guidance from Snowflake, helping to unload the data from Strava’s old solution and load it into the cloud data warehouse.
According to Tanimura, the switchover was virtually seamless, despite moving jobs that were running at the time. “The analysis team just didn’t have downtime. They were asking how to adjust over the course of a week, then folks were just up and running, got up to speed and got productive quickly,” she said.
"To query one billion rows is 20 minutes not 20 hours."
The result is much faster performance. Now if a member of her team needs to query one billion rows, it will take 20 minutes to run instead of 20 hours. They are also able to change table structures without affecting other team members who are querying the data, because the work runs on separate compute clusters.
A tangible upside of now being able to query quickly and have a continuous flow of analysis is that the analysts and data scientists have been freed up to do other things. Tanimura said they have rolled out a new tool to help track athlete interactions in the app, so Strava can understand which features users like and use.
They have also rolled out a new ETL scheduling process that helps Strava be more nimble and facilitates self-service of data. Strava is also changing its email vendor, and the data from that email vendor, together with data from the warehouse, is helping to build a picture of the athletes (such as the type of athlete they are, the activities they have done in the past and whether they are a member of the premium Summit service) to improve email communications to them. Finally, the data scientists and analysts have created a ‘Year in Sport’ highlight reel video for each user.
So Strava has hit a PB with data, while helping its users track theirs.