Notonthehighstreet: why a better data architecture is not on its premises
With a unique selling point of curating all of the sellers and products on its marketplace website, notonthehighstreet bills itself as the home of thoughtful gifts. It began life in the kitchen of one of its co-founders in 2006 and has always lived online, with the main website being a Ruby on Rails app backed by a MySQL database. This is what the engineers would use to query the products and the data, explained Ben Davison, technical lead of the data team at notonthehighstreet.
The team would run queries off the production database to produce reports and CSVs, which was adequate “when you are just trying to get through the third year”. “That was super slow and, coming towards 2010, the data started to grow, the site started to grow, so the company couldn’t keep having the engineers run loads of queries on the production database. It was just going to die after a while,” said Davison.
"The database was just going to die after a while."
To solve this, the company decided to build a completely custom Ruby on Rails MySQL data warehouse which, at the time, “was quite nifty. It had this custom Ruby DSL which would allow you to pull out facts and dimensions from the production database, put it in another MySQL database and then it also had a custom BI element to it,” explained Davison.
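The custom Ruby DSL itself isn't public, but the idea it implemented, declaring which columns become facts and dimensions and copying them from a production database into a warehouse, can be sketched in Python. This is a minimal illustration, not notonthehighstreet's code: it uses an in-memory SQLite database in place of MySQL, and all table and column names are invented.

```python
# A minimal sketch of a facts-and-dimensions extract. The real system
# used a custom Ruby DSL against MySQL; here SQLite stands in for both
# databases, and the schema is invented for illustration.
import sqlite3

# Declarative config standing in for the DSL: each warehouse table is
# mapped to a production source table and the columns to pull out.
STAR_SCHEMA = {
    "dim_product": ("products", ["id", "name"]),
    "fact_orders": ("orders", ["id", "product_id", "amount"]),
}

def extract(prod: sqlite3.Connection, warehouse: sqlite3.Connection) -> None:
    """Copy each declared fact/dimension table from production into the warehouse."""
    for target, (source, cols) in STAR_SCHEMA.items():
        col_list = ", ".join(cols)
        rows = prod.execute(f"SELECT {col_list} FROM {source}").fetchall()
        placeholders = ", ".join("?" for _ in cols)
        warehouse.execute(f"CREATE TABLE IF NOT EXISTS {target} ({col_list})")
        warehouse.executemany(
            f"INSERT INTO {target} VALUES ({placeholders})", rows
        )
    warehouse.commit()
```

The appeal of the approach is that adding a new fact or dimension is a one-line change to the declaration rather than a new script.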
But by 2015, the company wanted to make the most of its data and sought to bring in Google Analytics data as well as data from upstream systems, and so looked to Amazon Redshift. It ended up with a MySQL data warehouse as well as a production data warehouse (internally called Chicago) which would write everything to S3. Over time, more and more processes were added, including API integrations for the analytics and CRM providers.
"The engineers had coded themselves into a corner."
However, there was a problem. “The engineers had coded themselves into a corner. They had to maintain Python scripts which could fail at any time and all of these scripts relied on each other and so they were spending so much time just keeping all the stuff running.”
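The failure mode Davison describes, a chain of scripts that depend on one another with no isolation between steps, can be sketched in a few lines of Python. The step names here are invented; the point is that one exception anywhere halts everything downstream.

```python
# A minimal sketch of a tightly coupled pipeline: each step reads state
# written by the step before it, with no retries and no isolation, so a
# single failure stops the whole chain. Step names are invented.

def pull_orders(state: dict) -> None:
    state["orders"] = ["order-1", "order-2"]

def sync_crm(state: dict) -> None:
    # Depends on pull_orders having populated state["orders"] first.
    state["crm_synced"] = len(state["orders"])

def build_report(state: dict) -> None:
    state["report"] = f"synced {state['crm_synced']} orders"

PIPELINE = [pull_orders, sync_crm, build_report]

def run(state: dict) -> list:
    """Run steps in order; any exception halts everything downstream."""
    completed = []
    for step in PIPELINE:
        step(state)  # no retry, no isolation between steps
        completed.append(step.__name__)
    return completed
```

Orchestrators such as Airflow exist precisely to make these dependencies explicit and recoverable, which is relevant to the tooling the team chose later.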
When Davison joined the retailer in February of this year, brought in by Andrew Thomas, the data director, he discovered the architecture was falling apart.
“We couldn't get anything done, we couldn't get new data sources in. We are still spending a massive amount of time keeping all these things running, having to keep API integrations up to date,” he remembered.
It took eight to nine hours to recover from a failure - and it failed a lot.
The system was also very fragile: if new data came in that broke something, the entire data pipeline would be down for a day. By way of comparison, he said that in 2010 the MySQL data warehouse took between 20 and 30 minutes to run. By the time he joined, the system would take eight to nine hours to recover from a failure, and it failed quite a lot.
Along with Davison, the company had also hired data analysts and data scientists, but it wasn’t getting true value from them because of the data architecture. It needed a new tool that fulfilled three requirements: decoupled storage and compute, ease of maintenance, and lower cost than the existing setup.
They chose the data-warehouse-as-a-service platform Snowflake because it met all those requirements as well as having “really good access control.” The notonthehighstreet data team also outsourced to Alooma, a data pipeline integration platform.
"We don't run anything. We don't host anything."
They also brought in an open source tool called dbt that Davison believes will be the next Airflow, as well as Astronomer to run hosted Airflow, and Tableau for business intelligence. In the future, Davison will be looking to bring in Spark, as his data scientists tell him every week that they need it.
The advice he gave to others on new data architecture was to outsource where possible: “We don't run anything, we don't host anything, we just went and got other people to do it for us.”
Benjamin Davison was speaking at a Snowflake event.