When data wrangling vendor Trifacta was launched in 2011, it was driven by two goals - to improve the productivity of data analysts and to solve one of the biggest and hardest problems of dealing with high-volume unstructured data in a Hadoop environment. Its free desktop edition, Wrangler, has been adopted by more than 4,000 companies in 132 countries to explore, transform and join diverse data for analysis.
DataIQ caught up with chief strategy officer and co-founder Joe Hellerstein, who is also Jim Gray Chair of Computer Science at UC Berkeley, when he was in the UK for the Big Data LDN event and to launch the company’s new solution, Wrangler Edge, which is aimed at analytics teams and for the first time handles data sets beyond Hadoop. Explaining the move, Hellerstein said: “You run into tonnes of people who have data problems, but who don’t want to manage a Hadoop cluster, so we can now engage with them as well.”
“When we built Trifacta, analysts told us they were spending all their time doing data wrangling. We published that as an academic study. As we’ve brought the product to market, it’s gratifying to discover it was not just an academic problem,” he explained. “The business is going the way I expected. We’ve built the product we said we would build, the problem is the one we said we would solve, and the market for the software is what we hoped.”
Unlike many tech start-ups, Trifacta has not had any need to pivot towards a different market - instead, the problem of enabling the analysis of unstructured data has simply grown bigger and more complex, with new data sources and types adding to the challenge. The machine learning which drives Trifacta continues to be the source of its value, but Wrangler Edge has picked up on users’ desire to be able to intervene and apply their human understanding.
“Our philosophy has been that AI and machine learning are very important, but there are times when human context has a part to play. That’s why we made the interface very easy to go back and forth between the two,” explained Hellerstein.
“That said, our interface was more usable than what most of the market was offering, but was perceived as very technical by some. So we have been working hard to build a ramp between the two that is very smooth, from user-guided to technical. The middle - Expression Builder - was not there until recently for people who need to get involved and over-ride recommendations, but who are not technical. We learned from experience there was a gap there - we didn’t see that at first.”
Hellerstein brings a deep academic understanding of AI to bear on the product’s development, pointing out that a constant thematic question is the tension between model accuracy and model comprehensibility. “Models that have been working very well in recent years, like deep learning, are not humanly comprehensible once they have been trained as to why they do what they do. Simpler models, like decision trees - if x is less than this, go down this branch - are more human understandable, but have less accuracy. Depending on the context, you may want one or the other,” he explained.
“The machine doesn’t have enough signal to know the user context. The data patterns may not be relevant to the user - they need to bring their domain expertise,” he added. For example, web log data may have been used in an IT environment to predict outages and plan upgrades. If marketing wants to use the same data for customer analytics, none of that previous modelling is relevant.
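To make the decision-tree description concrete ("if x is less than this, go down this branch"), here is a minimal sketch in Python. It is not Trifacta's code: the tree, the feature names (`error_rate`, `requests_per_sec`), the thresholds and the labels are all invented for illustration, loosely echoing the web log example. It shows that every prediction from such a model comes with a trail of comparisons a person can read.

```python
# Hypothetical toy tree: every feature name, threshold and label is invented.
TREE = {
    "feature": "error_rate",
    "threshold": 0.05,
    "below": {"label": "healthy"},
    "at_or_above": {
        "feature": "requests_per_sec",
        "threshold": 1000,
        "below": {"label": "investigate"},
        "at_or_above": {"label": "likely outage"},
    },
}


def classify(node, row):
    """Walk the tree and return (label, list of human-readable steps)."""
    steps = []
    # Internal nodes carry a feature and threshold; leaves carry only a label.
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        value = row[feature]
        if value < threshold:
            steps.append(f"{feature}={value} < {threshold}")
            node = node["below"]
        else:
            steps.append(f"{feature}={value} >= {threshold}")
            node = node["at_or_above"]
    return node["label"], steps


label, steps = classify(TREE, {"error_rate": 0.2, "requests_per_sec": 1500})
print(label)
for step in steps:
    print("  because", step)
```

A trained deep learning model offers no equivalent trail, which is the comprehensibility side of the trade-off Hellerstein describes. The same tree would also be of little use to a marketing team asking different questions of the same logs, which is where the user's domain expertise comes in.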
Says Hellerstein: “We knew from the beginning that the machine learning was an aid, but the user had to drive it. We’ve never tried to expose the machine learning models explicitly, we use the visualisation tools as the way to feed back.”
Hellerstein is an alumnus of Berkeley, where his peer group included the founders of Cloudera and VMWare and his Masters supervisor was the godfather of relational databases, and is now a professor there. Understanding how to port research learnings into commercial outputs is baked into his approach. “Role modelling around being both an academic and an entrepreneur is very strong,” Hellerstein explained.
That said, deciding how to commercialise those research ideas can create a tension. Silicon Valley has a very loud voice with software developers at the likes of Google and Twitter keen for solutions to the problems they face. Pure computer science might view those challenges as equivalent to the ones faced in the commercial world.
But Hellerstein argues that “producing software that is relevant to the needs of Google engineers is completely different from producing software that is relevant to a business analyst in a financial institution or a person at the Centre for Disease Control looking to take incoming data and do rapid response. They are completely different users with different needs.” He added: “I’ve learned a great deal from stepping out of our tech bubble of only talking to computer scientists. Talking to users in other domains has been very valuable to my research agenda.”
Just as demand for big data analytics capabilities within enterprises has rocketed, so too has the demand for academic training. Hellerstein jokes that everybody in California wants to be a computer scientist. That is visible in the demand for his entry-level database course, where student enrolment has risen from 100 to 500 in seven years and which is now delivered twice annually. Even though it is a technical speciality course, demand is now coming from across disciplines.
That is putting significant pressure on the conventional teaching model. At first, Berkeley solved the issue by recording sessions and making them available to view online. But two problems emerged from that: the first was that the videos did not have closed captions, which led to the university pulling them in case they breached disability access legal requirements; the second was that by the end of the course, only 25 students were turning up to the lecture hall. Hellerstein ruefully acknowledges that, in order to get use of the biggest lecture hall on campus, the course was running at 8am - not a popular time among students.
“I should be designing a course for online teaching, not classroom lessons that can then be watched online later. So student behaviour is changing, the volume at which we are teaching is changing. If we solve the problem on campus, we are solving it for a bigger community,” he noted.
This change is going even deeper. “One other thing happening at Berkeley is that we are launching an undergraduate programme on big data. There have been a lot of graduate courses for people in statistics and applied science. We’re trying to design something so that maybe 50% of all students go through it, like they go through maths. How do we slice data to a diverse community of students? That is exciting,” he said.
Just as Trifacta has brought the power of AI to fixing the core problem of data wrangling, thereby enabling the democratisation of data by removing much of the manual labour in its preparation, so Hellerstein sees a broader opportunity. As he told DataIQ: “It is our belief that data literacy and the ability to apply computational methods to data is going to become the lingua franca across many industries and intellectual endeavours, even the humanities.”