How Trifacta built the human-AI interface for data wranglers

When data wrangling vendor Trifacta was launched in 2011, it was driven by two goals - to improve the productivity of data analysts and to solve one of the biggest and hardest problems of dealing with high-volume unstructured data in a Hadoop environment. Its free desktop edition, Wrangler, has been adopted by more than 4,000 companies in 132 countries to explore, transform and join diverse data for analysis.

DataIQ caught up with chief strategy officer and co-founder, Joe Hellerstein, who is also Jim Gray Chair of Computer Science at UC Berkeley, when he was in the UK for the Big Data LDN event and launching the company’s new solution, Wrangler Edge, which is aimed at analytics teams and for the first time handles data sets beyond Hadoop. Explaining the move, Hellerstein said: “You run into tonnes of people who have data problems, but who don’t want to manage a Hadoop cluster, so we can now engage with them as well.”

“When we built Trifacta, analysts told us they were spending all their time doing data wrangling. We published that as an academic study. As we’ve brought the product to market, it’s gratifying to discover it was not just an academic problem,” he explained. “The business is going the way I expected. We’ve built the product we said we would build, the problem is the one we said we would solve, and the market for the software is what we hoped.”

Unlike many tech start-ups, Trifacta has not had any need to pivot towards a different market - instead, the problem of enabling the analysis of unstructured data has simply grown bigger and more complex, with new data sources and types adding to the challenge. The machine learning which drives Trifacta continues to be the source of its value, but Wrangler Edge has picked up on users’ desire to be able to intervene and apply their human understanding.

“Our philosophy has been that AI and machine learning are very important, but there are times when human context has a part to play. That’s why we made the interface very easy to go back and forth between the two,” explained Hellerstein.

“That said, our interface was more usable than what most of the market was offering, but was perceived as very technical by some. So we have been working hard to build a ramp between the two that is very smooth, from user-guided to technical. The middle - Expression Builder - was not there until recently for people who need to get involved and override recommendations, but who are not technical. We learned from experience there was a gap there - we didn’t see that at first.”

Hellerstein brings a deep academic understanding of AI to bear on the product’s development, pointing out that a constant thematic question is the tension between model accuracy and model comprehensibility. “Models that have been working very well in recent years, like deep learning, are not humanly comprehensible once they have been trained as to why they do what they do. Simpler models, like decision trees - if x is less than this, go down this branch - are more human understandable, but have less accuracy. Depending on the context, you may want one or the other,” he explained.
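To make the "if x is less than this, go down this branch" description concrete, here is a minimal sketch of a hand-built decision tree in Python. It is purely illustrative and not drawn from Trifacta's product: the feature names (error_rate, response_ms), the thresholds and the labels are all hypothetical. The point it shows is that every prediction comes with a trail of tests a person can read.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """One test in the tree: compare a feature against a threshold."""
    feature: str
    threshold: float
    below: Union["Node", str]        # followed when value < threshold
    at_or_above: Union["Node", str]  # followed otherwise


def classify(tree, row):
    """Walk the tree and return (label, readable trail of tests taken)."""
    trail = []
    node = tree
    while isinstance(node, Node):
        value = row[node.feature]
        if value < node.threshold:
            trail.append(f"{node.feature} = {value} < {node.threshold}")
            node = node.below
        else:
            trail.append(f"{node.feature} = {value} >= {node.threshold}")
            node = node.at_or_above
    return node, trail


# Hypothetical tree: thresholds and labels are made up for illustration.
TREE = Node(
    "error_rate", 0.05,
    below=Node("response_ms", 800, below="healthy", at_or_above="slow"),
    at_or_above="at risk",
)

label, trail = classify(TREE, {"error_rate": 0.01, "response_ms": 950})
print(label)
for step in trail:
    print("  because", step)
```

A trained deep learning model offers no equivalent of the trail list, which is the comprehensibility trade-off described above.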

“The machine doesn’t have enough signal to know the user context. The data patterns may not be relevant to the user - they need to bring their domain expertise,” he added. For example, web log data may have been used in an IT environment to predict outages and plan upgrades. If marketing wants to use the same data for customer analytics, none of that previous modelling is relevant.

Says Hellerstein: “We knew from the beginning that the machine learning was an aid, but the user had to drive it. We’ve never tried to expose the machine learning models explicitly, we use the visualisation tools as the way to feed back.”

Hellerstein is an alumnus of Berkeley, where his peer group included the founders of Cloudera and VMware and his Master’s supervisor was the godfather of relational databases, and he is now a professor there. Understanding how to port research learnings into commercial outputs is baked into his approach. “Role modelling around being both an academic and an entrepreneur is very strong,” Hellerstein explained.

That said, deciding how to commercialise those research ideas can create a tension. Silicon Valley has a very loud voice, with software developers at the likes of Google and Twitter keen for solutions to the problems they face. Pure computer science might view those challenges as equivalent to the ones faced in the commercial world.

But Hellerstein argues that “producing software that is relevant to the needs of Google engineers is completely different from producing software that is relevant to a business analyst in a financial institution or a person at the Centre for Disease Control looking to take incoming data and do rapid response. They are completely different users with different needs.” He added: “I’ve learned a great deal from stepping out of our tech bubble and not only talking to computer scientists. Talking to users in other domains has been very valuable to my research agenda.”

Just as demand for big data analytics capabilities within enterprises has rocketed, so too has the demand for academic training. Hellerstein jokes that everybody in California wants to be a computer scientist. That is visible in the demand for his entry-level database course, which is now delivered twice annually and where student enrolment has risen from 100 to 500 in seven years. Even though it is a technical speciality course, demand is now coming from across disciplines.

That is putting significant pressure on the conventional teaching model. At first, Berkeley solved the issue by recording sessions and making them available to view online. But two problems emerged from that: the first was that the videos did not have closed captions, which led to the university pulling them in case it breached disability access legal requirements; the second was that by the end of the course, only 25 students were turning up to the lecture hall. Hellerstein ruefully acknowledges that, in order to get use of the biggest lecture hall on campus, the course was running at 8am - not a popular time among students.

“I should be designing a course for online teaching, not classroom lessons that can then be watched online later. So student behaviour is changing, the volume at which we are teaching is changing. If we solve the problem on campus, we are solving it for a bigger community,” he noted.

This change is going even deeper. “One other thing happening at Berkeley is that we are launching an undergraduate programme on big data. There have been a lot of graduate courses for people in statistics and applied science. We’re trying to design something so that maybe 50% of all students go through it, like they go through maths. How do we slice data to a diverse community of students? That is exciting,” he said.

Just as Trifacta has brought the power of AI to fixing the core problem of data wrangling, thereby enabling the democratisation of data by removing much of the manual labour in its preparation, so Hellerstein sees a broader opportunity. As he told DataIQ: “It is our belief that data literacy and the ability to apply computational methods to data is going to become the lingua franca across many industries and intellectual endeavours, even the humanities.”
