Machine learning - focusing on code misses the big picture

Toni Sekinah, research analyst and features editor, DataIQ

According to Rangarajan Vasudevan, CEO of data strategy company The Data Team, machine learning has received a great deal of hype and media coverage in recent times. However, he feels it is important to point out that machine learning systems do not operate in a silo, and that other processes need to take place for them to be effective.


He wanted to drive home his point that “only a small fraction of real-world machine learning systems is composed of machine learning code” and that the surrounding infrastructure required is vast and complex.

That surrounding infrastructure, or pipeline, is composed of nine components, according to Vasudevan: configuration, data collection, feature extraction, data verification, machine resource management, analysis tools, process management tools, serving infrastructure and monitoring.

With regard to configuration, Vasudevan said it is not as easy as it sounds. “Being able to configure software, hardware, applications and getting that right is an extremely draining task,” he said. In the context of a financial services institution, how often to run fraud checks, or how to run A/B tests in conjunction with an ecommerce partner, are examples of configuration settings.
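To make the idea of configuration settings concrete, here is a minimal sketch in Python. The setting names, defaults and validation rules are all hypothetical and chosen only for illustration; they are not taken from Vasudevan or from any particular institution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings; names and defaults are illustrative only."""

    fraud_check_interval_minutes: int = 15  # how often fraud checks run
    ab_test_enabled: bool = False           # whether an A/B test is active
    ab_test_traffic_share: float = 0.5      # share of traffic sent to variant B

    def validate(self) -> None:
        # Reject values that would otherwise fail silently further downstream.
        if self.fraud_check_interval_minutes <= 0:
            raise ValueError("fraud_check_interval_minutes must be positive")
        if not 0.0 <= self.ab_test_traffic_share <= 1.0:
            raise ValueError("ab_test_traffic_share must be between 0 and 1")


# Example: run fraud checks every five minutes and switch the A/B test on.
config = PipelineConfig(fraud_check_interval_minutes=5, ab_test_enabled=True)
config.validate()
```

Even in a toy version like this, each setting needs a sensible default and a check on its permitted range, which hints at why getting configuration right across real software and hardware can be so draining.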

Vasudevan said that feature extraction, which involves extracting signals from the data, is the most time-consuming task in the pipeline. Using fraud detection once again as an example, feature extraction would mean identifying indicators of fraud. He said: “In many cases, those who write the algorithm need to understand the business to be able to say what feature is important.”
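As a rough illustration of what extracting signals can look like in the fraud example, the Python sketch below turns a customer's past payment amounts and a new payment into a few simple features. The function name, the feature names and the choice of signals are assumptions made for this example; in practice, as Vasudevan notes, deciding which signals matter depends on knowing the business.

```python
from statistics import mean


def extract_features(history, amount):
    """Turn past payment amounts and a new amount into simple signals.

    The feature names and the choice of signals are illustrative assumptions.
    """
    typical = mean(history) if history else 0.0
    return {
        # How large is this payment relative to the customer's usual spend?
        "amount_ratio": amount / typical if typical > 0 else 0.0,
        # Is it larger than anything the customer has paid before?
        "is_largest_so_far": amount > max(history, default=0.0),
        # How much history is there to judge against?
        "history_length": len(history),
    }


# A customer who usually spends about 30 suddenly pays 300.
features = extract_features(history=[20.0, 35.0, 25.0, 40.0], amount=300.0)
```

Here `features["amount_ratio"]` comes out at 10.0, meaning the new payment is ten times the customer's average. Whether that is a meaningful indicator of fraud is a business question rather than a coding one.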

Data verification and data cleansing are vitally important parts of the machine learning pipeline, according to Vasudevan. This process can be made more challenging if those tasked with creating the algorithm are not given a comprehensive dataset at the start.

Vasudevan recalled a time when he and his team were asked to create a defect prediction algorithm for products coming off an assembly line. However, he was only given a dataset of defective products. It took a lot of persuasion for the business leader to finally allow him access to the full dataset so he could "see what good looked like".

“That's data verification. You actually have to understand what the data is about,” he said. Vasudevan also underlined the critical importance of analysis tools such as visualisations, and of monitoring, for detecting and rectifying problems. The CEO stated that a heavy focus has been placed on machine learning, to the detriment of other components of the data science pipeline.
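The defect-prediction anecdote suggests one very simple verification step: checking that the data contains examples of every outcome the algorithm is meant to tell apart. The Python sketch below is one hypothetical way to do that; the function name and the label values "defective" and "good" are assumptions made for illustration.

```python
from collections import Counter


def check_label_coverage(labels, expected=("defective", "good")):
    """Return the expected label values that never appear in the data."""
    counts = Counter(labels)
    return [value for value in expected if counts[value] == 0]


# A dataset containing only defective products, as in the anecdote.
missing = check_label_coverage(["defective", "defective", "defective"])
```

For this dataset `missing` is `["good"]`, flagging that there is nothing to show "what good looked like". A check like this only catches the most obvious gap; it is no substitute for understanding what the data is about.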

However, does his list of nine components give a representative picture of the landscape? Or are there other neglected aspects of the pipeline whose value needs more recognition? If and when the lustre of machine learning and artificial intelligence begins to dim, the necessity of the other parts of a machine learning system may become clear.

 
