Elsevier opens data model for life sciences innovation

David Reed, director of research and editor-in-chief, DataIQ

In the 1930s, if you were visiting London and wanted to find a specific street, you had two choices - get near your destination and ask a policeman or rely on a cab driver to find it. Not until Phyllis Isobella Gross walked the city’s 23,000 streets and created the first indexed A to Z did planning a route become quick and easy.

For life science researchers, developing new research projects on therapies or materials can still feel like being a London visitor in the pre-A to Z days. Information on previous experiments exists, but is recorded and stored in a wide variety of different data formats and locations. Just assembling the background needed to justify funding can be lengthy, painstaking and often infuriating.

It is to lower such barriers to innovations in life science that The Pistoia Alliance was formed. Named for the Italian town where its first conference was held in 2009, it brings together representatives of AstraZeneca, GSK, Novartis, Pfizer and Roche, among over 80 member companies, to create a framework for pre-competitive collaboration by overcoming common research and development obstacles, especially around data, knowledge sharing and technology pilots.

That goal has just received a major boost from the decision by Elsevier, the science and health information analytics business, to donate its Unified Data Model (UDM) to the Alliance. UDM is an XML file format, originally developed by Elsevier in partnership with Roche, that helps to upload data sets into horizontal systems. Typically, it is the difficulty of integrating between in-house Electronic Lab Notebooks (ELN) and vertical systems (like those used by academics and publishers) that adds costs to projects.

“A common data model is the answer to a critical need.”

“We are very much a data company and as with any company dealing with data we had to think carefully about the risks and benefits of exposing our intellectual property,” Tim Hoctor, VP of professional services at Elsevier, told DataIQ. “That said, implementing a common data model with major pharmaceutical companies that allows them to integrate their data is the answer to a critical need.”

Using a common data model across all of the systems involved in pharmaceutical and life sciences R&D speeds up the discovery and analysis of information. It also opens the door to new approaches such as machine learning, which are well suited to this industry given the scale of the data involved. “That is what we are all trying to support so that data can be applied to studies that lead to breakthroughs in therapies and material sciences. That is to everybody’s benefit,” explained Hoctor.

When Elsevier surveyed its clients about the number of data sources they typically access for research projects, the average turned out to be three. “As we have five primary data sources, that was a cause for concern because it meant 40% of our data resources were going undiscovered. If a scientist doesn’t find the answer in those data sets, they may go on to duplicate an experiment, which costs money and takes time,” he said.

Hoctor has been the driving force behind releasing the UDM to The Pistoia Alliance, of which he is an elected board director. While he had to make the argument that releasing it into the public domain would benefit everybody in the industry, including Elsevier, the company took little persuading. With thousands of competing standards in use, moving to a shared open standard is an obvious step that everybody could buy into. He noted, “There was some caution, but nobody red-flagged it.”

“If you don’t have common data models, you can’t access the science.”

“There is more and more interest around open standards and collaboration. If you look at big pharmaceutical mergers and acquisitions, if the companies involved had common data models, it would remove the issues of bringing together and integrating the data sets from both sides,” he said. “If you don’t have that, you can’t access the science.”

Releasing additional value from existing studies is one driver behind M&A, especially with pressures on the new product pipeline making the demand for more rapid innovation even greater. Hoctor notes that pharmaceutical and chemistry companies typically work with a wide range of partners where data sharing is critical. Until now, integration was challenging.

“We have put some things into the public domain on a smaller scale before, but we had not done anything at this level,” said Hoctor. “We hope it will enable a broader impact from research.”

Knowledge and strategy director, DataIQ
David is developing the framework for soft skills and career development among data and analytics practitioners. He continues to be editor-in-chief and research director for DataIQ.