Joining Facebook is quick and easy - that is why it has been able to grow so rapidly and dominate the world of social networks. What is not easy for the company is fixing the data quality problems which it has allowed to be created, as David Reed finds out.
Facebook gained a lot of media coverage in September for reaching a significant milestone - the social network had registered one billion users. The figure means one seventh of the world’s population has created an online account. It is an astonishing growth rate, given that the network was only launched in 2004 and two years ago had 500 million users.
What would make the figure even more impressive would be if it were accurate. Despite being widely reported as a simple fact, the reality is more complicated - Facebook might have reached the billion mark some months ago, or it may still need another 87 million users in order to hit that target.
In its financial filing to American authorities in June, the social network made a surprising admission, namely that as many as 8.7 per cent of accounts could be duplicates, fakes or mis-categorised (see panel). Given the pressure on Facebook to monetise its users by selling more of them to advertisers, this looked like an admission about wastage which would have been unacceptable to media buyers even in the glory days of broadcast television.
Even more surprising was that Facebook is in effect just guessing about the accuracy of its figures. On close reading, the filing notes that it used “an internal review of a limited sample of accounts” to come up with this risk factor in its statement. So the problem could be bigger (or smaller) than stated.
“To give them due credit, I respect their honesty,” is how Nigel Turner, VP information management strategy at Trillium Software, a division of Harte Hanks, responded to the statement. “I know all too well how many companies hide their data quality issues from public gaze.”
This could be taken as in line with statements by Mark Zuckerberg, founder of Facebook, that privacy is no longer the social norm. More likely is that in the wake of its flotation, the company is now having to address some of the internal measures and controls that longer-established businesses already have in place.
At the time of its filing, the network was reporting 901 million monthly active users. The scale of its adoption by consumers had undoubtedly helped it to achieve a $66.5 billion market capitalisation at launch. But it is also why Turner believes Facebook may have more serious questions to answer. “That market cap is predicated on the numbers of active users, which could have been inflated because it has got suspect accounts. So the company may not be worth what investors are paying for it,” he says.
Investors already appear to have been spooked by the numbers coming out of the business, leading to a $25 billion fall in its current book value. Much of that relates to the challenges for the network in maintaining its usage and growth rates while also maximising ad revenues. Analysts may not yet have made the link with data quality, but it may not be long until they do.
“I don’t think they really know how many users they have got,” notes Colin Rickard, enterprise and channel sales director at Experian QAS. “There are things they should do, like having an internal change of culture so that data gets attention at a senior level. There is an underlying need for better data management. That is a business maturity issue.”
It is fair to say that Facebook has achieved phenomenal growth for a company founded by people who were in their twenties at the time and are barely out of them now. With the flotation will have come a raft of more conventional management processes (and older people to implement them). Even so, social networks are at the leading edge of the digital revolution and are unlikely to have embedded “old economy” activities like data validation and correction.
Rickard suggests that fixing the problem at Facebook could involve adopting techniques that are themselves at the cutting edge of data quality. “It has got a lot of unstructured data about its users and there are companies which can help it to understand the links between them. If you look at what people do on the network, there are some very interesting things you can use for validation, for example the way teenagers behave will have a pattern,” he says.
Facebook’s data quality problems start at the outset. When registering to create an account, users provide name, email address and date of birth. While DOB is an extremely powerful piece of information for matching records, it can only be used in conjunction with other elements that help to filter out multiple or inaccurate data. The absence of a real-world address at sign-up significantly hampers this, since it precludes instant validation at point of capture or subsequent batch cleaning and deduplication.
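To illustrate how date of birth works alongside other elements, here is a minimal Python sketch of one common deduplication approach: group accounts by DOB, then compare names only within each group. It is not Facebook's method. The record layout, the `likely_duplicates` function and the 0.85 similarity threshold are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def likely_duplicates(accounts, threshold=0.85):
    """Return pairs of account ids that share a DOB and have similar names.

    DOB is used only to narrow the search; the name comparison decides
    whether two accounts in the same DOB group look like the same person.
    The threshold is an illustrative assumption, not a recommended value.
    """
    by_dob = defaultdict(list)
    for account in accounts:
        by_dob[account["dob"]].append(account)

    pairs = []
    for group in by_dob.values():
        for a, b in combinations(group, 2):
            score = SequenceMatcher(
                None, normalise(a["name"]), normalise(b["name"])
            ).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


# Hypothetical records, invented for this example.
accounts = [
    {"id": 1, "name": "Jon Smith", "dob": "1990-04-12"},
    {"id": 2, "name": "John Smith", "dob": "1990-04-12"},
    {"id": 3, "name": "Maria Lopez", "dob": "1990-04-12"},
    {"id": 4, "name": "John Smith", "dob": "1985-01-30"},
]
print(likely_duplicates(accounts))
```

Accounts 1 and 2 are flagged as a probable duplicate. Account 3 shares the birthday but not the name, and account 4 shares the name but not the birthday, so neither is flagged. With a postal address available, a third element could be added to the comparison, which is the gap the sign-up form leaves open.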
There also appears to be an absence of conventional data analysis and database management. “Relying on an internal sample is not a very good way for a business with a large number of customers to act. They seem to have no ongoing measurement of data quality in place,” says Turner.
Extrapolating error rates from a sample is fraught with problems, not least that the larger the target figure, the bigger any mistakes become. So when calculating problems across a population of 1 billion, starting with a sample that could be flawed in itself rapidly escalates the difficulties. As Turner points out, the 8.7 per cent rate found by Facebook in its then 901 million users amounts to more than 78 million problematic accounts - more than the entire population of the United Kingdom.
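The arithmetic behind that figure, and the uncertainty a sample adds to it, can be sketched in a few lines of Python. The filing does not say how large the “limited sample” was, so the sample sizes below are purely hypothetical, and the normal-approximation interval is a textbook technique chosen for illustration rather than anything Facebook is known to have used.

```python
import math


def extrapolate(sample_size, flagged, population, z=1.96):
    """Scale a sample error rate up to a population.

    Returns (estimate, low, high) as account counts, where low and high
    come from a normal-approximation interval around the sample rate
    (z=1.96 corresponds to roughly 95 per cent confidence).
    """
    rate = flagged / sample_size
    std_error = math.sqrt(rate * (1 - rate) / sample_size)
    low = max(0.0, rate - z * std_error)
    high = min(1.0, rate + z * std_error)
    return rate * population, low * population, high * population


POPULATION = 901_000_000  # users reported at the time of the filing

# Hypothetical sample sizes: the real one was not disclosed.
for n in (1_000, 10_000, 100_000):
    flagged = round(0.087 * n)
    estimate, low, high = extrapolate(n, flagged, POPULATION)
    print(
        f"sample of {n:>7,}: {estimate / 1e6:.1f}m flagged accounts, "
        f"range {low / 1e6:.1f}m to {high / 1e6:.1f}m"
    )
```

With a sample of 1,000 accounts, the same 8.7 per cent rate is consistent with anything from roughly 63 million to 94 million problematic accounts. The interval covers sampling error only. If the sample itself was unrepresentative, the true figure could fall outside even that range.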
This approach also gives the appearance of managing ad-hoc in an era when businesses are trying to be more evidence-based in their decision making. If this could be an issue now, when Facebook is earning revenues from advertisers, imagine the challenge if users start to make payments via the social network. “There doesn’t seem to be a structured approach to identity verification or to deal with potential fraud,” says Turner.
As long as the social network is just a way for individuals to link up and share with each other, there may not seem to be much harm. But the UK government, for one, has announced plans to use Facebook identities as part of its Identity Assurance programme which will give access to services online. Fake accounts could lead to a host of data security and fraud issues.
As the network continues to grow, Turner says it needs to start to change. “Best practice would be to put controls and consolidation in at source. When people create accounts, they need a process to validate if that person is a real entity or to pull up an existing account. Don’t leave that to the back end where it is expensive and difficult,” he says.
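What a control “at source” might look like can be sketched in Python. This is a deliberately simple illustration under stated assumptions: the `AccountRegistry` class is hypothetical, it treats a normalised email address as the only matching key, and its validity check is a placeholder for the far stronger verification a real sign-up process would need.

```python
class AccountRegistry:
    """Toy sign-up store that checks for an existing account before creating one."""

    def __init__(self):
        self._by_email = {}
        self._next_id = 1

    @staticmethod
    def _email_key(email):
        # Assumption: addresses differing only in case or surrounding
        # whitespace belong to the same person.
        return email.strip().lower()

    def register(self, name, email, dob):
        """Return (account, created). Pull up the existing account if one matches."""
        key = self._email_key(email)
        if "@" not in key:
            # Placeholder check only; real validation would go much further.
            raise ValueError("email address looks invalid")

        existing = self._by_email.get(key)
        if existing is not None:
            return existing, False

        account = {"id": self._next_id, "name": name, "email": key, "dob": dob}
        self._by_email[key] = account
        self._next_id += 1
        return account, True


registry = AccountRegistry()
first, created = registry.register("Ann Lee", "Ann.Lee@Example.com", "1991-02-03")
second, created_again = registry.register("Ann Lee", " ann.lee@example.com ", "1991-02-03")
print(created, created_again, first["id"] == second["id"])
```

The second sign-up attempt returns the account created by the first instead of adding a new record, so the duplicate never reaches the database and nothing has to be cleaned up at the back end.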
This front-end fix need not be expensive (although funding is unlikely to be a problem for Facebook), but it would ensure that new accounts are genuine, or at least within a 1 per cent tolerance. As the network continues to grow, Facebook could then be confident that these new users are authentic. Robust and proven technology exists to support this process.
Sorting out the problems that have already been created is more difficult. Most companies with customer databases undertake regular data cleansing, deduplication and verification. Running these processes on a global database of 1 billion records has almost certainly never been attempted and would potentially cost hundreds of millions of dollars. Plenty of consultancies and vendors would be happy to make a play for the job. But could any of them actually achieve a good outcome?
“This is a good example of a very large Big Data challenge,” points out Rickard. “It is big both in terms of the numbers of users and the volume of unstructured data. It would require bringing together a number of leading edge solutions to tackle it, as well as the management challenge.”
Remarkable as the scale of Facebook’s data problems might be, the principle that caused them is not. “It is far from atypical - look at the banks. We know what happened there even though Basel II has been in place for a long time. Why didn’t that foretell the problems? Because the numbers banks were working on weren’t accurate,” says Rickard.
He notes that the insurance industry relies on the “expert judgement” of actuaries to determine whether risk profiles look right or not. In its financial filing, Facebook is essentially following the same path - it has considered the problem and come up with a value for the risk which may or may not be accurate. If nobody within the company can be certain about it, it is just as clear that nobody outside - investors, analysts, media - can be either.
For now, the continued growth and apparent commercial potential of Facebook are protecting it from pressure to change its processes. That may not last forever. And Turner for one is not surprised that the social network has found itself with this built-in problem: “It is typical of start-ups that expand in the way they have. There is probably nobody there familiar with data management - they were geeks with a brilliant idea.”