At a time when we seem to be swamped with fake news, work is being done to make it easier to filter out the fiction disguised as fact online. On the micro-blogging site Twitter, information spreads like wildfire, whether it is true or not.
At a recent event, Axel Oehmichen, a research associate at the Data Science Institute and the Department of Computing at Imperial College London, gave an account of the data science pipeline that he and his colleagues Julio Amador Díaz López and Miguel Molina-Solana followed to create a model that identifies tweets containing fake news. He also co-authored an academic article detailing the project.
"It is indeed possible to model and detect fake news."
From sentiment analysis, they found that non-fake news tweets are usually more positive, while fake news tweets tend to be a lot more negative. Their project also revealed that viral tweets containing fake news appeared to include more URLs than viral tweets that didn’t. In addition, tweets containing fake news mostly contained one mention, while other tweets contained two. Furthermore, they found that there was a higher chance of fake news originating from an unverified Twitter account. They concluded: “It is indeed possible to model and automatically detect fake news.”
In terms of the process, Oehmichen said that the first step was to collect data on Twitter, looking only at viral tweets in English, composed between November 2016 and March 2017, that contained the following hashtags and handles: #MyVote2016, #ElectionDay, #electionnight, @realDonaldTrump and @HillaryClinton. Viral tweets were defined as having 1,000 or more retweets. Of the 57 million tweets composed during that time period, 9,000 were selected. The process of collection took five months and labelling took a further month.
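The selection criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Tweet` record and its field names are hypothetical rather than the Twitter API schema, matching is done by case-insensitive substring, and the date-range check is left out for brevity. It is not the authors' collection code.

```python
from dataclasses import dataclass

# Hashtags and handles listed in the article, lower-cased for matching.
# Case-insensitive substring matching is an assumption of this sketch.
TRACKED_TERMS = (
    "#myvote2016",
    "#electionday",
    "#electionnight",
    "@realdonaldtrump",
    "@hillaryclinton",
)
VIRAL_THRESHOLD = 1000  # "1,000 or more retweets"


@dataclass
class Tweet:
    """Hypothetical tweet record; field names are illustrative only."""
    text: str
    lang: str
    retweet_count: int


def is_candidate(tweet: Tweet) -> bool:
    """Return True if the tweet is English, viral and contains a tracked term."""
    text = tweet.text.lower()
    return (
        tweet.lang == "en"
        and tweet.retweet_count >= VIRAL_THRESHOLD
        and any(term in text for term in TRACKED_TERMS)
    )


# Made-up examples, not real tweets from the data set.
tweets = [
    Tweet("Polls are open! #ElectionDay", "en", 2500),   # kept
    Tweet("Polls are open! #ElectionDay", "en", 999),    # not viral
    Tweet("Hoy votamos #ElectionDay", "es", 5000),       # not English
    Tweet("Nice weather today", "en", 12000),            # no tracked term
]
selected = [t for t in tweets if is_candidate(t)]
print(len(selected))
```

A real pipeline would also check the creation date against the November 2016 to March 2017 window.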
Manually labelling the tweets was a two-part process. The tweets were tagged as fake news, not fake news or unknown by two groups. The first group comprised students, friends and colleagues of the project authors. The second group comprised the authors themselves.
"People tweet more actively during election night and the day after."
One of the first things they noticed was a spike in the number of tweets on the night of the election and the day after. Oehmichen said: “It just turns out that people were a lot more active during election night and the day after than any other day before or after. Because of that, we have had to remove that feature from what we selected for building our model.”
The features they decided to look at came from the metadata: the location, the number of favourites, users, followers and mentions, whether the tweet is from a verified or unverified account, the number of friends that user has, and the media used in the tweet.
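They used the Kolmogorov-Smirnov "goodness-of-fit" test, which compares one set of data to a known distribution to find out whether they have the same distribution. According to StatisticsHowTo, it is commonly used as a test for normality. Here, the test was used to see whether there was a statistically significant difference between the features of fake news tweets and non-fake news tweets. Comparing two groups in this way corresponds to the two-sample form of the test, which compares two sets of data with each other rather than with a known distribution.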
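To make the idea concrete, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical cumulative distributions of two samples. It assumes the two-sample form is the relevant one, since the article describes comparing fake and non-fake tweets. The function name and the feature values are made up for illustration and are not taken from the project.

```python
from bisect import bisect_right


def ks_two_sample_statistic(a, b):
    """Return the largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values,
    # so it is enough to evaluate the gap at those points.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n  # share of a that is <= x
        cdf_b = bisect_right(b_sorted, x) / m  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up "URLs per tweet" counts for two groups of tweets.
fake_urls = [2, 3, 3, 4]
other_urls = [0, 1, 1, 2]
print(ks_two_sample_statistic(fake_urls, other_urls))
```

A statistic near 0 means the two samples are distributed similarly, while a value near 1 means they barely overlap. This sketch stops at the statistic; in practice a library routine such as `scipy.stats.ks_2samp` also returns a p-value.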
"Users that had capitals or weird characters...are more likely to propagate fake news."
Oehmichen and his colleagues performed natural language processing analysis on the tweets. He said: “For sentiment analysis, we just extracted different bits from all the text we could find as part of the metadata. We realised that users that had capitals or weird characters like exclamation points in their username had a significant chance of being people propagating fake news.” He and his colleagues also looked at the core sentiments that were captured in the expressions of the tweets. They then moved on to “more elaborate techniques which are machine learning approaches.” Oehmichen weighed the words in a tweet together to give a sentiment for the tweet, then "piled them up" for a final score. He described this in greater detail, but was aware that some members of the audience would not grasp everything, as the process was complex and involved many steps.
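The two ideas in this passage, scoring a tweet by adding up word weights and counting capitals or unusual characters in a username, can be sketched roughly as follows. The word weights, function names and example strings are all hypothetical, and the project's machine learning approach is not reproduced here. This is only a simple lexicon-style illustration.

```python
import re

# Hypothetical word weights for illustration only;
# this is not the lexicon or model used in the project.
WORD_WEIGHTS = {"great": 1.0, "win": 0.5, "rigged": -1.0, "fraud": -1.0}


def sentiment_score(text: str) -> float:
    """Add up the weights of known words; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WORD_WEIGHTS.get(word, 0.0) for word in words)


def username_features(name: str) -> dict:
    """Count capital letters and non-alphanumeric, non-space characters."""
    return {
        "capitals": sum(ch.isupper() for ch in name),
        "special_chars": sum(
            not ch.isalnum() and not ch.isspace() for ch in name
        ),
    }


print(sentiment_score("Great win!"))
print(sentiment_score("Rigged fraud"))
print(username_features("WAKE UP!!!"))
```

Counts like these could then be fed to a classifier as features alongside the tweet metadata.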
"The data is saying something, but always go back to why?"
The presentation gave the audience an insight into the use of data science to solve a problem from start to finish, and highlighted some discoveries the report authors made along the way. Oehmichen said it is imperative to keep asking the same question. “The data is saying something, but we always go back to why and every step of the way. We ask 'Why? Why? Why?' If there is no clear reason for this, we still keep it, but it is not very satisfying and we always try to do additional analysis to make sure that what we have is indeed ground truth.” Another important lesson learned was to always be aware of context when looking at any set of results.
Axel Oehmichen was speaking at fintech hub Rise.