“Fake News” at QCRI: Follow the Source

By Dr. Preslav Nakov, Senior Scientist, Arabic Language Technologies, QCRI

The “Fake News” Phenomenon

Recent years have seen the rise of social media, which have enabled people to share information with a large number of online users, without quality control. On the bright side, this has given everybody the opportunity to be a content creator and has enabled much faster dissemination of information. On the not-so-bright side, it has made it possible for malicious users to spread misinformation much faster, potentially reaching large audiences. In some cases, this included building sophisticated profiles for individual users based on a combination of psychological characteristics, meta-data, demographics, and location, and then micro-targeting them with personalized “fake news” and propaganda campaigns that have been weaponized with the aim of achieving political or financial gains.

False information in the news has always been around, e.g., think of the tabloids. However, social media have changed everything. As social media are optimized for user engagement, “fake news” thrives on these platforms, as users cannot recognize it and thus share it further (this is also amplified by bots). Studies have shown that 70% of users cannot distinguish real from “fake news”, and that “fake news” spreads in social media six times faster than real news. “Fake news” is like spam on steroids: if a spam message reaches 1000 people, it would die there; in contrast, “fake news” can be shared and eventually reach millions.

Nowadays, false information has become a global phenomenon: recently, at least 18 countries had election-related issues with “fake news”. This includes the 2016 US Presidential elections, Brexit, and the recent elections in Brazil. To get an idea of the scale, 150 million users on Facebook and Instagram saw inflammatory political ads, and Cambridge Analytica had access to the data of 87 million Facebook users.

False information can not only influence the outcome of political elections, but it can also directly cost lives. For example, false information on WhatsApp has resulted in people being killed in India, and according to a UN report, false information on Facebook is responsible for the Rohingya genocide. False information also puts people’s health in danger, e.g., think of the anti-vaccine websites and the damage they cause to public health worldwide. Overall, people today are much more likely to believe in conspiracy theories. To give you an example: according to a study, today 57% of Russians believe that the USA did not put a man on the Moon. In contrast, when the event actually occurred, despite the Cold War, the USSR officially congratulated Neil Armstrong, and he was even invited to Moscow and visited it!

Note that the veracity of information is a much bigger problem than just “fake news”. It has been suggested that “Veracity” should be seen as the fourth “V” of Big Data, along with Volume, Variety, and Velocity.

Our Solution: Focus on the Source

Previous efforts towards automatic fact-checking have focused on fact-checking claims, rumors, or entire news articles.

Roughly speaking, to fact-check an article, we can analyze its contents (e.g., the language it uses) and the reliability of its source (a number between 0 and 1, where 0 is very unreliable and 1 is very reliable):

factuality(article) = reliability(language(article)) + reliability(website(article))

To fact-check a claim, we can retrieve articles discussing the claim, detect the stance of each article with respect to the claim, and take a weighted sum (here the stance is -1 if the article disagrees with the claim, 1 if it agrees, and 0 if it just discusses the claim or is unrelated to it):

factuality(claim) = sum_i [reliability(article_i) * stance(article_i, claim)]

In the former formula, the reliability of the website that hosts an article serves as a prior to compute the factuality of an article. Then, in the latter formula, we use the factuality of the retrieved articles to compute a factuality score for a claim. The idea is that if a reliable article agrees/disagrees with the claim, this is a good indicator for it being true/false; it is the other way around for unreliable articles.
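The two formulas can be sketched in a few lines of Python. This is only an illustration that takes the formulas literally; the function names and the example scores are hypothetical, and the reliability values would in practice come from trained models rather than being typed in by hand.

```python
from typing import List, Tuple


def article_factuality(language_reliability: float,
                       website_reliability: float) -> float:
    """First formula, taken literally: the sum of the two reliability scores.

    Each input is assumed to lie in [0, 1], so the result lies in [0, 2].
    """
    return language_reliability + website_reliability


def claim_factuality(evidence: List[Tuple[float, int]]) -> float:
    """Second formula: sum over retrieved articles of reliability * stance.

    Each item is (reliability of the article, stance towards the claim),
    where stance is -1 (disagrees), 0 (discusses/unrelated), or 1 (agrees).
    """
    return sum(reliability * stance for reliability, stance in evidence)


# Hypothetical evidence: two reliable articles agree with the claim,
# one unreliable article disagrees, and one article merely discusses it.
evidence = [(0.9, 1), (0.8, 1), (0.2, -1), (0.5, 0)]
score = claim_factuality(evidence)
print(f"claim score: {score:.2f}")
```

A positive score suggests that the claim is supported by the retrieved articles, and a negative one that it is contradicted; articles with stance 0 do not contribute at all.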

Of course, the formulas above are oversimplifications, e.g., one can fact-check a claim based on the reactions of users in social media, based on spread over time in social media, based on information in a knowledge graph or information from Wikipedia, based on similarity to previously checked claims, etc. Yet, they give the general idea that we need to estimate the reliability of the website on which the article was published. Interestingly, this problem has been largely ignored in previous work, and has only been addressed indirectly.

Thus, we focus on characterizing entire news outlets. This is much more useful than fact-checking claims or articles, as it is hardly feasible to fact-check every single piece of news. Fact-checking also takes time, both for human users and for automatic programs, as they need to monitor how trusted mainstream media report on the event, how users react to it in social media, etc., and it takes time to accumulate enough of this evidence to be able to make a reliable prediction. It is much more feasible to check the news outlets. That way, we can also fact-check the sources in advance. In a way, we can fact-check the news before it is even written! It is enough to check how trustworthy the outlets that publish it are. It is like in the movie “Minority Report”, where authorities could detect a crime before it was committed.

In general, fighting misinformation is not easy; as in the case of spam, this is an adversarial problem, where the malicious actors constantly change and improve their strategies. Yet, when they share news in social media, they typically post a link to an article that is hosted on some website. This is what we are exploiting: we try to characterize the news outlet where the article is hosted. This is also what journalists typically do: they first check the source.

Finally, even though we focus on the source, our work is also compatible with fact-checking a claim or a news article, as we can provide an important prior and thus help both algorithms and human fact-checkers that try to fact-check a particular news article or a claim.

Disinformation typically focuses on emotions, and political propaganda often discusses moral categories. There are many incentives for news outlets to publish articles that appeal to emotions: (i) this has a strong propagandistic effect on the target user, (ii) it makes the article more likely to be shared further by the users, and (iii) the article will be favored as a candidate to be shown in other users’ newsfeeds, as this is what algorithms on social media optimize for. And news outlets want to get users to share links to their content in social media, as this allows them to reach a larger audience. This kind of language also makes them detectable by AI algorithms such as ours; yet, they cannot do much about it, as changing the language would make their message less effective and would also limit its spread.

While analysis of language is the most important information source, we also consider information in Wikipedia, social media, traffic statistics, and the structure of the target site’s URL:

(i) the text of a few hundred articles published by the target news outlet (e.g., cnn.com), analyzing the style, subjectivity, sentiment, morality, vocabulary richness, etc.

https://news.mit.edu/2018/mit-csail-machine-learning-system-detects-fake-news-from-source-1004

(ii) the text of its Wikipedia page (if any), including its infobox, summary, content, categories, e.g., it might say that the website spreads false information and conspiracy theories:

(iii) its Twitter account (if any)

(iv) the Web traffic it attracts, e.g., compare the traffic for the original ABC to that for the fake one.

(v) the structure of its URL (is it long, does it contain meaningful words as in http://www.angrypatriotmovement.com/ and http://100percentfedup.com/? Does it end in “.co” instead of “.com”, e.g., http://abcnews.com.co?).
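To make the URL-based signals in (v) more concrete, here is a minimal Python sketch that extracts a few surface features from a URL. The function name and the particular features are hypothetical choices made for illustration, not the feature set used in our system; checking for “meaningful words” would additionally require a dictionary and is not attempted here.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract a few simple surface features from the host part of a URL."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # ignore the conventional "www." prefix
    labels = host.split(".")
    return {
        "host": host,
        "host_length": len(host),               # is the name long?
        "num_labels": len(labels),              # e.g., 3 for abcnews.com.co
        "ends_with_co": labels[-1] == "co",     # ".co" instead of ".com"
        "has_digits": any(ch.isdigit() for ch in host),
    }


# The example URLs are the ones mentioned in the text above.
for url in ("http://www.angrypatriotmovement.com/",
            "http://100percentfedup.com/",
            "http://abcnews.com.co"):
    print(url_features(url))
```

Features of this kind are cheap to compute and can be combined with the other information sources listed above.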

See our recent EMNLP-2018 publication for details (joint work with researchers from MIT-CSAIL and students from Sofia University): http://aclweb.org/anthology/D18-1389

We have extended the system to model factuality and left-vs-right bias jointly, as they are correlated, i.e., center media tend to be more factual, while partisan websites are less so. This has been shown to hold in general, and it is also the case for our data, which is based on Media Bias/Fact Check.

Finally, we are currently extending the system to handle news outlets in different languages, and initial experiments have been carried out with Arabic, as well as some other languages including Hungarian, Italian, Turkish, Greek, French, German, Danish, and Macedonian. This is based on a language-independent model, and it can potentially handle many more languages.

The Tanbih Project

Characterizing media in terms of factuality of reporting and bias is part of a larger effort at QCRI. In particular, we are developing a news aggregator, Tanbih (meaning alert or precaution in Arabic), which lets users know what they are reading. It features media profiles, which include automatic predictions for the factuality of reporting and left/right bias, as well as estimations of the likelihood that a news article is propagandistic, among other features. In addition to a website, we are developing versions for a mobile phone that run on Android and iPhone.

You can learn more about the project here: http://tanbih.qcri.org/

You can try the Web version here: http://www.tanbih.org

The App is in its early stages, but the plan for it is ambitious. It aims at helping users “step out of their bubble” and achieve a healthy “news diet”. In addition to the typical daily news feed as in Google News or FlipBoard, the app will find controversial topics and will help the readers step out of their bubble by showing them different viewpoints on the same topic, by informing them about the stance of an article with respect to a controversial topic, about the publisher/author in general (reliability, trustworthiness, biases, etc.), and also by providing background for the story.

As people are reading the news, they see whether an article is likely to be propagandistic and they can also click on a media outlet icon and get information about the media profile. Thus, they can always know what they are reading. For example, here is the profile we have for Aljazeera: https://www.tanbih.org/media/5

Here is what we have about Breitbart: https://www.tanbih.org/media/1320

And here is what it looks like for Sputnik: https://www.tanbih.org/media/293

We have also developed a tool for propaganda detection at the article level: http://proppy.qcri.org/

Moreover, we are looking into fine-grained detection of the use of specific propagandistic techniques inside a news article, which is the topic of an upcoming datathon: https://www.datasciencesociety.net/datathon/

Finally, there is also something to help the tedious work of human fact-checkers. We have developed a tool for detecting which claims are most worthy of fact-checking, which can help human fact-checkers in prioritizing their work, e.g., when analyzing a political debate. The tool supports both Arabic and English, and it can mimic the selection strategies of different fact-checking organizations: http://claimrank.qcri.org/

The Tanbih project is closely aligned with a collaboration project with MIT-CSAIL on Arabic Speech and Language Processing for Cross-Language Information Search and Fact Verification.

We are also collaborating with Sofia University, Qatar University, Northwestern University, the Data Science Society, as well as with some companies such as A Data Pro.

Our work has attracted a lot of media attention and was featured in 100+ media outlets including Forbes, Boston Globe, Defense One, Nextgov, Al Jazeera, MIT Technology Review, Science Daily, Popular Science, Fast Company, The Register, Engadget, VentureBeat, Silicon Republic, Geek, to mention just a few.

There have also been a number of national, mostly non-English publications, e.g., 24 Chasa (Bulgaria), Beritagar (Indonesia), BNT (Bulgaria), bTV (Bulgaria), Business Standard (India), Daily News (Egypt), Diario 16 (Spain), El Periódico (Spain), First Post (India), Formiche (Italy), Hindustan Times (India), Huanqiu (China), iThome (Taiwan), Kompas (Indonesia), Le Big Data (France), Milenio (Mexico), Naftemporiki (Greece), Rzecz Pospolita (Poland), TechGirl (Netherlands), The Hindu (India), Think Hong Kong (Hong Kong SAR, China), Times Now (India), Público (Spain), Sina (China), Svět Hardware (Czech Republic), UNAM Global (Mexico), Vesti (Russia), What Next (Poland).

See more here: http://tanbih.qcri.org/media-coverage/

The Future of “Fake News”

Overall, it is widely believed that “fake news” can affect, and has affected, elections, e.g., Brexit and the 2016 US Presidential elections. The true impact is unknown, but we should expect a large number of new actors to give it a try, and institutions should act to protect democracy. Here are some ideas: they can mandate transparency for political advertising, make imprint rules apply online, and try to make spreading disinformation economically unprofitable (as part of the “fake news” is generated and spread for the ad revenue it brings). Democratic institutions can also put pressure on tech companies to address the problem, as otherwise they might not want to do it: (i) tech companies do not want to become Ministries of Truth, and (ii) as “fake news” generates more user engagement, it means more profit from advertisements shown to users. Yet, any intervention has to be done very carefully! The focus should be on long-term solutions, and thinking should go beyond what Facebook, Twitter, and Google can do (they might not be around in 5-10 years, e.g., MySpace is now gone). Protecting freedom of speech should be a priority. Legislation can easily go wrong, as happened in Malaysia, where leading politicians from the opposition were put in prison for spreading “fake news”. Similarly, harsh laws in Germany (fines of up to 50M euro) and Russia (modeled after the German one) have caused worries of potential self-censorship.

Now, some predictions from a technological perspective. First, on the negative side, we expect further advances in “deep fakes” (machine-generated videos and images). This is a really scary development, but probably only in the long run; at this point, “deep fakes” are still easy to detect, both for AI and for experienced users.

We also expect advances in automatic news generation. This is already a reality, and a sizable part of the news we consume daily is machine-generated, e.g., about the weather, the markets, and sports events. Such software can describe a sports event from various perspectives: neutrally, or taking the side of the winning or the losing team. It is easy to see how this can be used for political misinformation and disinformation.

Yet, we hope that in the coming year “fake news” will go the way of spam: not entirely eliminated (as this is impossible), but put under control. AI has already helped a lot in the fight against spam, and we expect that it will play a key role in putting “fake news” under control as well. This would ultimately strengthen democracy.

A key element of the solution would be limiting the spread. The social media platforms are best positioned to do this on their own platforms, and they have already started working on this. Twitter suspended more than 70 million accounts in May and June 2018, and the pace continued in July; this can help in the fight against bots and botnets, which are the new link farms: 20% of the tweets during the 2016 U.S. Presidential campaign were shared by bots. Facebook, for its part, warns users when they try to share a news article that has been fact-checked and identified as fake by at least two trusted fact-checking organizations, and it also downgrades “fake news” in the news feed. We expect the AI tools used for this to get better, just like spam filters have improved over time.

Yet, the most important element of the fight against disinformation is raising user awareness. First, this would limit the effect of political propaganda as such: as Joseph Goebbels, Reich Minister of Propaganda of Nazi Germany, put it, “Propaganda becomes ineffective the moment we are aware of it”. Second, it would help limit the spread of disinformation, as users would be less likely to share it online. We hope that practical tools such as our news aggregator Tanbih will help in that respect.