Fake news has become a huge issue in our digitally connected world, and it is no longer limited to small squabbles -- fake news spreads like wildfire and affects millions of people every day.
How do you deal with such a sensitive issue? Countless articles are churned out on the internet every day -- how do you tell real from fake? It's not as easy as turning to a simple fact-checker, which is typically built on a story-by-story basis. As developers, can we turn to machine learning?
In this series we will look at two approaches to predicting whether a given article is fake. In this first article we take a more traditional supervised approach, training a model on labelled data, and use the Twilio WhatsApp API to run inference against our model. In the next article we will see how we can use advanced pre-trained NLP models like BERT, GPT-2, XLNet, and Grover to achieve the same goal.
Let's start by understanding a bit of background.
What is Fake News?
According to 30seconds.org:
"Fake news" is a term used to refer to fabricated news. Fake news is an invention -- a lie created out of nothing -- that takes the appearance of real news with the aim of deceiving people. This is what is important to remember: the information is false, but it seems true.
According to Wikipedia:
"Fake news (also known as junk news, pseudo-news, or hoax news) is a form of news consisting of deliberate disinformation or hoaxes spread via traditional news media (print and broadcast) or online social media."
The use of the web as a medium for consuming information is increasing daily. The amount of information posted to social media at any point is enormous, which makes validating its truthfulness a challenge. A key motivation for this work is that, on average, 62% of US adults rely on social media as their main source of news, while the quality of news circulating on social media has dropped substantially over the years.
Fake news is generated intentionally, often by obscure sources, and there are existing methodologies that individually validate a user's trustworthiness, the truthfulness of a piece of news, and user engagement on social media. Analysing these signals in isolation, however, does not capture the holistic factors that determine news credibility, so combining such auxiliary information with the news content itself is a promising direction. There have also been techniques that validate the writing style of users to classify news content, but these methods come with their own outliers and error rates.
Aim:
We will build a WhatsApp-based service which accepts news headlines from the user and predicts whether the given news is fake or not.
Requirements:
A Twilio account --- sign up for a free one here
A Twilio WhatsApp sandbox --- configure one here
Set up your Python and Flask developer environment --- make sure you have Python 3 installed, as well as ngrok.
TensorFlow
Let's build:
Now that we know what fake news is and why it's a major issue, let's jump into building a solution to fight this problem. We will be using the LIAR dataset by William Yang Wang, which he used in his research paper titled "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection.
The original dataset comes with the following columns:
Column 1: the ID of the statement ([ID].json).
Column 2: the label.
Column 3: the statement.
Column 4: the subject(s).
Column 5: the speaker.
Column 6: the speaker's job title.
Column 7: the state info.
Column 8: the party affiliation.
Column 9-13: the total credit history count, including the current statement.
- 9: barely true counts.
- 10: false counts.
- 11: half true counts.
- 12: mostly true counts.
- 13: pants on fire counts.
Column 14: the context (venue / location of the speech or statement).
For simplicity, we have converted it to a two-column format (a sketch of this conversion follows the column list below):
Column 1: Statement (News headline or text).
Column 2: Label (Label class contains: True, False)
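If you want to derive this two-column format yourself from the original LIAR TSV files, a minimal sketch with pandas could look like the following. The file name, column names, and the True/False mapping of the six original labels are illustrative assumptions, not part of the dataset itself:
import pandas as pd

# Hypothetical sketch: collapse the original 14-column LIAR TSV into
# the two-column (Statement, Label) format used in this article.
cols = ["id", "label", "statement", "subject", "speaker", "job", "state",
        "party", "barely_true", "false", "half_true", "mostly_true",
        "pants_on_fire", "context"]
liar = pd.read_csv("train.tsv", sep="\t", header=None, names=cols)

# Assumed mapping of the six fine-grained labels onto True/False.
true_labels = {"true", "mostly-true", "half-true"}
liar["Label"] = liar["label"].isin(true_labels)
liar[["statement", "Label"]].rename(columns={"statement": "Statement"}) \
    .to_csv("train_2col.csv", index=False)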
You can find the modified dataset here. Now that we have a dataset, let's start building a machine learning model.
Step 1: Preprocessing:
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step in any machine learning project: real-world data rarely arrives clean and well formatted, so before doing anything else we need to clean it and put it into a consistent shape. That is what the preprocessing step is for.
The file preprocessing.py contains all the preprocessing functions needed to process the input documents and texts. First we read the train, test and validation data files, then perform some preprocessing such as tokenizing and stemming. Some exploratory data analysis is also performed, such as looking at the response variable distribution and running data quality checks for null or missing values (a sketch of these loading and EDA steps follows the code below).
#Stemming
def stem_tokens(tokens, stemmer):
    stemmed = []
    for token in tokens:
        stemmed.append(stemmer.stem(token))
    return stemmed

#process the data
#(eng_stemmer and stopwords are defined elsewhere in preprocessing.py,
# e.g. an NLTK English stemmer and the NLTK English stop-word set)
def process_data(data, exclude_stopword=True, stem=True):
    tokens = [w.lower() for w in data]
    tokens_stemmed = tokens
    if stem:
        tokens_stemmed = stem_tokens(tokens, eng_stemmer)
    if exclude_stopword:
        tokens_stemmed = [w for w in tokens_stemmed if w not in stopwords]
    return tokens_stemmed

#creating ngrams
#unigram
def create_unigram(words):
    assert type(words) == list
    return words

#bigram
def create_bigrams(words):
    assert type(words) == list
    skip = 0
    join_str = " "
    Len = len(words)
    if Len > 1:
        lst = []
        for i in range(Len - 1):
            for k in range(1, skip + 2):
                if i + k < Len:
                    lst.append(join_str.join([words[i], words[i + k]]))
    else:
        #fall back to unigrams for single-word input
        lst = create_unigram(words)
    return lst
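For the data loading and exploratory checks mentioned above, a minimal sketch with pandas might look like this. The file names and column names are assumptions based on the two-column format introduced earlier:
import pandas as pd

# Hypothetical sketch of the loading and EDA steps described above.
train = pd.read_csv("train.csv")   # columns: Statement, Label (assumed)
test = pd.read_csv("test.csv")
valid = pd.read_csv("valid.csv")

# Response variable distribution
print(train["Label"].value_counts())

# Data quality check: null or missing values
print(train.isnull().sum())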
Step 2: Feature Selection:
For feature selection, we have used methods like a simple bag-of-words, n-grams, and term-frequency weighting such as tf-idf (a sketch of this feature extraction follows the code below). We have also written code to extract word2vec and POS-tagging features, though these have not been used in the project at this point.
We are looking at the following features:
def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }
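The bag-of-words, n-gram and tf-idf features themselves come from sklearn's vectorizers. A minimal sketch of how such features could be built (the placeholder texts and the (1, 2) n-gram range are illustrative assumptions, not the project's exact settings):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder data purely for illustration.
train_texts = ["some example headline", "another example headline"]

# Bag-of-words / n-gram counts.
count_vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X_counts = count_vec.fit_transform(train_texts)

# tf-idf weighted n-gram features.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X_tfidf = tfidf_vec.fit_transform(train_texts)

print(X_counts.shape, X_tfidf.shape)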
Step 3: Classification:
Here we build the classifiers for fake news detection. The extracted features are fed into different classifiers: we have used Naive Bayes, Logistic Regression, Linear SVM, Stochastic Gradient Descent, and Random Forest classifiers from sklearn. Each of the extracted feature sets was used with all of the classifiers. After fitting each model, we compared F1 scores and inspected the confusion matrices.
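As an illustration, a sketch of one such model (a tf-idf vectorizer feeding a Logistic Regression classifier) might look like the following. The tiny placeholder data, variable names, and hyperparameters here are assumptions for illustration; in the project the model is fit on the preprocessed LIAR splits:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score, confusion_matrix

# Placeholder data purely for illustration; in the project these come from
# the preprocessed LIAR train/test splits.
train_texts = ["the economy grew last year", "aliens built the pyramids",
               "unemployment fell in march", "vaccines contain microchips"]
train_labels = [True, False, True, False]
test_texts = ["the moon landing was staged", "inflation rose slightly"]
test_labels = [False, True]

# One feature-extractor/classifier pair as a single pipeline.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)

print(confusion_matrix(test_labels, predictions))
print("f1-Score:", f1_score(test_labels, predictions))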
Confusion matrices and F1 scores with n-gram and tf-idf features:
#Naive bayes
[841 3647]
[427 5325]
f1-Score: 0.723262051071
#Logistic regression
[1617 2871]
[1097 4655]
f1-Score: 0.70113000531
#svm
[2016 2472]
[1524 4228]
f1-Score: 0.67909201429
#sgdclassifier
[ 10 4478]
[ 13 5739]
f1-Score: 0.718731637053
#random forest
[1979 2509]
[1630 4122]
f1-Score: 0.665720333284
After fitting all the classifiers, the two best-performing models were selected as candidate models for fake news classification. We performed parameter tuning with sklearn's GridSearchCV on these candidate models and chose the best-performing parameters for each classifier. Finally, the selected model was used for fake news detection, returning a probability of truth alongside the label. In addition, we extracted the top 50 features from our tf-idf vectorizer to see which words are most important for each class, and used precision-recall and learning curves to see how the training and test sets perform as we increase the amount of data fed to the classifiers.
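A sketch of what such a grid search could look like for the Logistic Regression pipeline sketched above (the parameter grid shown here is an illustrative assumption, not the exact grid used in the project):
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid for the tf-idf + Logistic Regression pipeline
# defined in the classification sketch above.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because of the tiny placeholder data; use a larger cv (e.g. 5)
# on the real dataset.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(train_texts, train_labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)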
Step 4: Prediction:
Our final, best-performing classifier was Logistic Regression, which was saved to disk as final_model.sav. Once you clone this repository, this model will be copied to your machine and used by the prediction.py file to classify fake news (a sketch of how such a model file could be produced follows the code below). It takes a news article as input from the user, runs the model, and returns the final classification to the user along with the probability of truth.
import pickle

def detecting_fake_news(var):
    #retrieving the best model for the prediction call
    load_model = pickle.load(open('final_model.sav', 'rb'))
    prediction = load_model.predict([var])
    prob = load_model.predict_proba([var])
    return prediction, prob
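For completeness, a file like final_model.sav could be produced by pickling a fitted pipeline such as the illustrative one from the classification sketch earlier (this is a sketch of the idea, not necessarily the exact object saved in the repository):
import pickle

# Hypothetical sketch: persist the fitted `pipeline` from the classification
# sketch above so that prediction.py can later load it from final_model.sav.
with open('final_model.sav', 'wb') as f:
    pickle.dump(pipeline, f)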
Step 5: Integrating Twilio WhatsApp API:
We now have to write code that accepts a news headline or text from the Twilio WhatsApp API and passes it to our model for prediction. For this we will use a Python Flask server. (You can follow a similar process for the SMS API as well.)
The following script will do that:
from flask import Flask, request
import prediction
from twilio.twiml.messaging_response import MessagingResponse

app = Flask(__name__)

@app.route('/sms', methods=['POST'])
def sms():
    resp = MessagingResponse()
    #grab the incoming WhatsApp message body
    inbMsg = request.values.get('Body')
    #run the saved model on the incoming text
    pred, confidence = prediction.detecting_fake_news(inbMsg)
    resp.message(
        f'The news headline you entered is {pred[0]!r} and corresponds to {confidence[0][1]!r}.')
    return str(resp)

if __name__ == '__main__':
    app.run()
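Before wiring up Twilio, you can sanity-check the endpoint locally. Twilio sends a form-encoded POST with a Body field, so a small sketch using the requests library (with a made-up example headline) might look like this:
import requests

# Hypothetical local test: simulate Twilio's form-encoded POST against the
# running Flask app and print the TwiML response.
resp = requests.post(
    "http://localhost:5000/sms",
    data={"Body": "Scientists confirm the moon is made of cheese"},
)
print(resp.text)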
Now you have to generate an endpoint which can be accessed by the Twilio WhatsApp Sandbox.
Your Flask app will need to be visible from the web so Twilio can send requests to it. Ngrok lets us do this. With ngrok installed, run ngrok http 5000 in a new terminal tab, from the directory your code is in.
Grab the ngrok URL and use it to configure the Twilio WhatsApp Sandbox. We will try this on WhatsApp, so let's go ahead and do it (either on the Sandbox if you want to do testing, or on your main WhatsApp Sender number if you have one provisioned). The screenshot below shows the Sandbox page:
And we're good to go! Let's test our application on WhatsApp. If everything works as expected, we can send news headlines or facts to this sandbox and get predictions in return.
Hurray! Want to try this yourself? The complete code is available on GitHub.
What's next
This was a very basic implementation with limited data, but I hope it gives you an idea of the cool things you can do with TensorFlow and Twilio. You can tweak this project and use different datasets to build something even cooler! So, what are you planning to build? Tell me in the comments below or hit me up on Twitter with your ideas and I will be happy to collaborate!
In the next part, we will see how we can use advanced pre-trained NLP models like BERT, GPT-2, XLNet, and Grover to achieve our goal!