Introduction
The world has become increasingly digital, and there is an abundance of data available. Not all of it can be checked before it is put out into the world, and as the amount of data grows, some of it will be true while the rest will be false. Not every source can be independently verified, and doing so manually is simply not feasible.
Machine Learning occupies a unique position here: when used correctly, it can build a model from a trusted dataset and then use that model to sort through news. This project sets out to develop a model that analyses a piece of text and determines whether or not it is true news.
Diving into the Project
The Data
The data used for this project comes from the Fake and real news dataset on Kaggle. For a simple guide on loading data from Kaggle into Google Colab, check out this blog post.
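However you get the files, a minimal loading sketch with pandas might look like this (the file names True.csv and Fake.csv are assumed from the dataset, and the files are assumed to be in the working directory):
import pandas as pd
# Load the two halves of the Kaggle dataset (file names assumed)
true = pd.read_csv('True.csv')
fake = pd.read_csv('Fake.csv')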
Data Cleaning
Once the data has been loaded, a bit of cleaning is needed.
true['label'] = 1
fake['label'] = 0
Labelling at this stage gives the model a numeric target it can interpret: true news is labelled 1, while fake news is labelled 0.
To speed up the experiment, only the first 5,000 rows of each dataset are used and then combined into a single data frame.
frames = [true.loc[:5000], fake.loc[:5000]]
df = pd.concat(frames)
The data frame is then divided into features (X) and labels (y).
X = df.drop('label', axis=1)
y = df['label']
Missing values are then dropped, and a copy of the data frame is made for later use.
df = df.dropna()
df2 = df.copy()
df2.reset_index(inplace=True)
Text Preprocessing
Preprocessing is the process of converting data into a format that a computer can understand and use. For text data, a common preprocessing step is removing words that carry little information, known as stop words. Stop words are commonly used words that programs and search engines are instructed to ignore; examples include 'a', 'i', 'me', 'my', 'the', and 'you'.
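As a quick illustration, NLTK ships with a ready-made English stop word list that you can inspect (this assumes the stopwords corpus has already been downloaded with nltk.download('stopwords')):
from nltk.corpus import stopwords
# Print the first few English stop words
print(stopwords.words('english')[:10])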
Continuing with the fake news project, the preprocessing is done with nltk, a Python package used for text preprocessing.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
import nltk
After importing the required libraries, the next step is to remove all punctuation, lowercase every character, remove all stop words, and then stem. Stemming is the process of reducing words in the dataset to their base forms; for example, words like "likes", "liked", "likely", and "liking" are all reduced to "like". This is done to cut down on redundancy in the data the model sees.
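As a quick check, you can run the Porter stemmer on those example words (a small illustrative snippet, not part of the project pipeline):
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Each variant should reduce to the base form 'like'
print([ps.stem(w) for w in ['likes', 'liked', 'likely', 'liking']])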
Regex is used in this section; if you're not familiar with it, you can get an introduction to it here.
nltk.download('stopwords')
ps = PorterStemmer()
corpus = []
for i in range(0, len(df2)):
    # Keep letters only, then lowercase and split into individual words
    review = re.sub('[^a-zA-Z]', ' ', df2['text'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and stem whatever remains
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
The next step is turning the cleaned text into numeric features that a machine learning model can work with. Word embedding techniques such as Word2Vec, GloVe, and BERT are popular ways to do this, but a simpler TF-IDF representation is sufficient for this project.
TF-IDF (term frequency-inverse document frequency) is a statistical method for capturing how significant a term is to a document relative to the rest of the corpus. It is ideal for information retrieval and for extracting keywords from a document.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_features=5000, ngram_range=(1,3))
X = tfidf_v.fit_transform(corpus).toarray()
y = df2['label']
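If you want to peek at what the vectorizer learned, the matrix shape and a few feature names can be inspected (get_feature_names_out is available in scikit-learn 1.0 and later; older versions use get_feature_names):
# 5000 columns, one per unigram/bigram/trigram kept by max_features
print(X.shape)
print(tfidf_v.get_feature_names_out()[:10])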
Once that is done, the next step involves splitting the dataset into train and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Training and Validating the Model
The data has been split and is ready for modelling. For this project, the PassiveAggressiveClassifier is used: an online learning algorithm that works very well for detecting fake news. Other algorithms could be used in this step instead, such as logistic regression, XGBoost, or neural networks. For a more detailed explanation of the classifier, check here.
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import metrics
import numpy as np
import itertools
classifier = PassiveAggressiveClassifier(max_iter=1000)
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)
A confusion matrix is then used to visualize the results. If you want to learn more about the confusion matrix, you can check out my previous article
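A minimal way to compute and plot it with scikit-learn could look like this (ConfusionMatrixDisplay requires a reasonably recent scikit-learn version, and the label names are assumptions):
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows are the true classes (0 = fake, 1 = true), columns the predicted classes
cm = confusion_matrix(y_test, pred)
ConfusionMatrixDisplay(cm, display_labels=['Fake', 'True']).plot()
plt.show()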
For the validation process, a random article from the portion of the fake news data that was not used for training is run through the same preprocessing, vectorization, and prediction steps.
# Validation
import random

# Pick a random article from the part of the fake news data not used for training
r1 = random.randint(5001, len(fake) - 1)
review = re.sub('[^a-zA-Z]', ' ', fake['text'][r1])
review = review.lower()
review = review.split()
review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
review = ' '.join(review)
# Vectorization
val = tfidf_v.transform([review]).toarray()
# Predict (0 means the article is classified as fake)
classifier.predict(val)
To save the model and the vectorizer, we make use of the pickle package.
import pickle
pickle.dump(classifier, open('model2.pkl', 'wb'))
pickle.dump(tfidf_v, open('tfidfvect2.pkl', 'wb'))
The model and vectorizer are then loaded back to confirm the results.
# Load the model and vectorizer back from disk
loaded_model = pickle.load(open('model2.pkl', 'rb'))
loaded_vect = pickle.load(open('tfidfvect2.pkl', 'rb'))
val_pkl = loaded_vect.transform([review]).toarray()
# This should give the same prediction as before
loaded_model.predict(val_pkl)
Deploying the model
This section requires some experience with Flask. There are many ways to deploy a model, but this one is deployed using Flask. The app.py used can be found on GitHub here and the index.html here.
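As a rough idea of what such an app might look like, here is a minimal sketch (the route, form field name, and template are assumptions rather than the repo's actual code; in practice the input text should also go through the same preprocessing as the training data):
from flask import Flask, render_template, request
import pickle

app = Flask(__name__)

# Load the saved classifier and vectorizer from the previous step
model = pickle.load(open('model2.pkl', 'rb'))
vectorizer = pickle.load(open('tfidfvect2.pkl', 'rb'))

@app.route('/', methods=['GET', 'POST'])
def home():
    result = None
    if request.method == 'POST':
        text = request.form.get('text', '')
        vec = vectorizer.transform([text]).toarray()
        # 1 = true news, 0 = fake news
        result = 'True news' if model.predict(vec)[0] == 1 else 'Fake news'
    # index.html is assumed to live in a templates/ folder
    return render_template('index.html', result=result)

if __name__ == '__main__':
    app.run(debug=True)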
The code for this project is available at this repo
Bringing it All Together
This blog post has walked through the steps of downloading the data, cleaning it, building the model, validating it, and finally deploying it with Flask. Thank you for reading. Any feedback is appreciated.