Sentiment analysis can be an invaluable tool for organizations to identify and address their customers’ pain points. In a Repustate case study, a bank in South Africa noticed that many users had stopped doing business with them and they were concerned as to why this was happening. To gain further clarity on this issue, they collected social media data to understand what their customers were saying about them.
The bank realized that many of their clients were dissatisfied with the customer service: waiting times were long (especially during lunch and peak hours) and even the operating hours were inconvenient. By performing sentiment analysis on over 2 million pieces of text data, they not only identified the issue but also learned how to resolve it.
Management improved the operating hours and increased the number of tellers in each branch. They also made sure teller stations were never unmanned during lunchtime or peak hours so that customers were served on time. As a result, there was a significant drop in customer churn rates and a rise in the number of new clients.
Essentially, sentiment analysis is the process of mining text data to extract the underlying emotion behind it in order to add value and pinpoint critical issues in a business. In this tutorial, I will show you how to build your own sentiment analysis model in Python, step by step.
Table of Contents:
- How to Perform Sentiment Analysis in Python?
- Python Pre-Requisites
- Reading the Dataset
- Data Preprocessing
- TF-IDF Transformation
- Building and Evaluating the ML Model
- Sentiment Analysis in Python: Next Steps
How to Perform Sentiment Analysis in Python?
You’re probably already familiar with Python, but if not – it is a powerful programming language with an intuitive syntax. Not to mention it’s one of the most popular choices across the data science community, which makes it perfect for our tutorial.
We will use the Trip Advisor Hotel Reviews Kaggle dataset for this analysis, so make sure to have it downloaded before you start to be able to code along.
Step 1: Python Pre-Requisites
First things first: installing the necessary tools. You need a Python IDE – I suggest using Jupyter. (If you don’t already have it, follow this Jupyter Notebook tutorial to set it up on your device.)
Make sure to have the following libraries installed as well: NumPy, pandas, Matplotlib, seaborn, and scikit-learn. We will also use re, Python’s built-in regular expression module, which doesn’t need to be installed separately.
Step 2: Reading the Dataset
Let’s start by loading the dataset into Python and reading the head of the data frame:
import pandas as pd
df = pd.read_csv('tripadvisor_hotel_reviews.csv')
df.head()
The code above displays the first five rows of the data frame.
This dataset only has 2 variables: “Review” which contains guests’ impressions of the hotel and “Rating” - the corresponding numerical evaluation (or, in simpler terms, the number of stars they’ve left).
Now, let’s take a look at the number of rows in the data frame:
len(df.index) # 20491
We learn that it comprises 20,491 reviews.
Step 3: Data Preprocessing
As we already know, the TripAdvisor dataset has 2 variables – user reviews and ratings, which range from 1 to 5. We will use “Rating” to create a new variable called “Sentiment.” In it, we will add 3 categories of sentiment as follows:
- 1 and 2 will be encoded as -1 as they indicate negative sentiment
- 3 will be labeled as 0 as it has a neutral sentiment
- 4 and 5 will be labeled as +1 as they indicate positive sentiment
Let’s create a Python function to accomplish this categorization:
import numpy as np
def create_sentiment(rating):
    if rating == 1 or rating == 2:
        return -1  # negative sentiment
    elif rating == 4 or rating == 5:
        return 1  # positive sentiment
    else:
        return 0  # neutral sentiment
df['Sentiment'] = df['Rating'].apply(create_sentiment)
Now, let’s take a look at the head of the data frame again:
Notice that we have a new column called “Sentiment” – this will be our target variable. We will train a machine learning model to predict the sentiment of each review.
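Before training, it can be worth checking how the reviews are spread across the three classes, since a skewed target changes how an accuracy score should be read. Here is a minimal sketch of that check. Note that `toy_df` and its ratings are made up for illustration and are not the TripAdvisor data:

```python
import pandas as pd

def create_sentiment(rating):
    if rating == 1 or rating == 2:
        return -1  # negative sentiment
    elif rating == 4 or rating == 5:
        return 1  # positive sentiment
    else:
        return 0  # neutral sentiment

# Hypothetical ratings, used only to demonstrate the check
toy_df = pd.DataFrame({'Rating': [5, 4, 1, 3, 5, 2, 4, 5]})
toy_df['Sentiment'] = toy_df['Rating'].apply(create_sentiment)

# Share of each sentiment class, ordered as -1, 0, 1
class_share = toy_df['Sentiment'].value_counts(normalize=True).sort_index()
print(class_share)
```

Running the same two lines on the real `df['Sentiment']` column would show how balanced (or not) the actual target is.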
First, however, we need to preprocess the “Review” column in order to remove punctuation, special characters, and digits. The code looks like this:
import re
def clean_data(review):
    # drop everything that is not a word character or whitespace
    no_punc = re.sub(r'[^\w\s]', '', review)
    # drop digits
    no_digits = ''.join([i for i in no_punc if not i.isdigit()])
    return no_digits
In this way, we will eliminate unnecessary noise and only retain information that is valuable to the final sentiment analysis.
Shall we take a look at the first review in the data frame to see what kind of punctuation we’d be removing?
df['Review'][0]
Notice that it contains commas. The preprocessing function will deal with those. Apply it to this column and let’s look at the review again:
df['Review'] = df['Review'].apply(clean_data)
df['Review'][0]
All the commas are gone and we are left with clean text data.
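To see exactly what the cleaning step does, here is a small sketch that runs the same function on a single string. The `sample` sentence is invented for illustration:

```python
import re

def clean_data(review):
    # drop everything that is not a word character or whitespace
    no_punc = re.sub(r'[^\w\s]', '', review)
    # drop digits
    no_digits = ''.join([ch for ch in no_punc if not ch.isdigit()])
    return no_digits

# Hypothetical review text, not taken from the dataset
sample = "Great stay, 2 nights! Room #12 wasn't ready."
cleaned = clean_data(sample)
print(repr(cleaned))
```

Two side effects are worth knowing about: apostrophes are removed too, so "wasn't" becomes "wasnt", and removing digits leaves doubled spaces behind. The extra whitespace is harmless here because the vectorizer in the next step splits the text into word tokens anyway.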
Step 4: TF-IDF Transformation
Now, we need to convert this text data into a numeric representation so that it can be ingested into the ML model. We will do this with the help of scikit-learn’s TfidfVectorizer class.
TF-IDF stands for “term frequency-inverse document frequency” – a statistical measure that tells us how relevant a word is to a document in a collection. In simpler terms, it converts each document into a vector of numbers in which every word of the vocabulary gets its own weight.
TF-IDF is calculated based on 2 metrics:
- Term frequency
- Inverse document frequency
Let’s look at each individually.
Term Frequency
It’s really what it says on the tin – how many times a term is repeated in a single document. Words that appear more frequently in a piece of text are considered to have a lot of importance. For example, in this sentiment analysis tutorial, we repeat the words “sentiment” and “analysis” multiple times, therefore, an ML model will consider them highly relevant.
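As a rough illustration, term frequency can be computed by hand with a counter. This is a sketch of the textbook variant (count divided by document length) on a made-up sentence. scikit-learn's `TfidfVectorizer` uses raw counts by default and normalizes the final vector instead, so the numbers below will not match its output exactly:

```python
from collections import Counter

# Hypothetical one-sentence "document"
document = "the room was clean and the staff was friendly"
tokens = document.split()

# How many times each term occurs in this document
term_counts = Counter(tokens)

# Relative term frequency: occurrences divided by document length
term_frequency = {term: count / len(tokens) for term, count in term_counts.items()}

for term, tf in sorted(term_frequency.items(), key=lambda item: -item[1]):
    print(f"{term}: {tf:.3f}")
```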
Inverse Document Frequency
Rather than focusing on a single document, inverse document frequency looks at how many documents in the collection contain a given word. And, opposite to the previous metric, the more documents a word appears in, the lower its relevance. This helps the algorithm down-weight naturally occurring words such as “a”, “the”, and “and”, as they appear frequently across all documents in a corpus.
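Here is a small worked sketch of the idea, using the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the one scikit-learn documents for its default `smooth_idf=True` setting. The three documents and the helper name `smoothed_idf` are made up for illustration:

```python
import math

# Hypothetical mini-corpus
documents = [
    "the room was clean",
    "the staff was rude",
    "the breakfast was excellent",
]

def smoothed_idf(term, docs):
    n_docs = len(docs)
    # number of documents that contain the term at least once
    doc_freq = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_the = smoothed_idf("the", documents)    # appears in every document
idf_rude = smoothed_idf("rude", documents)  # appears in one document

print(idf_the, idf_rude)
```

A word that shows up in every document gets the lowest possible weight, while a rarer word such as "rude" is weighted higher.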
Now that you understand how TF-IDF works, let’s use this algorithm to vectorize our data:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
X = tfidf.fit_transform(df['Review'])
We have successfully transformed the reviews in our dataset into a sparse matrix of TF-IDF weights, with one row per review, that can be fed into a machine learning algorithm!
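If you are curious what the vectorizer actually produces, the sketch below fits one on three invented reviews and inspects the result. The names `toy_reviews`, `toy_tfidf`, and `toy_X` are placeholders for this example only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reviews, not taken from the dataset
toy_reviews = [
    "nice hotel great location",
    "terrible hotel rude staff",
    "great staff great breakfast",
]

toy_tfidf = TfidfVectorizer(lowercase=False)
toy_X = toy_tfidf.fit_transform(toy_reviews)

# One row per review, one column per distinct word
print(toy_X.shape)

# The learned vocabulary maps each word to its column index
print(sorted(toy_tfidf.vocabulary_))

# Dense view of the weights (fine for a toy corpus, too large for the full dataset)
print(toy_X.toarray().round(2))
```

On the real data, `X.shape` would have 20,491 rows and one column for every distinct word found in the reviews.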
Step 5: Building and Evaluating the Machine Learning Model
We can now train our algorithm on the review data to classify its sentiment into 3 categories:
- Positive
- Negative
- Neutral
First, let’s perform a train-test split:
from sklearn.model_selection import train_test_split
y = df['Sentiment'] # target variable
X_train, X_test, y_train, y_test = train_test_split(X,y)
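By default, train_test_split holds out 25% of the rows for testing and shuffles them differently on every run. If you want a reproducible split that also keeps the class proportions the same in both parts, one option is to pass `random_state` and `stratify`. This is a variation on the split used in this tutorial, not the one that produced the results below, and the toy arrays are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels: 12 positive, 4 neutral, 4 negative
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([1] * 12 + [0] * 4 + [-1] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.25,    # same as the default
    random_state=42,   # fixes the shuffle so results are repeatable
    stratify=toy_y,    # keeps class proportions equal in train and test
)

print(len(y_tr), len(y_te))
print(np.unique(y_te, return_counts=True))
```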
Now, fit a logistic regression classifier on the training dataset and use it to make predictions on the test data:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')
lr.fit(X_train,y_train) # fit the model
preds = lr.predict(X_test) # make predictions
Finally, evaluate the performance:
from sklearn.metrics import accuracy_score
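accuracy_score(y_test, preds) # 0.86
Our model has an accuracy of approximately 0.86, which is quite good for a first model. Because train_test_split shuffles the data randomly, your exact figure may differ slightly from run to run.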
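Accuracy is a single number, so it can hide a class the model handles poorly. A confusion matrix and a per-class report give a fuller picture. The sketch below uses invented labels (`y_true` and `y_pred` are not the tutorial's results) just to show how the two functions are called:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true and predicted labels
y_true = [1, 1, 1, 1, 1, 1, 0, 0, -1, -1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, -1, 1]

labels = [-1, 0, 1]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall, and F1 for each class
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

In this made-up example, 7 of 10 predictions are correct, yet the neutral class is never predicted at all, which is exactly the kind of problem accuracy alone would not reveal. On the real data, you would pass `y_test` and `preds` instead.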
And that concludes our tutorial! For a better understanding of the concept, here is the complete sentiment analysis Python code I’ve used:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
df = pd.read_csv('tripadvisor_hotel_reviews.csv')
def create_sentiment(rating):
    res = 0  # neutral sentiment
    if rating == 1 or rating == 2:
        res = -1  # negative sentiment
    elif rating == 4 or rating == 5:
        res = 1  # positive sentiment
    return res
df['Sentiment'] = df['Rating'].apply(create_sentiment)
def clean_data(review):
    no_punc = re.sub(r'[^\w\s]', '', review)
    no_digits = ''.join([i for i in no_punc if not i.isdigit()])
    return no_digits
df['Review'] = df['Review'].apply(clean_data)
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
X = tfidf.fit_transform(df['Review'])
y = df['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X,y)
lr = LogisticRegression(solver='liblinear')
lr.fit(X_train,y_train)
preds = lr.predict(X_test)
accuracy_score(y_test, preds)
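Once a model is trained, a natural follow-up is scoring reviews it has never seen. The key point is that new text must go through the same fitted vectorizer before prediction. One way to guarantee that is to bundle both steps in a scikit-learn pipeline. This is a sketch on a tiny invented corpus, so `toy_reviews`, `toy_labels`, and `new_reviews` are all placeholders, and a model trained on six sentences is only good for demonstrating the mechanics:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = positive, -1 = negative
toy_reviews = [
    "great hotel lovely staff",
    "wonderful stay great location",
    "lovely room wonderful breakfast",
    "terrible hotel rude staff",
    "dirty room awful service",
    "awful stay terrible breakfast",
]
toy_labels = [1, 1, 1, -1, -1, -1]

# The pipeline vectorizes the text and then classifies it
model = make_pipeline(
    TfidfVectorizer(lowercase=False),
    LogisticRegression(solver='liblinear'),
)
model.fit(toy_reviews, toy_labels)

# New, unseen text is transformed with the vocabulary learned during fit
new_reviews = ["wonderful lovely hotel", "awful dirty hotel"]
predictions = model.predict(new_reviews)
print(predictions)
```

With the objects from the tutorial, the equivalent would be `lr.predict(tfidf.transform(new_reviews))`, after running the new text through `clean_data`.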
Sentiment Analysis in Python: Next Steps
If you managed to follow along with this tutorial – congratulations! You successfully performed an entire project on sentiment analysis in Python.
The techniques explained here are similar to the tasks I work on at my data science job. Real-world sentiment analyses are often tied to a business use case, like the South African bank example, so it is a good idea to learn not just the fundamentals, but also how to apply data science in marketing. This domain expertise will set you apart from other candidates looking to land a data science job.
If you’d like to expand your knowledge of data science for marketing, sign up for the 365 Data Science platform, which offers a range of courses for technical and professional development in the realm of data and all its fundamentals. A great start to learn how to apply technical concepts to solve business problems is their Introduction to Business Analytics course.