How to Perform Sentiment Analysis with Python?

Sentiment analysis can be an invaluable tool for organizations to identify and address their customers’ pain points. In a Repustate case study, a bank in South Africa noticed that many users had stopped doing business with them and they were concerned as to why this was happening. To gain further clarity on this issue, they collected social media data to understand what their customers were saying about them.

The bank realized that many of their clients were dissatisfied with the customer service: waiting times were long (especially during lunch and peak hours) and even the operating hours were inconvenient. By performing sentiment analysis on over 2 million pieces of text data, they not only identified the issue but also learned how to resolve it.

Management improved the operating hours and increased the number of tellers in each branch. They also made sure that teller stations were never left unmanned during lunchtime or peak hours, so that customers were served on time. As a result, there was a significant drop in customer churn rates and a rise in the number of new clients.

Essentially, sentiment analysis is the process of mining text data to extract the underlying emotion behind it in order to add value and pinpoint critical issues in a business. In this tutorial, I will show you how to build your own sentiment analysis model in Python, step by step.

How to Perform Sentiment Analysis in Python?

You’re probably already familiar with Python, but if not – it is a powerful programming language with an intuitive syntax. Not to mention it’s one of the most popular choices across the data science community, which makes it perfect for our tutorial.

We will use the Trip Advisor Hotel Reviews Kaggle dataset for this analysis, so make sure to have it downloaded before you start to be able to code along.

Step 1: Python Pre-Requisites

First things first: installing the necessary tools. You need a Python development environment, and I suggest using Jupyter. (If you don’t already have it, follow this Jupyter Notebook tutorial to set it up on your device.)

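Make sure to have the following libraries installed as well: NumPy, pandas, Matplotlib, seaborn, and scikit-learn. Regular expressions are handled by Python’s built-in re module, so there is nothing extra to install for them.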

Step 2: Reading the Dataset

Let’s start by loading the dataset into Python and reading the head of the data frame:

import pandas as pd
df = pd.read_csv('tripadvisor_hotel_reviews.csv')
df.head()

The code above should render the following output:

This dataset only has 2 variables: “Review” which contains guests’ impressions of the hotel and “Rating” - the corresponding numerical evaluation (or, in simpler terms, the number of stars they’ve left).

Now, let’s take a look at the number of rows in the data frame:

len(df.index) # 20491

We learn that it comprises 20,491 reviews.

Step 3: Data Preprocessing

As we already know, the TripAdvisor dataset has 2 variables: user reviews and ratings, which range from 1 to 5. We will use “Rating” to create a new variable called “Sentiment.” In it, we will add 3 categories of sentiment as follows:

1 and 2 will be encoded as -1 as they indicate negative sentiment
3 will be labeled as 0 as it has a neutral sentiment
4 and 5 will be labeled as +1 as they indicate positive sentiment

Let’s create a Python function to accomplish this categorization:

import numpy as np

def create_sentiment(rating):
    
    if rating==1 or rating==2:
        return -1 # negative sentiment
    elif rating==4 or rating==5:
        return 1 # positive sentiment
    else:
        return 0 # neutral sentiment

df['Sentiment'] = df['Rating'].apply(create_sentiment)

Now, let’s take a look at the head of the data frame again:

Notice that we have a new column called “Sentiment” – this will be our target variable. We will train a machine learning model to predict the sentiment of each review.
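
Before training anything, it can help to check how the three classes are distributed, since a heavily skewed target affects how you should read the final score. Here is a minimal sketch of that check. The ratings below are made up for illustration (they are not taken from the TripAdvisor file), and the dictionary lookup is just an alternative way of writing the same mapping as create_sentiment:

```python
import pandas as pd

# Hypothetical ratings standing in for the real "Rating" column
toy = pd.DataFrame({"Rating": [5, 4, 3, 1, 5, 2, 4, 5]})

# Same mapping as create_sentiment, written as a dictionary lookup
rating_to_sentiment = {1: -1, 2: -1, 3: 0, 4: 1, 5: 1}
toy["Sentiment"] = toy["Rating"].map(rating_to_sentiment)

# Share of each sentiment class in the data
class_share = toy["Sentiment"].value_counts(normalize=True)
print(class_share)
```

On the real data frame, the equivalent call would be df['Sentiment'].value_counts(normalize=True).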

First, however, we need to preprocess the “Review” column in order to remove punctuation, special characters, and digits. The code looks like this:

import re # Python's built-in regular expression module

def clean_data(review):
    
    no_punc = re.sub(r'[^\w\s]', '', review) # keep only word characters and whitespace
    no_digits = ''.join([i for i in no_punc if not i.isdigit()]) # drop digits
    
    return no_digits

In this way, we will eliminate unnecessary noise and only retain information that is valuable to the final sentiment analysis.
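
To see what the function does before applying it to the whole column, you can try it on a single string. The sample text below is invented for illustration. Note that removing a digit leaves its surrounding spaces behind, so the optional last step (my own addition, not part of the original function) collapses repeated whitespace:

```python
import re

def clean_data(review):
    no_punc = re.sub(r'[^\w\s]', '', review)  # strip punctuation
    no_digits = ''.join(ch for ch in no_punc if not ch.isdigit())  # strip digits
    return no_digits

sample = "Nice hotel, 4 stars!!"  # hypothetical review text
cleaned = clean_data(sample)
print(repr(cleaned))  # 'Nice hotel  stars' (two spaces where the digit was)

# Optional extra step: collapse runs of whitespace into single spaces
collapsed = " ".join(cleaned.split())
print(repr(collapsed))  # 'Nice hotel stars'
```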

Shall we take a look at the first review in the data frame to see what kind of punctuation we’d be removing?

df['Review'][0]

Notice that it contains commas. The preprocessing function will deal with those. Apply it to this column and let’s look at the review again:

df['Review'] = df['Review'].apply(clean_data)
df['Review'][0]

All the commas are gone and we are left with clean text data.

Step 4: TF-IDF Transformation

Now, we need to convert this text data into a numeric representation so that it can be ingested into the ML model. We will do this with the help of scikit-learn’s TfidfVectorizer class.

TF-IDF stands for “term frequency-inverse document frequency”, a statistical measure that tells us how relevant a word is to a document in a collection. In simpler terms, it converts each document into a vector of numbers in which every word in the vocabulary has its own weight.

TF-IDF is calculated based on 2 metrics:

Term frequency
Inverse document frequency

Let’s look at each individually.

Term Frequency

It’s really what it says on the tin: how many times a term is repeated in a single document. Words that appear more frequently in a piece of text are considered to have a lot of importance. For example, in this sentiment analysis tutorial, we repeat the words “sentiment” and “analysis” multiple times, so term frequency alone would rate them as highly relevant.
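
As a quick illustration, here is a minimal sketch that counts terms in one made-up review. It uses one common definition of term frequency (occurrences divided by document length). This is an assumption for the example, and libraries may use raw counts or other variants instead:

```python
from collections import Counter

doc = "great hotel great location friendly staff"  # hypothetical review
tokens = doc.split()

# Raw count of each term in the document
counts = Counter(tokens)

# Relative term frequency: occurrences divided by document length
tf = {term: n / len(tokens) for term, n in counts.items()}
print(tf["great"], tf["hotel"])
```

Here “great” appears twice out of six tokens, so it gets double the term frequency of every other word.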

Inverse Document Frequency

Rather than focusing on a single document, inverse document frequency looks at how many documents in the collection contain a given word. Unlike the previous metric, here the higher the frequency, the lower the relevance. This helps the algorithm down-weight naturally occurring words such as “a”, “the”, “and”, etc., as they appear frequently across all documents in a corpus.
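
To make this concrete, here is a small hand-rolled sketch on three invented documents. It uses the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The idf helper is a hypothetical name for this example only:

```python
import math

# Hypothetical mini-corpus of three reviews
docs = [
    "the hotel was great",
    "the room was dirty",
    "the staff was friendly",
]
n_docs = len(docs)

def idf(term):
    # Document frequency: number of documents that contain the term
    df = sum(term in d.split() for d in docs)
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df)) + 1

print(idf("the"))    # appears in every document, so it gets the lowest weight
print(idf("dirty"))  # appears in one document, so it gets a higher weight
```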

Now that you understand how TF-IDF works, let’s use this algorithm to vectorize our data:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None, 
                        lowercase=False,
                        preprocessor=None)

X = tfidf.fit_transform(df['Review'])

We have successfully transformed the reviews in our dataset into a numeric matrix, with one TF-IDF vector per review, that can be fed into a machine learning algorithm!
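
If you are curious what the vectorizer actually produces, here is a minimal sketch on a made-up three-review corpus (the corpus and the names vec and X_toy are assumptions for this example). The result is a sparse matrix with one row per document and one column per vocabulary word:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
corpus = [
    "nice hotel great location",
    "dirty room rude staff",
    "great staff nice room",
]

vec = TfidfVectorizer(lowercase=False)
X_toy = vec.fit_transform(corpus)

print(X_toy.shape)              # (number of documents, vocabulary size)
print(sorted(vec.vocabulary_))  # the words that became columns
print(X_toy.toarray().round(2)) # dense view, only sensible for tiny examples
```

On the full dataset the matrix has one row per review and many thousands of columns, which is why scikit-learn keeps it in sparse form.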

Step 5: Building and Evaluating the Machine Learning Model

We can now train our algorithm on the review data to classify its sentiment into 3 categories:

Positive
Negative
Neutral

First, let’s perform a train-test split:

from sklearn.model_selection import train_test_split
y = df['Sentiment'] # target variable
X_train, X_test, y_train, y_test = train_test_split(X,y)
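
One thing to keep in mind: train_test_split shuffles the rows randomly, so every run gives a slightly different split. If you want reproducible results, a possible variation is to fix random_state and stratify on the target so each class keeps its share in both sets. The sketch below uses made-up arrays, and the values 0.25 and 42 are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and an imbalanced 3-class target
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 12 + [0] * 4 + [-1] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,     # hold out a quarter of the rows
    stratify=y_toy,     # keep class proportions similar in both sets
    random_state=42,    # fixed seed so the split is repeatable
)

print(len(y_tr), len(y_te))
```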

Now, fit a logistic regression classifier on the training dataset and use it to make predictions on the test data:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')
lr.fit(X_train,y_train) # fit the model
preds = lr.predict(X_test) # make predictions

Finally, evaluate the performance:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, preds) # 0.86

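Our model has an accuracy of approximately 0.86, which is a solid result for such a simple baseline. Because the train-test split is random, your exact figure may differ slightly from run to run.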
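
A single accuracy number can hide how the model does on each class, especially if one sentiment dominates the data. As a rough sketch, a confusion matrix and a per-class report make that visible. The labels below are invented for illustration and are not results from the model above:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 1, 0, 0, -1, -1]
y_pred = [1, 1, 1, 1, 1, 1, -1, 1]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[-1, 0, 1])
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_true, y_pred, labels=[-1, 0, 1], zero_division=0))
```

In this toy case the neutral class is never predicted at all, which the overall accuracy figure would not reveal. With the tutorial’s variables, the calls would be confusion_matrix(y_test, preds) and classification_report(y_test, preds).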

And that concludes our tutorial! For a better understanding of the concept, here is the complete sentiment analysis Python code I’ve used:

import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
df = pd.read_csv('tripadvisor_hotel_reviews.csv')

def create_sentiment(rating):
    
    res = 0 # neutral sentiment
    
    if rating==1 or rating==2:
        res = -1 # negative sentiment
    elif rating==4 or rating==5:
        res = 1 # positive sentiment
        
    return res

df['Sentiment'] = df['Rating'].apply(create_sentiment)

def clean_data(review):
    
    no_punc = re.sub(r'[^\w\s]', '', review) # remove punctuation
    no_digits = ''.join([i for i in no_punc if not i.isdigit()]) # remove digits
    
    return no_digits

df['Review'] = df['Review'].apply(clean_data)

# Convert the reviews into TF-IDF vectors
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
X = tfidf.fit_transform(df['Review'])
y = df['Sentiment'] # target variable

# Split, train, predict, and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y)
lr = LogisticRegression(solver='liblinear')
lr.fit(X_train, y_train)
preds = lr.predict(X_test)
accuracy_score(y_test, preds)
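
If you want to score brand-new reviews later, the new text has to go through the same fitted vectorizer before it reaches the classifier. One way to keep the two steps together is a scikit-learn pipeline. The sketch below is a variation rather than the code used in this tutorial: the training reviews are made up, and it uses scikit-learn’s default solver instead of liblinear.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training reviews with their sentiment labels
train_reviews = [
    "great hotel friendly staff",
    "lovely room great location",
    "clean room nice staff",
    "average hotel okay location",
    "okay room average stay",
    "dirty room rude staff",
    "terrible hotel noisy room",
    "rude staff dirty bathroom",
]
train_labels = [1, 1, 1, 0, 0, -1, -1, -1]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(
    TfidfVectorizer(lowercase=False),
    LogisticRegression(max_iter=1000),
)
model.fit(train_reviews, train_labels)

# New text is vectorized with the fitted vocabulary, then classified
new_pred = model.predict(["nice hotel great staff"])
print(new_pred)
```

With so little training data the prediction itself means very little. The point is only that predict accepts raw text because the pipeline handles the TF-IDF step internally.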

Q&A

What Is Sentiment Analysis?

In a nutshell, sentiment analysis analyzes the underlying meaning of data, typically in text form. Rooted in natural language processing technology, this technique identifies and extracts insights from subjective information such as user reviews, social media posts, survey answers, etc. Further, it classifies such data into categories depending on the emotion they convey: positive, negative, or neutral. Sentiment analysis finds purpose across customer analytics, marketing, and other business-related fields. Organizations use it to locate pain points in their brand reputation and customer service, as well as ways to solve them, thus accelerating decision-making. In addition, monitoring and analyzing user feedback opens up opportunities for product improvement and even provides ideas for new offerings that directly target a gap in the market.

Why Is Sentiment Analysis Important?

Sentiment analysis is important for businesses that want to improve decision-making, address pain points in the market, and monitor user reactions in real time to gauge customers’ emotional attachment to a brand or product. It is also highly effective at scale, for example, when analyzing the sentiment of thousands of tweets in order to understand how people are receiving a new product launch. This allows for a faster response to critical issues and helps identify problems that might have slipped through the cracks during the testing stage. All of this, in turn, helps marketers better understand their target audience and think of new ways to successfully reach them. As a result, brand loyalty increases and customer churn rates drop significantly as existing users are more satisfied with the service. Improved customer experience can even grow the company’s audience by bringing in new clients.

Sentiment Analysis in Python: Next Steps

If you managed to follow along with this tutorial, congratulations! You have successfully completed an entire sentiment analysis project in Python.

The techniques explained here are similar to the tasks I work on at my data science job. Real-world sentiment analyses are often tied to a business use case, like the South African bank example, so it is a good idea to learn not just the fundamentals, but also how to apply data science in marketing. This domain expertise will set you apart from other candidates looking to land a data science job.

If you’d like to expand your knowledge of data science for marketing, sign up for the 365 Data Science platform, which offers a range of courses for technical and professional development in data science. A great place to start learning how to apply technical concepts to business problems is their Introduction to Business Analytics course.