Evolution of NLP — Part 1 — Bag of Words, TF-IDF

Getting started with NLP Basics for Sentiment Classification

Kanishk Jain
Jul 23, 2020


This is the first blog in a series of posts where I talk about how modeling techniques for Natural Language Processing tasks have changed over the past few years. Starting right from the basics with Bag of Words, we will work our way up to the current State of The Art (SOTA) — Transformers! I’ll give a brief overview of each algorithm, and then we dive straight into coding! Over this series of posts, we’ll see just how much NLP has changed over the years, and how you can get started or catch up to the latest techniques quickly. I hope you enjoy this journey ;)

For this post, we’ll focus on using simple Bag-of-Words and TF-IDF based models, coupled with ensemble decision trees, to get the best accuracy score we can!

You can also find this tutorial on the Kaggle Kernel — Evolution of NLP — Part 1 — Bag of Words, TF-IDF with complete code!

Understanding the Data

I’ve used the JantaHack NLP Hackathon dataset here. This dataset consists of Steam user reviews for different kinds of games, collected during 2015–2019. The goal is to predict, based on the user review, whether the user recommends or doesn’t recommend the game. So, our task is essentially Sentiment Classification, and it will be our task for this whole series!

The combined training and test data contain 25,000+ reviews. We’ll only focus on using the user_review column for our predictions.

Complete Dataset for our analysis — Image from Author

Let’s dive right into it! For this and the rest of the series, I’m only going to use the reviews and no other columns. In practice, it’s good to perform EDA to get a better sense of your data.

Using Bag of Words, N-Grams, TF-IDF

The approach below covers some of the very first tools that anyone experimenting with NLP starts with. Over time, lots of libraries, like SpaCy and NLTK, have popped up and simplified this approach tremendously. There are also libraries like TextBlob, which stand on the shoulders of the mighty NLTK and provide a simpler, faster interface for many NLTK operations, and then some more.

I’ll try to give a quick overview of the methods and libraries; however, I would recommend visiting each library’s website (linked below) to understand its complete set of capabilities.

Step 1. Pre-Processing

Cleaning up the user reviews!

  1. The decontracted function converts short forms of general phrases (contractions) into their longer versions.
  2. The lemmatize_with_postag function reduces words to their base form. I’ve used the TextBlob library’s implementation here, which is built on top of NLTK. Feel free to try out other implementations. The main idea is to reduce additional vocabulary — which is helpful computationally, and also helps reduce over-fitting to certain key-words. There are other methods like Stemming, but it is generally inferior to Lemmatization. A minimal sketch of both helpers appears right after this list.
  3. Further cleaning to remove links, punctuation, etc.
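
Since the full code lives in the Kaggle kernel, here is a rough sketch of what these two helpers might look like. The regex rules and the POS-tag mapping below are illustrative assumptions, and the exact implementation in the kernel may differ.

import re
from textblob import TextBlob

def decontracted(phrase):
    # Expand a few common English contractions (illustrative list, not exhaustive)
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase

def lemmatize_with_postag(sentence):
    # Map Penn Treebank tags (NN, VB, JJ, RB, ...) to the WordNet tags TextBlob's lemmatizer expects
    tag_dict = {"J": "a", "N": "n", "V": "v", "R": "r"}
    words_and_tags = [(word, tag_dict.get(pos[0], "n")) for word, pos in TextBlob(sentence).tags]
    return " ".join(word.lemmatize(tag) for word, tag in words_and_tags)

print(lemmatize_with_postag(decontracted("I wasn't enjoying the games I was playing")))
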
Before and After Pre-Processing — Image from Author

Step 2. Structuring the Data for ML

Using Count Vectorizer (Bag of Words)

Bag of Words, put simply, counts how many times a certain word appears in a review, irrespective of its order. To do this, we first create a dictionary (or vocabulary) of all the words (or tokens) present in the reviews. Each token from the vocabulary is then converted to a column, with row[j] indicating — “How many times is the token “the” present in review[j]?”

I’ve used the scikit-learn implementation below, but other libraries can handle this just as well.
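
Here is a minimal sketch of this step. The clean_reviews list below is a toy stand-in for the pre-processed review strings produced in Step 1.

from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the pre-processed review strings from Step 1
clean_reviews = ["the game be great fun", "not a great game do not buy it"]

bow = CountVectorizer()
X_bow = bow.fit_transform(clean_reviews)   # sparse matrix: rows = reviews, columns = tokens

print(bow.get_feature_names_out())         # the learned vocabulary
print(X_bow.toarray())                     # per-review token counts
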

Final Dataset after using Bag-of-Words — Image from Author

The column names represent the tokens and the rows represent individual sentences. Each cell holds the count of that token in the sentence, and 0 if the token doesn’t appear.

Using N-grams

However, sometimes it’s the combination of words that is important, and not just the words themselves. Example — “not good” and “good” would produce the same count for the “good” token. Hence it becomes important to find the phrases in our corpus that might affect the overall meaning of a review. This is what we call N-grams.
The catch is that the cost of finding these grows polynomially as the vocabulary size V increases: for bigrams alone we are potentially looking at O(V*V) combinations of phrases in the worst case.

In our implementation, we limit ourselves to 2-grams and 3-grams. We then select the top 3,000 features, ranked by how frequently they appear in the data.
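
With scikit-learn this is only a change of parameters. A sketch, reusing the clean_reviews stand-in from the previous snippet; ngram_range and max_features mirror the choices described above.

from sklearn.feature_extraction.text import CountVectorizer

# Keep only 2- and 3-grams, retaining the 3000 most frequent features across the corpus
ngram_bow = CountVectorizer(ngram_range=(2, 3), max_features=3000)
X_ngram_bow = ngram_bow.fit_transform(clean_reviews)
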

Final Dataset after using N-Grams Bag-of-Words — Image from Author

Same idea as earlier; however, this time we look for specific n-gram sequences in the sentence.

Using TF-IDF (Term Frequency — Inverse Document Frequency)

Now, if you are wondering what term frequency (TF) is: it is the relative frequency of a word in a document, given as (number of times the term appears in the document / total number of terms in the document). Inverse Document Frequency (IDF) measures how rare a term is across documents, given as log(number of documents / number of documents containing the term). The overall importance of each word to the document in which it appears is then TF * IDF.
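
To make the formula concrete with a toy calculation (the numbers are made up for illustration): suppose “great” appears 2 times in a 10-word review, and appears somewhere in 1,000 out of 25,000 reviews. Then TF = 2/10 = 0.2, IDF = log(25,000/1,000) ≈ 3.22, and TF-IDF ≈ 0.64. Note that scikit-learn’s TfidfVectorizer uses a smoothed variant of this formula and normalizes each row by default, so its values won’t match a hand calculation exactly.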

This will give you a matrix where each column represents a word in the vocabulary (all the words that appear in at least one document) and each row represents a review, as before. The IDF part reduces the weight of words that occur frequently across reviews, so that such common words contribute less to the overall rating of a review.

Fortunately, scikit-learn gives you a built-in TfidfVectorizer class that produces the TF-IDF matrix in a couple of lines.
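
A minimal sketch of both the word-level and the n-gram TF-IDF features, again reusing the clean_reviews stand-in from the earlier snippets; the n-gram settings mirror the CountVectorizer setup above.

from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF features
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(clean_reviews)   # rows = reviews, columns = tokens, values = TF-IDF scores

# N-gram TF-IDF features: 2- and 3-grams, top 3000 by frequency
tfidf_ngram = TfidfVectorizer(ngram_range=(2, 3), max_features=3000)
X_tfidf_ngram = tfidf_ngram.fit_transform(clean_reviews)
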

Final Dataset after using TF-IDF — Image from Author

Once again, the rows correspond to sentences and the columns to individual tokens. If a token is present in a sentence, a TF-IDF score is calculated as discussed above and filled into the respective column; otherwise the value is 0.

Similar to CountVectorizer, we use n-grams here as well.

Final Dataset after using N-grams TF-IDF — Image from Author

Step 3. Modeling

Let’s first try the dataset with only TF-IDF features.

Training using LGBM

I’ve used LightGBM because of its high accuracy and speed compared to other tree-based ensemble models across a lot of tasks. Feel free to experiment here. One good alternative could be XGBoost!
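
The original training code is shown in the images below; here is a minimal sketch of the same idea using LightGBM’s scikit-learn API. X_tfidf is the TF-IDF feature matrix from Step 2, y stands for the recommend / don’t-recommend labels, and the hyperparameters are illustrative assumptions, not necessarily those used in the kernel.

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out part of the labelled data for validation (y = recommend / don't-recommend labels)
X_train, X_valid, y_train, y_valid = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Illustrative hyperparameters, not tuned values from the kernel
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)

preds = model.predict(X_valid)
print("Validation accuracy:", accuracy_score(y_valid, preds))
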

Initializing and Training LightGBM — Image from Author
Results of LightGBM over TF-IDF Dataset — Image from Author

With the TF-IDF approach — combining both word tokens and n-grams, we get a score of around 83.3%.

Now, let’s check out the results with the CountVectorizer dataset.

Results of LightGBM over Bag-of-Words Dataset — Image from Author

With the Bag of Words approach — combining both word tokens and n-grams, we get a score of around 82.7%.

Accuracy — 83.3%

So, this is our baseline score! Based on Bag of Words and TF-IDF, this is about the best we can get. Let’s try to improve on it further with more powerful techniques! I hope this helped you understand TF-IDF and CountVectorizer better, along with their implementation in Python.

There are excellent online resources, especially at DataCamp, where interactive courses cover the same techniques we discussed in this notebook. Feel free to check them out for a deeper understanding! See you in Part 2 of this series — Recurrent Neural Networks.


Kanishk Jain

Business Analyst @ P&G | IIT-Bombay alumnus | Machine Learning and AI enthusiast. I enjoy learning and writing about cutting-edge innovations in AI