Text Preprocessing with NLTK (2024)

A detailed walkthrough of preprocessing a sample corpus with the NLTK library using stemming and lemmatization.

What is Natural Language Processing?

Natural Language Processing, or NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a way that is useful. To this end, many different models, libraries, and methods have been used to train machines to process text, understand it, make predictions based on it, and even generate new text. The first step in training a model is to obtain and preprocess the data. In this article, I will go through some of the most common steps to follow with almost any dataset before you can pass it as input to a model.

What is NLTK?

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language. It includes implementations of the most common tasks, such as tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition, some of which we will be making use of in this article.

Initial Steps

First we import the required NLTK toolkit.

# Importing modules
import nltk

Now we import the required dataset, which can be stored and accessed locally or online through a web URL. We can also make use of one of the corpus datasets provided by NLTK itself. In this article, we will be using a sample corpus dataset provided by NLTK.

# Sample corpus.
nltk.download('inaugural')  # one-time download of the corpus data
from nltk.corpus import inaugural
corpus = inaugural.raw('1789-Washington.txt')
print(corpus)

We print the corpus so that we can take a look at the text, study it, and make note of special characters and other changes that might need to be made before training a model based on it.

Preliminary Statistics

We now look at how to extract some statistics from the corpus, such as the number of sentences and tokens, using tokenization. These statistics can later be used to set some parameters while training a model. Tokenization is the process of dividing a large body of text into smaller parts called tokens. Many NLP tasks depend on recognizing patterns in the text, and tokens are the units in which those patterns are found. NLTK's tokenize module provides, among others, two functions that we will use:

word_tokenize, which splits text into words and punctuation marks
sent_tokenize, which splits text into sentences

from nltk.tokenize import word_tokenize, sent_tokenize

# One-time downloads: tokenizer models and the stopword list.
# (Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.)
nltk.download('punkt')
nltk.download('stopwords')

sents = sent_tokenize(corpus)
print("The number of sentences is", len(sents))

words = word_tokenize(corpus)
print("The number of tokens is", len(words))

average_tokens = round(len(words) / len(sents))
print("The average number of tokens per sentence is", average_tokens)

unique_tokens = set(words)
print("The number of unique tokens is", len(unique_tokens))

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
final_tokens = []
for each in words:
    # The stopword list is lowercase, so compare in lowercase.
    if each.lower() not in stop_words:
        final_tokens.append(each)
print("The number of tokens after removing stopwords is", len(final_tokens))
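
To make the idea of tokenization concrete, the sketch below splits a short made-up text using two regular expressions from the standard library. It is only a rough illustration: the helper names simple_sent_tokenize and simple_word_tokenize are hypothetical, and NLTK's own tokenizers handle abbreviations, contractions, and other cases that these patterns do not.

```python
import re

def simple_sent_tokenize(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Naive word split: runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# A made-up sample text (not taken from the inaugural corpus).
text = "Fellow citizens, I thank you. The task is great!"

sentences = simple_sent_tokenize(text)
tokens = [tok for s in sentences for tok in simple_word_tokenize(s)]
average = round(len(tokens) / len(sentences))

print(sentences)
print(tokens)
print("Average tokens per sentence:", average)
```

The same counts computed above with NLTK (sentences, tokens, average tokens per sentence) fall out of these two lists in the same way.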

Now that we have some numerical descriptors of the dataset, we can take a look at stemming and lemmatization.

Stemming and Lemmatization with NLTK

What is Stemming?
Stemming is a kind of normalization for words. It reduces the inflected or derived forms of a word to a common base form, the stem, typically by stripping suffixes, so that variants such as ‘grows’ and ‘growing’ are treated as the same term. Words which have the same meaning but vary according to the context or sentence are normalized in this way. The stem is not necessarily a dictionary word, but it serves as a stand-in for the root of all the variations.

NLTK provides many inbuilt stemmers such as the Porter Stemmer, Snowball Stemmer and Lancaster Stemmer. We will look at the differences between the Porter Stemmer and the Snowball Stemmer.

from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer  # Snowball Stemmer has language as a parameter.

words = ["grows", "leaves", "fairly", "cats", "trouble", "misunderstanding", "friendships", "easily", "rational", "relational"]

# Create instances of both stemmers, and stem the words using them.
stemmer_ps = PorterStemmer()  # an instance of Porter Stemmer
stemmed_words_ps = [stemmer_ps.stem(word) for word in words]
print("Porter stemmed words: ", stemmed_words_ps)

stemmer_ss = SnowballStemmer("english")  # an instance of Snowball Stemmer
stemmed_words_ss = [stemmer_ss.stem(word) for word in words]
print("Snowball stemmed words: ", stemmed_words_ss)

Once we have created instances of the stemmers, we write a function which takes a sentence of the corpus as input and returns its stemmed version.

# A function which takes a sentence/corpus and gets its stemmed version.
def stemSentence(sentence):
    # Tokenize first: the stemmer works on single words, so passing it the
    # whole sentence would treat the sentence as one long word.
    token_words = word_tokenize(sentence)
    stem_sentence = [stemmer_ps.stem(word) for word in token_words]
    # Join the stemmed tokens with spaces to form the sentence again.
    return " ".join(stem_sentence)

stemmed_sentence = stemSentence("The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given.")
print("The Porter stemmed sentence is: ", stemmed_sentence)

Comparing the two word lists printed by the stemmers, we observe that their output is nearly the same, except for some adverbs such as ‘fairly’, where the Snowball Stemmer gives a result closer to the root word.

Some differences between the Porter Stemmer and Snowball Stemmer are:

Snowball Stemmer is more aggressive than Porter Stemmer.
Some issues in Porter Stemmer are fixed in Snowball Stemmer.
Words like ‘fairly’ and ‘sportingly’ are stemmed to ‘fair’ and ‘sport’ in the Snowball Stemmer but are stemmed to ‘fairli’ and ‘sportingli’ with the Porter Stemmer.

As a general rule of thumb, the Snowball Stemmer stems words to a more accurate stem.
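
The adverb difference can be checked directly. The short sketch below runs both stemmers on the two words from the list above; it assumes only that NLTK is installed (neither stemmer needs any downloaded data), and the variable names porter and snowball are just illustrative.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Compare the two stemmers side by side on adverbs ending in "-ly".
results = {}
for word in ["fairly", "sportingly"]:
    results[word] = (porter.stem(word), snowball.stem(word))
    print(word, "-> Porter:", results[word][0], "| Snowball:", results[word][1])
```

This should reproduce the stems quoted above: the Porter Stemmer only rewrites the final ‘y’ to ‘i’, while the Snowball Stemmer also removes the ‘-ly’ ending.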

What is Lemmatization?
Lemmatization is the algorithmic process of finding the lemma of a word depending on its meaning. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It returns the base or dictionary form of a word, which is known as the lemma.

The NLTK Lemmatization method is based on WordNet’s built-in morphy function.

We write some code to import the WordNet Lemmatizer.

from nltk.stem import WordNetLemmatizer
# Lemmatization is based on WordNet's built-in morphy function, so the WordNet data is required.
# (Some NLTK versions may also ask for nltk.download('omw-1.4').)
nltk.download('wordnet')

Now that we have downloaded WordNet, we can go ahead with lemmatization. Lemmatization can be done with or without a POS tag. A part-of-speech (POS) tag labels a word as a noun, verb, adjective, and so on, which lets the lemmatizer pick the lemma that fits the word's role in the text. Without a tag, the WordNet Lemmatizer treats every word as a noun. For example, the word ‘leaves’ without a POS tag would get lemmatized to the word ‘leaf’, but with a verb tag, its lemma would become ‘leave’.

words = ["grows", "leaves", "fairly", "cats", "trouble", "running", "friendships", "easily", "was", "relational", "has"]

lemmatizer = WordNetLemmatizer()  # an instance of Word Net Lemmatizer

lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("The lemmatized words: ", lemmatized_words)  # prints the lemmatized words

lemmatized_words_pos = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("The lemmatized words using a POS tag: ", lemmatized_words_pos)  # prints POS tagged lemmatized words

Now that we have created an instance of the lemmatizer, we write a function which takes a sentence of the corpus as input and returns its lemmatized version.

# A function which takes a sentence/corpus and gets its lemmatized version.
def lemmatizeSentence(sentence):
    # Tokenize first: the lemmatizer looks up single words, so passing it the
    # whole sentence would return the sentence as is.
    token_words = word_tokenize(sentence)
    lemma_sentence = [lemmatizer.lemmatize(word) for word in token_words]
    # Join the lemmatized tokens with spaces to form the sentence again.
    return " ".join(lemma_sentence)

lemma_sentence = lemmatizeSentence("The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given.")
print("The lemmatized sentence is: ", lemma_sentence)

In order to get results more in accordance with the context of the dataset, POS tags can be used with the lemmatizer.
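
One common way to do this is to tag each token and translate the tag into the single-letter form accepted by the pos argument of lemmatizer.lemmatize. The sketch below shows only that translation step, under some assumptions: the helper name penn_to_wordnet is hypothetical, the tags are Penn Treebank style (the kind nltk.pos_tag produces for English), and the tagged pairs are written by hand so that the example runs without any downloaded data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

# Hand-written (word, tag) pairs in the shape that a POS tagger returns.
tagged = [("The", "DT"), ("leaves", "NNS"), ("are", "VBP"),
          ("falling", "VBG"), ("quickly", "RB")]

for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
```

Each resulting letter could then be passed along with its word, as in lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)), instead of using one fixed tag for every word.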

How are Stemming and Lemmatization Different?

Stemming reduces word-forms to stems, which need not be real words, in order to shrink the vocabulary, whereas lemmatization reduces word-forms to linguistically valid lemmas. For example, the stem of the word ‘happy’ is ‘happi’, but its lemma is ‘happy’, which is linguistically valid.
Lemmatization is usually more sophisticated and requires some sort of lexicon. Stemming, on the other hand, can be achieved with simple rule-based approaches.
A stemmer operates on a single word without knowledge of the context, and cannot distinguish between words whose meaning depends on the part of speech. For example, the word ‘better’ has ‘good’ as its lemma. This link is missed by stemming: the Porter Stemmer, for instance, leaves ‘better’ unchanged, while a more aggressive stemmer may cut it down to something like ‘bet’.
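
The contrast between the two approaches can be sketched in a few lines of plain Python. Everything here is a toy built for illustration and is not how NLTK implements either technique: SUFFIX_RULES, TOY_LEXICON, toy_stem, and toy_lemmatize are hypothetical names, the rules are far cruder than Porter's, and the lexicon holds only four hand-picked entries.

```python
# Toy suffix rules, tried in order: strip the first suffix that matches.
SUFFIX_RULES = ["ing", "ly", "s"]

# Toy lexicon mapping word-forms to dictionary forms (hand-picked entries).
TOY_LEXICON = {"leaves": "leaf", "running": "run", "better": "good", "was": "be"}

def toy_stem(word):
    """Rule-based: strip a known suffix if at least three letters remain."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Lexicon-based: look the word up, fall back to the word itself."""
    return TOY_LEXICON.get(word, word)

for word in ["leaves", "running", "better", "was"]:
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The stemmer needs no data but can produce non-words such as ‘runn’, and it has no way to connect ‘better’ with ‘good’; the lemmatizer gets both right, but only because the lexicon already contains them.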

Conclusion

I hope this article was a good introduction to text preprocessing using stemming and lemmatization, and the associated differences between the two. Apart from these, there are many other tasks to be done before the corpus can be fed into a model to train, such as removal of newlines, special characters, conversion to lower case, etc. These will be covered in future articles. The full code used in this article can be accessed here.