Text Cleaning: The Secret Weapon for Smarter NLP Models

Susovan Dey
8 min read · Jul 6, 2023


Natural Language Processing (NLP) has become a critical component across many industries. State-of-the-art NLP algorithms are designed to understand human language and provide insights that help us make informed decisions. However, NLP algorithms are only as good as the quality of the text they are given. Dirty or noisy text data can significantly impact the performance of an NLP model. This is where text cleaning, also known as text preprocessing, comes into play. In this article, we’ll explore the different text-cleaning techniques that can help you build smarter NLP models.

A brief overview of the contents:

There are many text-cleaning techniques. Here we will discuss the most important ones:

  1. Convert to lowercase
  2. Remove numbers
  3. Remove punctuation
  4. Expand contractions
  5. Spelling correction
  6. Stopwords removal
  7. Stemming
  8. Lemmatization
  9. Remove extra whitespace

1. Convert to lowercase:

The same word can appear in different forms in different parts of the text. Converting text to lowercase normalizes these variants into one common format. If this normalization is not done, “running” and “Running” will be treated as two different words at the text representation stage (TF-IDF, bag of words, etc.).

texts = [
"THIS IS THE FIRST TEXT.",
"Here is the SECOND text.",
"And this is the THIRD text."
]

# Convert each text in the list to lowercase
lowercase_texts = [text.lower() for text in texts]

print(lowercase_texts)
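To see the effect, here is a minimal sketch (assuming scikit-learn is available) of how a bag-of-words representation treats the two casings as separate features when lowercasing is disabled:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Running is fun", "I love running"]

# With lowercasing disabled, "Running" and "running" become two separate features
vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(docs)
print(sorted(vectorizer.vocabulary_))  # ['Running', 'fun', 'is', 'love', 'running']

# With lowercasing enabled (the default), they collapse into one feature
vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit(docs)
print(sorted(vectorizer.vocabulary_))  # ['fun', 'is', 'love', 'running']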


2. Remove numbers:

Numbers can be removed from the text if the use case doesn't require them. However, numbers may carry meaning in text that contains product ratings or financial data, so consider the task before removing them. We will use a regular expression (regex) to remove the numbers.

Note: another workaround is to convert the digits into words instead of removing them (a sketch of this follows the snippet below).

import re

# Original sentences with numbers
sentences = [
"I have 3 cats and 2 dogs.",
"There are 10 benches in the classroom.",
"The recipe calls for 2 cups of flour and 1 teaspoon of salt."
]

# Remove numbers from the sentences
cleaned_sentences = [re.sub(r'\d+', '', sentence) for sentence in sentences]

print(cleaned_sentences)
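As noted above, the alternative is to spell the digits out rather than drop them. A minimal sketch, assuming the num2words package (pip install num2words) is installed:

import re
from num2words import num2words

sentence = "I have 3 cats and 2 dogs."

# Replace every digit sequence with its spelled-out form, e.g. "3" -> "three"
spelled_out = re.sub(r'\d+', lambda match: num2words(int(match.group())), sentence)

print(spelled_out)  # I have three cats and two dogs.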

3. Remove punctuation:

Punctuation marks are special characters that are not alphanumeric. The decision to remove or keep punctuation should be based on the specific requirements of your NLP task and the nature of your dataset.

Punctuation list (Python's string.punctuation): !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

For example, if we are trying to check the sentiment of a sentence, we usually don’t need punctuation: “I love this product!” -> “I love this product”.

On the other hand, for NER (Named Entity Recognition), punctuation can help preserve the structure and boundaries of entities. For example: “The event took place at Madison Square Garden, New York.” The comma after ‘Garden’ helps identify the location as ‘Madison Square Garden’.

import string

# Original sentences with punctuations
sentences = [
"I love apples!!!","What's your name?","Can you believe it?! This is amazing!"
]

# Remove punctuations from the sentences
cleaned_sentences = [sentence.translate(str.maketrans("", "", string.punctuation)) for sentence in sentences]

print(cleaned_sentences)

4. Expand contractions

Contractions are formed by dropping letters and replacing them with an apostrophe. Expand contractions only if your problem statement requires it; otherwise, leave the text as it is.

For example:
“don’t” -> “do not”
“should’ve” -> “should have”

NLP models are unaware of these contractions and treat “don’t” and “do not” as two different tokens, hence they should be normalized.

#mapping file for contractions
contraction_mapping = {
"don't": "do not",
"won't": "will not",
"can't": "cannot",
"should've": "should have",
"it's": "it is"
}

# Original sentences with contractions
sentences = [
"I don't know what to do.",
"We won't be able to make it.",
"Can't wait for the weekend!",
]

# Expand contractions in each sentence
expanded_sentences = []
for sentence in sentences:
    expanded_words = [contraction_mapping.get(word.lower(), word) for word in sentence.split()]
    expanded_sentence = " ".join(expanded_words)
    expanded_sentences.append(expanded_sentence)

print(expanded_sentences)

Apart from creating a dictionary of contractions, as in the code above, we can also use the Python package “contractions”, which covers far more contractions than a hand-built mapping.
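A minimal sketch of that approach, assuming the contractions package (pip install contractions) is installed:

import contractions

sentences = [
    "I don't know what to do.",
    "We won't be able to make it.",
    "Can't wait for the weekend!",
]

# contractions.fix expands the contractions it recognizes in one call
expanded_sentences = [contractions.fix(sentence) for sentence in sentences]

print(expanded_sentences)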

5. Spelling correction

Spelling correction is a crucial preprocessing technique when dealing with tweets, comments, and similar texts due to the presence of misspelled words. It is essential to rectify these errors by replacing them with the correct spelling.

To achieve this, we can leverage the Python library pyspellchecker to identify misspelled words and replace them with their most likely correct counterparts.

#pip install   pyspellchecker
from spellchecker import SpellChecker

# Example sentences with incorrect spellings
sentences = [
"This is an example of incorect spelling.",
"I lv to read peoms.",
"The cat jumpped ovver the fence."
]

# Initialize the spell checker
spell = SpellChecker()

# Function to correct the spellings in a sentence
def correct_spelling(sentence):
    corrected_sentence = []
    words = sentence.split()
    for word in words:
        corrected_word = spell.correction(word)
        # Fall back to the original word if no correction is found
        corrected_sentence.append(corrected_word if corrected_word else word)
    return " ".join(corrected_sentence)

# Correct the spellings in each sentence
corrected_sentences = [correct_spelling(sentence) for sentence in sentences]

# Print the corrected sentences
for corrected_sentence in corrected_sentences:
print(corrected_sentence)

Apart from pyspellchecker, there are a few other libraries such as TextBlob, pyenchant, and autocorrect. In my experience, these libraries do not always produce accurate results.

For example, in the second output above, the misspelling “lv” was not corrected to “love”.

Hence, we should also verify the accuracy of the corrected spellings during the preprocessing step.
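For comparison, here is a minimal sketch using TextBlob's correct() method (assuming textblob is installed); as with pyspellchecker, the output should still be checked by hand:

from textblob import TextBlob

# Example sentences with incorrect spellings
sentences = [
    "This is an example of incorect spelling.",
    "I lv to read peoms.",
    "The cat jumpped ovver the fence."
]

# correct() returns a new TextBlob with the most probable spelling fixes applied
corrected_sentences = [str(TextBlob(sentence).correct()) for sentence in sentences]

for corrected_sentence in corrected_sentences:
    print(corrected_sentence)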

6. Stopwords Removal

Stopwords are common words that usually do not provide valuable information for the model or the problem statement.

Examples of stopwords include “a,” “an,” “the,” and others.

We can usually disregard stopwords for scenarios such as sentiment analysis and text classification. However, for tasks like POS tagging or machine translation, we must assess whether stopwords carry information relevant to the problem statement.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Example sentence
sentence = "This is an example sentence demonstrating stopword removal."

# Download the stopwords corpus if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')

# Get the set of stopwords
stopword_set = set(stopwords.words('english'))

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Remove the stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stopword_set]

# Join the filtered tokens back into a sentence
filtered_sentence = ' '.join(filtered_tokens)

print(filtered_sentence)
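If certain stopwords carry signal for your task (for example, negations like “not” in sentiment analysis), you can keep them by removing them from the stopword set before filtering. A minimal sketch, reusing the NLTK data downloaded above:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Keep negations, since they can flip the sentiment of a sentence
custom_stopwords = set(stopwords.words('english')) - {"not", "no"}

sentence = "The movie was not good."
tokens = word_tokenize(sentence)
filtered_tokens = [word for word in tokens if word.lower() not in custom_stopwords]

print(' '.join(filtered_tokens))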

7. Stemming

Stemming is a technique commonly used in NLP to simplify words by reducing them to their base or root form. By removing suffixes and prefixes, stemming helps in standardizing words and improving text analysis and information retrieval tasks.

For example, by applying stemming to words like “books,” “learning,” “played,” and “running,” we obtain their stemmed forms “book,” “learn,” “play,” and “run,” respectively. Stemming proves to be a valuable tool in preprocessing and enhancing the effectiveness of NLP applications.

from nltk.stem import PorterStemmer

# Create an instance of the Porter stemmer
stemmer = PorterStemmer()

# Example words to be stemmed
words = ['books', 'caring', 'programmes', 'running','consoling']

# Perform stemming on each word
stemmed_words = [stemmer.stem(word) for word in words]

# Print the original words and their stemmed forms
for original, stemmed in zip(words, stemmed_words):
    print(f"Original: {original} \t Stemmed: {stemmed}")

8. Lemmatization

Lemmatization is used to reduce words to their base or dictionary form, known as lemmas. Unlike stemming, lemmatization considers the context and part of speech of a word to ensure meaningful and grammatically correct lemmas.

For example: lemmatizing the word “consoling” gives us “console”, which is an actual word, whereas stemming that word produces “consol”, which has no meaning.

Python libraries, such as NLTK (Natural Language Toolkit) and spaCy, provide easy-to-use functions for lemmatization.

import spacy

# Example words
words = ['books', 'caring', 'programmes', 'running','consoling']

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Process the sentence with spaCy
doc = nlp(" ".join(words))

# Perform lemmatization on each token
lemmatized_words = [token.lemma_ for token in doc]

# Print the lemmatized words
for original, lemmatized in zip(words, lemmatized_words):
    print(f"Original: {original} \t Lemmatized: {lemmatized}")

9. Remove extra whitespace

This should be the last preprocessing step, where extra spaces, newlines, and tabs are removed, as they do not provide any meaningful information. It improves data consistency, enhances tokenization accuracy, enables more effective feature extraction, and reduces noise in the text, leading to better overall results in downstream NLP tasks.

import re

def preprocess_text(text):
    # Replace runs of spaces, newlines, and tabs with a single space
    text = re.sub(r'\s+', ' ', text)

    # Remove any remaining additional spaces
    text = re.sub(r'\s{2,}', ' ', text)

    # Remove leading and trailing spaces
    text = text.strip()

    return text

# Example text
text = """
This is a sample text with extra spaces,
newlines, tabs, and additional spaces.
"""

# Preprocess the text
preprocessed_text = preprocess_text(text)

print(preprocessed_text)
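To tie things together, here is a minimal sketch of a cleaning pipeline that chains several of the steps covered above (lowercasing, number and punctuation removal, whitespace normalization, and stopword removal); the order and the choice of steps should be adapted to your own task:

import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

stopword_set = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                                                 # convert to lowercase
    text = re.sub(r'\d+', '', text)                                     # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))    # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()                            # normalize whitespace
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stopword_set]      # remove stopwords
    return ' '.join(tokens)

print(clean_text("I have   3 cats, and they are RUNNING around!"))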

Congratulations on making it to this section of the blog! 👏 You are one step ahead towards your NLP journey!!

Text cleaning, or text preprocessing, is an essential step in building smarter NLP models. By converting text to lowercase, removing numbers and punctuation, expanding contractions, correcting spelling, removing stopwords, applying stemming or lemmatization, and trimming extra whitespace, we ensure that our NLP models are fed clean and meaningful text, leading to more accurate insights and informed decision-making. Happy text cleaning!

If you found this blog post on text-cleaning techniques helpful, make sure to follow me, Susovan Dey, to stay updated with more informative articles on NLP and data science topics.

Thank you for taking the time to read this article. Clap 👏 if you enjoyed the content.
