Text Cleaning for NLP: A Tutorial

While technology continues to advance, machine learning programs still speak human only as a second language. Effectively communicating with our AI counterparts is key to effective data analysis.

Text cleaning is the process of preparing raw text for NLP (Natural Language Processing) so that machines can understand human language. This guide will underline text cleaning’s importance and go through some basic Python programming tips.

Feel free to jump to the section most useful to you, depending on where you are on your text cleaning journey:

  1. What Is Text Cleaning in Machine Learning?
  2. How to Clean Text With Python
  3. Further Text Cleaning Tips & Methods

What Is Text Cleaning in Machine Learning?

Gathering, sorting, and preparing data is the most important step in the data analysis process – bad data can have cumulative negative effects downstream if it is not corrected.

Data preparation (also known as data wrangling) is the manipulation of data into the form most suitable for machine interpretation, and it is therefore critical to accurate analysis.

The goal of data prep is to produce ‘clean text’ that machines can analyze error-free.

Clean text is human language rearranged into a format that machine models can understand. Text cleaning can be performed using simple Python code that removes stopwords, strips out non-alphanumeric characters such as punctuation and emojis, and reduces complex words to their root form.

Here’s a quick and easy no-code example of what this might look like (Python coding guide further below):

Say you receive a customer service query with a URL and an @mention:

INPUT:

“Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp”

You’d need to perform the two most basic text cleaning techniques on this query:

  1. Normalizing Text
  2. Removing Stopwords

Normalizing Text

Here we remove capitalization that would confuse a computer model:

  • ‘Hey’ becomes ‘hey’.
  • ‘Amazon’ becomes ‘amazon’.
  • ‘PLEASE FIX’ becomes ‘please fix’.
  • ‘@AmazonHelp’ becomes ‘@amazonhelp’.

INPUT:

“Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp”

OUTPUT:

“hey amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix asap! @amazonhelp”

You’ll notice we still have a fair bit of noise. URLs and @mentions are one-off strings, and emojis are special Unicode characters; none of them help a model that works on words, so we further normalize by removing them. The same concept applies to punctuation.

INPUT:

“hey amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix asap! @amazonhelp”

OUTPUT:

“hey amazon my package never arrived please fix asap”

Removing stopwords

We are well on our way but still have some words that contribute little to interpretation. Luckily, stopword lists for English and many other languages exist and can be easily applied. Observe the results.

INPUT:

“hey amazon my package never arrived please fix asap”

OUTPUT:

“amazon package never arrived fix asap”

And just like that we have turned a complex, multi-element text into a series of keywords primed for text analysis.

This is just the tip of the iceberg – let’s explore some further text cleaning techniques and how they can be programmed in Python.

How to Clean Text With Python

While text cleaning, like data preparation as a whole, has greatly benefited from a number of new self-service AI tools that can standardize and clean your data for you, it is still important to understand the underlying code.

Enter the Natural Language Toolkit (NLTK), a Python toolkit specifically designed for raw text to NLP transformation.

With an understanding of a few basic NLTK processes you can easily grasp the foundation of most text cleaning programs, and from there modify and customize them to best serve your purposes!

To get us started, we will work through our previous examples using Python, then graduate to a few more basic techniques.

We will go over the basic Python code to:

  1. Normalize Text
  2. Remove Unicode Characters
  3. Remove Stopwords
  4. Perform Stemming and Lemmatization

1. Normalizing Text

Let’s jump right into it by approaching our previous example with Python code.

Before doing so, let’s go over why we ‘normalize’ text in a little more depth.

Normalizing text is the process of standardizing text so that, through NLP, computer models can better understand human input, with the end goal being to more effectively perform sentiment analysis and other types of analysis on your customer feedback.

Specifically, this first step means standardizing capitalization so that machine models don’t treat capitalized words (Hey) as different from their lowercase counterparts (hey).

This is called case normalization. Let’s look at the code and the effect it has on our base text.

INPUT:

“Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp”

PYTHON CODE:

Here’s our first swing at Python code – we are simply telling our program to turn every capitalization to lowercase:

text = "Hey Amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first PLEASE FIX ASAP! @AmazonHelp"

text = text.lower()

print(text)

OUTPUT:

“hey amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix asap! @amazonhelp”

Success! Now our analytics models can group all uses of Amazon and amazon together, etc. Let’s further normalize our text by eliminating punctuation, URL, and @ noise.
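For text that goes beyond plain English, Python's built-in str.casefold() is a slightly stronger alternative to lower(), because it applies Unicode case folding. The sketch below is only an illustration; the helper name normalize_case is made up for this example.

```python
def normalize_case(text: str) -> str:
    """Return text with Unicode-aware case folding applied."""
    # casefold() behaves like lower() for ASCII text, but it also folds
    # special cases such as the German sharp s ("ß" becomes "ss").
    return text.casefold()


print(normalize_case("PLEASE FIX ASAP!"))  # please fix asap!
print(normalize_case("Straße"))            # strasse
```

For an English-only data set like our example, lower() and casefold() give the same result, so either is fine.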

2. Removing Unicode Characters

Punctuation, emojis, URLs, and @mentions confuse AI models because they are unique signatures: emojis are stored as special Unicode code points (a grinning face is \U0001F600, for example), while @handles and hyperlinks are one-off strings that rarely repeat across texts.

Punctuation also creates noise and impedes NLP understanding because it relates to the tone of the specific sentence, not necessarily the word it is attached to.

Let’s get into what coding the removal of these examples might look like and see how the output might be better for machine analysis.

INPUT:

We have our case-normalized text:

“hey amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix asap! @amazonhelp”

PYTHON CODE:

We tell our program to eliminate the punctuation, URL, and @:

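import re

text = "hey amazon - my package never arrived https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first please fix asap! @amazonhelp"

# Remove @mentions, non-alphanumeric characters, URLs, and a leading "rt" (retweet marker)
text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)
# Collapse the extra spaces left behind by the removed pieces
text = " ".join(text.split())

print(text)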

OUTPUT:

And voilà, we have distilled our example to uniform lowercase words:

“hey amazon my package never arrived please fix asap”
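The single regular expression above packs several jobs into one pattern. As a rough alternative, the same cleanup can be split into named steps, which some readers find easier to adjust. The names strip_noise, URL_RE, MENTION_RE, and NON_ALNUM_RE are invented for this sketch, and it assumes the text has already been lowercased.

```python
import re

URL_RE = re.compile(r"\w+://\S+")          # anything shaped like scheme://...
MENTION_RE = re.compile(r"@\w+")           # @handles
NON_ALNUM_RE = re.compile(r"[^0-9a-z\s]")  # punctuation, emojis, other symbols


def strip_noise(text: str) -> str:
    """Remove URLs, @mentions, and non-alphanumeric characters from lowercase text."""
    # URLs and mentions go first, while their punctuation is still intact.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    # Collapse the gaps left behind by the removed pieces.
    return " ".join(text.split())


sample = (
    "hey amazon - my package never arrived "
    "https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first "
    "please fix asap! @amazonhelp \U0001F600"
)
print(strip_noise(sample))  # hey amazon my package never arrived please fix asap
```

The order matters here: if punctuation were stripped first, the URL would fall apart into ordinary-looking words that the URL pattern could no longer find.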

We’ve made great progress, but still have room to parse and simplify the text further.

3. Removing Stopwords

Here, we finally get to make good use of the NLTK library by importing its ready-made English stop word list.

In English, as in many other languages, stop words are common words that add little meaning to a sentence and can therefore be removed when cleaning text for NLP.

Here’s what this looks like when coding our example.

INPUT:

“hey amazon my package never arrived please fix asap”

PYTHON CODE:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# NLTK's English list, extended with two conversational filler words
stop = set(stopwords.words('english')) | {"hey", "please"}
text = "hey amazon my package never arrived please fix asap"
text = " ".join([word for word in text.split() if word not in stop])

print(text)

OUTPUT:

“amazon package never arrived fix asap”

The progress we’ve made from the initial example is massive (at least for our analytical purposes). We’ve simplified the language down to standardized words that directly relate to the problem.

We have: amazon[service] package[product] never[time] arrived[problem] fix[request] asap[urgency].

Breaking our example down in this manner not only helps us log and archive the customer request more accurately but also helps us get it in front of the right support team (shipping) at the right level of urgency (as the customer said, asap).
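The three steps covered so far can be chained into a single helper. The sketch below is one possible arrangement rather than a definitive pipeline: clean_text and DEMO_STOP_WORDS are hypothetical names, and the tiny stop word set is hand-written so the example runs without any downloads. In practice you would pass in NLTK's English list instead.

```python
import re


def clean_text(text: str, stop_words: set) -> str:
    """Lowercase, strip URLs/@mentions/punctuation, then drop stop words."""
    text = text.lower()                          # 1. case normalization
    text = re.sub(r"\w+://\S+|@\w+", " ", text)  # 2a. URLs and @mentions
    text = re.sub(r"[^0-9a-z\s]", " ", text)     # 2b. punctuation, emojis
    # 3. keep only the words that are not in the stop word set
    return " ".join(word for word in text.split() if word not in stop_words)


# Hand-picked stop words for this demo only (an assumption, not NLTK's list).
DEMO_STOP_WORDS = {"hey", "my", "please", "this", "from"}

raw = (
    "Hey Amazon - my package never arrived "
    "https://www.amazon.com/gp/css/order-history?ref_=nav_orders_first "
    "PLEASE FIX ASAP! @AmazonHelp"
)
print(clean_text(raw, DEMO_STOP_WORDS))  # amazon package never arrived fix asap
```

Taking the stop word set as a parameter makes it easy to swap in a longer list or add words specific to your own domain.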

Let’s move on to one final basic step.

4. Stemming and Lemmatization

Stemming and lemmatization via Python are a bit more involved than the three previous techniques. They reduce words to their stems and to their dictionary forms (lemmas), respectively. By doing so we can better measure intent.

While both techniques are similar, they produce different results so it is important to determine the proper one for the analysis you hope to perform.

Stemming, the simpler of the two, trims word endings so that words group by their root stem. This allows us to recognize that ‘jumping’, ‘jumps’, and ‘jumped’ are all rooted in the same verb (jump) and thus refer to similar problems.

Lemmatization, on the other hand, maps each word to its dictionary form (its lemma), and it needs to know the word's part of speech to do so. NLTK's WordNetLemmatizer treats every word as a noun unless told otherwise.

So, with the default settings, ‘jumps’ is reduced to ‘jump’, while ‘jumped’ and ‘jumping’ are left unchanged because they are not noun forms. The past tense and the continuous form therefore stay in their own groups. (Passing pos="v" to lemmatize() tells it to treat the words as verbs, in which case all three reduce to ‘jump’.)

So, if we are looking to find all instances of a product (say an engine) having any sort of ‘jump’ related response, in order to analyze all responses, good or bad, we would use stemming.

But if we want to break this down further by the type of jump, i.e. whether it was in the past, the present, or a continuous problem, and approach the three instances with distinct types of analysis, then the default lemmatizer, which keeps those forms apart, is the better fit.

Let’s take a gander at the base code for each:

Stemming

INPUT:

“jump”
“jumps”
“jumped”
“jumping”

PYTHON CODE:

from nltk.stem.porter import PorterStemmer

words = ["jump", "jumped", "jumps", "jumping"]
stemmer = PorterStemmer()

# Print each word next to its stem
for word in words:
    print(word + " = " + stemmer.stem(word))

OUTPUT:

jump = jump
jumped = jump
jumps = jump
jumping = jump
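To get a feel for what a stemmer does under the hood, here is a deliberately naive suffix stripper. It is a toy written for this guide, not the Porter algorithm, and the name naive_stem is made up for illustration.

```python
def naive_stem(word: str) -> str:
    """Strip one common English suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["jump", "jumped", "jumps", "jumping", "running"]:
    print(word + " = " + naive_stem(word))
```

The four ‘jump’ words all come out as ‘jump’, but ‘running’ comes out as ‘runn’. Handling cases like doubled consonants is exactly what the extra rules in a real stemmer such as PorterStemmer are for.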

Lemmatization

INPUT:

“jump”
“jumps”
“jumped”
“jumping”

PYTHON CODE:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data used by the lemmatizer

words = ["jump", "jumped", "jumps", "jumping"]
lemmatizer = WordNetLemmatizer()

# With no part of speech given, each word is treated as a noun
for word in words:
    print(word + " = " + lemmatizer.lemmatize(word))

OUTPUT:

jump = jump
jumped = jumped
jumps = jump
jumping = jumping
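The reason ‘jumped’ and ‘jumping’ come back unchanged is that a lemma depends on the word's part of speech, and WordNetLemmatizer.lemmatize() assumes a noun when its optional pos argument is omitted. The toy lookup below mimics that behavior so it can run without downloading WordNet. TOY_LEMMAS and toy_lemmatize are hypothetical names, and the table is a hand-written stand-in, not real WordNet data.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
TOY_LEMMAS = {
    ("jumps", "n"): "jump",
    ("jumps", "v"): "jump",
    ("jumped", "v"): "jump",
    ("jumping", "v"): "jump",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); unknown pairs come back unchanged."""
    return TOY_LEMMAS.get((word, pos), word)


for word in ["jump", "jumped", "jumps", "jumping"]:
    print(word, "| as noun:", toy_lemmatize(word), "| as verb:", toy_lemmatize(word, "v"))
```

Read as nouns, only ‘jumps’ changes, which matches the output above. Read as verbs, all four words collapse to ‘jump’.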

With these basic techniques, your journey to clean, NLP input-ready data is underway. This guide will now get into some more specific tips, but if you think you’re already primed and ready to analyze your data, check out MonkeyLearn’s full suite of no-code analysis tools.

Further Text Cleaning Tips & Methods

With markets more accessible and competitive than ever before, it’s the small things that will make the biggest difference. For this reason, companies need unique, tailor-made approaches to their customer experiences, customer service strategies, and yes – even their text cleaning.

Here are some methods to further hone your text cleaning approach to your needs.

Part of Speech (POS) Tagging

There are eight main parts of speech, and using NLTK to tag each within our data allows us to glean further useful insight from our text.

For instance, by tagging and grouping our adjectives, we can calculate the most and least used descriptors, which points us towards our products’ strengths and weaknesses.

Each part of speech has its own unique POS tag. Here you can see some examples:

Part of Speech    Tag
Noun              N
Verb              V
Adjective         ADJ
Adverb            ADV
Preposition       P
Conjunction       CON
Pronoun           PRO
Interjection      INT

Thankfully, NLTK has a built-in program to tag your text for you.

The following Python code labels each word in a given text with a POS tag.

The first step is to tokenize our sentence (split it into words):

INPUT:

amazon package never arrived fix asap

PYTHON CODE:

import nltk
nltk.download('punkt')  # tokenizer models; recent NLTK releases may ask for 'punkt_tab' instead

tokens = nltk.word_tokenize("amazon package never arrived fix asap")

print(tokens)

OUTPUT:

['amazon', 'package', 'never', 'arrived', 'fix', 'asap']

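Once we have these tokens we can tag each word with its corresponding part of speech. Note that NLTK's default tagger returns Penn Treebank tags (such as NN for a noun, JJ for an adjective, RB for an adverb, and VBD for a past-tense verb) rather than the simplified tags in the table above: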

INPUT:

['amazon', 'package', 'never', 'arrived', 'fix', 'asap']

PYTHON CODE:

import nltk
nltk.download('averaged_perceptron_tagger')  # recent NLTK releases may ask for 'averaged_perceptron_tagger_eng'

tokens = ['amazon', 'package', 'never', 'arrived', 'fix', 'asap']
pos = nltk.pos_tag(tokens)

print(pos)

OUTPUT:

[('amazon', 'JJ'), ('package', 'NN'), ('never', 'RB'), ('arrived', 'VBD'), ('fix', 'JJ'), ('asap', 'NN')]

Using this kind of code, we can now tabulate the POS totals for large bodies of text!
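As a minimal sketch of that tabulation, the standard library's collections.Counter can total the tags and pull out the most common adjectives. The tagged list below is simply the output from the previous step pasted in; for a larger body of text you would feed in the result of nltk.pos_tag instead.

```python
from collections import Counter

# (word, tag) pairs copied from the tagging output above.
tagged = [
    ("amazon", "JJ"), ("package", "NN"), ("never", "RB"),
    ("arrived", "VBD"), ("fix", "JJ"), ("asap", "NN"),
]

# Total number of words per POS tag.
tag_totals = Counter(tag for _, tag in tagged)
print(tag_totals.most_common())

# Frequency of each adjective (Penn Treebank adjective tags start with "JJ").
adjectives = Counter(word for word, tag in tagged if tag.startswith("JJ"))
print(adjectives.most_common())
```

On a real data set, the adjective counts are what would point towards the most and least used descriptors mentioned earlier.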

Further Sorting

Text cleaning has three further sorting functions that may be of use:

  1. Translation
  2. Typo Correction
  3. Number Unification

Translation is the most obvious form of standardization, and it also counts as part of text cleaning. You will want to have all your text in the same language so that it can be properly analyzed using the same machine parameters.

In the pursuit of this, it is of utmost importance to take account of linguistic differences when translating other languages to your base language of choice. Not all languages have the same descriptors, and verbs that translate the same often diverge in meaning to a native speaker.

Typo Correction, while obvious, should be one of the first steps, carried out before any of the previously mentioned major text cleaning steps. Often, social media posts and reviews are riddled with deliberately misspelled words (like ‘biz’ instead of ‘business’, ‘wiv’ instead of ‘with’, ‘woz’ instead of ‘was’, and ‘da’ instead of ‘the’), as well as accidental spelling errors.

While it might seem as simple as dropping the misspelled words, those words could convey important meaning, so keeping a catalogue of common misspellings and correcting as much as possible is crucial.
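A catalogue of misspellings can be as simple as a dictionary. The sketch below uses the four examples from this section; correct_typos and COMMON_MISSPELLINGS are illustrative names, and a real catalogue would be far longer and built from your own data.

```python
import re

# Catalogue of known misspellings -> corrections (examples from this guide).
COMMON_MISSPELLINGS = {"biz": "business", "wiv": "with", "woz": "was", "da": "the"}


def correct_typos(text: str, corrections: dict = COMMON_MISSPELLINGS) -> str:
    """Replace whole words that appear in the corrections catalogue."""
    def fix(match):
        word = match.group(0)
        # Look the word up in lowercase; leave it untouched if it is not listed.
        return corrections.get(word.lower(), word)

    # Match whole runs of letters so "da" inside "data" is not touched.
    return re.sub(r"[A-Za-z]+", fix, text)


print(correct_typos("da biz woz great wiv support"))
# the business was great with support
```

Matching whole words rather than substrings is the important design choice here, since a plain str.replace would also rewrite the ‘da’ in ‘data’.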

Finally, Number Unification is absolutely essential: if your numbers aren’t standardized, you have bad data. As a subset of data preparation, standardizing addresses and phone numbers so that they are in the same format ensures your data analysis will be accurate and not ruined by a couple of entries where users put their street and city in the same field.
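As a rough illustration of number unification, the helper below rewrites phone numbers into one format. It assumes US-style 10-digit numbers with an optional leading country code of 1, which is an assumption made for this example; the function name unify_us_phone is hypothetical, and other countries would need different rules.

```python
import re
from typing import Optional


def unify_us_phone(raw: str) -> Optional[str]:
    """Return the number as XXX-XXX-XXXX, or None if it is not a 10-digit number."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    # Drop a leading US country code if one is present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None  # flag for manual review instead of guessing
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


for raw in ["(555) 123-4567", "+1 555.123.4567", "5551234567", "12345"]:
    print(raw, "->", unify_us_phone(raw))
```

Returning None for anything unexpected keeps malformed entries visible, so they can be reviewed rather than silently passed along as bad data.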

Takeaways

Any text cleaning approach is about attention to detail and boiling your data down to only its most crucial bits without losing its context, and that’s a hard balance to strike.

That’s what makes text cleaning fascinating and, now that we have the help of AI tools to sort through millions of lines for us, innovative and effective approaches to text cleaning are right at our fingertips.

Once our data is clean and prepared, it’s time to remember why we went through all of that trouble in the first place.

MonkeyLearn provides a suite of self-service no-code tools ready to analyze any data set.

Sign up for a free demo, or explore our full suite of tools via our built-in Python API today.

Inés Roldós

May 31st, 2021
