Text Classification: What it is And Why it Matters

Text can be an extremely rich source of information, but extracting insights from it can be hard and time-consuming, due to its unstructured nature.

But, thanks to advances in natural language processing and machine learning, which both fall under the vast umbrella of artificial intelligence, sorting text data is getting easier.

It works by automatically analyzing and structuring text, quickly and cost-effectively, so businesses can automate processes and discover insights that lead to better decision-making.

Read on to learn more about text classification, how it works, and how easy it is to get started with no-code text classification tools like MonkeyLearn's sentiment analyzer.

What is Text Classification?

Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. Text classifiers can be used to organize, structure, and categorize pretty much any kind of text, from documents, medical studies, and files to content from all over the web.

For example, news articles can be organized by topics; support tickets can be organized by urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment; and so on.

Text classification is one of the fundamental tasks in natural language processing with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Here’s an example of how it works:

“The user interface is quite straightforward and easy to use.”

A text classifier can take this phrase as an input, analyze its content, and then automatically assign relevant tags, such as UI and Easy To Use.

Input text is processed by a text classification model and delivers output tags.

Why is Text Classification Important?

It’s estimated that around 80% of all information is unstructured, with text being one of the most common types of unstructured data. Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so most companies fail to use it to its full potential.

This is where text classification with machine learning comes in. Using text classifiers, companies can automatically structure all manner of relevant text, such as emails, legal documents, social media, chatbots, surveys, and more, in a fast and cost-effective way. This allows companies to save time analyzing text data, automate business processes, and make data-driven business decisions.

Why use machine learning text classification? Some of the top reasons:

Scalability

Manually analyzing and organizing text is slow and much less accurate. Machine learning can automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few minutes. Text classification tools are scalable to any business needs, large or small.

Real-time analysis

There are critical situations that companies need to identify as soon as possible and take immediate action (e.g., PR crises on social media). Machine learning text classification can follow your brand mentions constantly and in real time, so you'll identify critical information and be able to take action right away.

Consistent criteria

Human annotators make mistakes when classifying text data due to distractions, fatigue, and boredom, and human subjectivity creates inconsistent criteria. Machine learning, on the other hand, applies the same lens and criteria to all data and results. Once a text classification model is properly trained, it applies those criteria consistently and can reach high accuracy.

How Does Text Classification Work?

You can perform text classification in two ways: manual or automatic.

Manual text classification involves a human annotator, who interprets the content of text and categorizes it accordingly. This method can deliver good results but it’s time-consuming and expensive.

Automatic text classification applies machine learning, natural language processing (NLP), and other AI-guided techniques to automatically classify text in a faster, more cost-effective, and more accurate manner.

In this guide, we’re going to focus on automatic text classification.

There are many approaches to automatic text classification, but they all fall under three types of systems:

Rule-based systems
Machine learning-based systems
Hybrid systems

Rule-based systems

Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content. Each rule consists of an antecedent or pattern and a predicted category.

Say that you want to classify news articles into two groups: Sports and Politics. First, you’ll need to define two lists of words that characterize each group (e.g., words related to sports such as football, basketball, LeBron James, etc., and words related to politics, such as Donald Trump, Hillary Clinton, Putin, etc.).

Next, when you want to classify a new incoming text, you’ll need to count the number of sport-related words that appear in the text and do the same for politics-related words. If the number of sports-related word appearances is greater than the politics-related word count, then the text is classified as Sports and vice versa.

For example, this rule-based system will classify the headline “When is LeBron James' first game with the Lakers?” as Sports because it counted one sports-related term (LeBron James) and it didn’t count any politics-related terms.
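The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a production system: the two keyword lists are the examples from this guide, and returning "Unknown" on a tie is an assumption, since the rule as described doesn't say what happens when the counts are equal.

```python
import re

# Keyword lists taken from the example above; a real system would need far larger lists.
SPORTS_TERMS = ["football", "basketball", "lebron james"]
POLITICS_TERMS = ["donald trump", "hillary clinton", "putin"]


def count_terms(text, terms):
    """Count case-insensitive, whole-word appearances of any term in the text."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in terms
    )


def classify(text):
    """Return the category whose keywords appear most often in the text."""
    sports = count_terms(text, SPORTS_TERMS)
    politics = count_terms(text, POLITICS_TERMS)
    if sports > politics:
        return "Sports"
    if politics > sports:
        return "Politics"
    return "Unknown"  # assumption: ties are left unclassified


headline = "When is LeBron James' first game with the Lakers?"
print(classify(headline))  # Sports
```

Even in this tiny sketch the maintenance problem is visible: every new topic or name means editing the lists by hand and re-checking that earlier results still hold.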

Rule-based systems are human comprehensible and can be improved over time. But this approach has some disadvantages. For starters, these systems require deep knowledge of the domain. They are also time-consuming, since generating rules for a complex system can be quite challenging and usually requires a lot of analysis and testing. Rule-based systems are also difficult to maintain and don’t scale well given that adding new rules can affect the results of the pre-existing rules.

Machine learning-based systems

Instead of relying on manually crafted rules, machine learning text classification learns to make classifications based on past observations. By using pre-labeled examples as training data, machine learning algorithms can learn the different associations between pieces of text, and that a particular output (i.e., tags) is expected for a particular input (i.e., text). A “tag” is the pre-determined classification or category that any given text could fall into.

The first step towards training a machine learning NLP classifier is feature extraction: a method is used to transform each text into a numerical representation in the form of a vector. One of the most frequently used approaches is bag of words, where a vector represents the frequency of a word in a predefined dictionary of words.

For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize the text “This is awesome,” we would have the following vector representation of that text: (1, 1, 0, 0, 1, 0, 0).
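As a rough sketch, the bag-of-words vector from this example can be reproduced with plain Python. The lowercasing and the simple regular-expression tokenizer are assumptions made for illustration; real feature extractors handle punctuation, stop words, and unknown words more carefully.

```python
import re

# The dictionary from the example above, lowercased so matching is case-insensitive.
DICTIONARY = ["this", "is", "the", "not", "awesome", "bad", "basketball"]


def vectorize(text, dictionary=DICTIONARY):
    """Return how many times each dictionary word appears in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tokens.count(word) for word in dictionary]


print(vectorize("This is awesome"))  # [1, 1, 0, 0, 1, 0, 0]
```

Words that are not in the dictionary are simply ignored, which is why the choice of dictionary matters so much for this representation.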

Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets (vectors for each text example) and tags (e.g. sports, politics) to produce a classification model.

Once it’s trained with enough training samples, the machine learning model can begin to make accurate predictions. The same feature extractor is used to transform unseen text to feature sets, which can be fed into the classification model to get predictions on tags (e.g., sports, politics):

Prediction process in Text Classification

Text classification with machine learning is usually much more accurate than human-crafted rule systems, especially on complex NLP classification tasks. Also, classifiers with machine learning are easier to maintain and you can always tag new examples to learn new tasks.

Machine Learning Text Classification Algorithms

Some of the most popular text classification algorithms include the Naive Bayes family of algorithms, support vector machines (SVM), and deep learning.

Naive Bayes

The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text analysis, overall.

One member of that family is Multinomial Naive Bayes (MNB). Its big advantage is that you can get really good results even when your dataset isn’t very large (~ a couple of thousand tagged samples) and computational resources are scarce.

Naive Bayes is based on Bayes’s Theorem, which helps us compute the conditional probabilities of the occurrence of two events, based on the probabilities of the occurrence of each individual event. So we’re calculating the probability of each tag for a given text, and then outputting the tag with the highest probability.

Naive Bayes formula: P(A|B) = P(B|A) × P(A) / P(B)

The probability of A, if B is true, is equal to the probability of B, if A is true, times the probability of A being true, divided by the probability of B being true.

This means that any vector that represents a text will have to contain information about the probabilities of the appearance of certain words within the texts of a given category, so that the algorithm can compute the likelihood of that text belonging to the category.
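To make this concrete, here is a minimal from-scratch sketch of a Multinomial Naive Bayes classifier. The four training sentences are made up for illustration, the whitespace tokenizer is deliberately naive, and add-one (Laplace) smoothing is an assumption added so that a word unseen in one category doesn't force its probability to zero. Log probabilities are summed instead of multiplying raw probabilities, which avoids numeric underflow on longer texts.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(samples):
    """samples: list of (text, tag) pairs. Returns the counts needed for prediction."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, tag in samples:
        tag_counts[tag] += 1
        for word in tokenize(text):
            word_counts[tag][word] += 1
            vocabulary.add(word)
    return tag_counts, word_counts, vocabulary


def predict(text, model):
    """Return the tag with the highest (log) probability for the text."""
    tag_counts, word_counts, vocabulary = model
    total_samples = sum(tag_counts.values())
    scores = {}
    for tag in tag_counts:
        score = math.log(tag_counts[tag] / total_samples)  # log P(tag)
        total_words = sum(word_counts[tag].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no information
            # log P(word | tag) with add-one smoothing
            score += math.log(
                (word_counts[tag][word] + 1) / (total_words + len(vocabulary))
            )
        scores[tag] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set
samples = [
    ("a great game of basketball", "Sports"),
    ("the football match was great", "Sports"),
    ("the election was close", "Politics"),
    ("a very close vote in the senate", "Politics"),
]
model = train(samples)
print(predict("a close election", model))       # Politics
print(predict("great basketball game", model))  # Sports
```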

Take a look at this blog post to learn more about Naive Bayes.

Support Vector Machines

Support Vector Machines (SVM) is another powerful text classification machine learning algorithm because, like Naive Bayes, SVM doesn’t need much training data to start providing accurate results. SVM does, however, require more computational resources than Naive Bayes, but its results are often more accurate.

In short, SVM draws a line or “hyperplane” that divides a space into two subspaces. One subspace contains vectors (tags) that belong to a group, and another subspace contains vectors that do not belong to that group.

The optimal hyperplane is the one with the largest distance (margin) to the nearest vectors of each tag. In two dimensions it looks like this:

Those vectors are representations of your training texts, and a group is a tag you have tagged your texts with.
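The geometry can be illustrated without training anything. The sketch below does not implement SVM training; it only shows how a given hyperplane (a weight vector w and an offset b) splits points into two groups, and how the margin lets you compare two candidate hyperplanes. The points and both candidate lines are made-up values.

```python
import math


def decision_value(w, b, x):
    """Value of w·x + b; its sign tells which side of the hyperplane x falls on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    return "in group" if decision_value(w, b, x) >= 0 else "not in group"


def margin(w, b, points):
    """Smallest geometric distance from any of the points to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in points)


# Hypothetical 2D vectors standing in for vectorized training texts
in_group = [(3.0, 3.0), (4.0, 2.0)]
not_in_group = [(0.0, 0.0), (1.0, 0.0)]
points = in_group + not_in_group

# Two candidate lines that both separate the groups correctly
candidate_a = ((1.0, 1.0), -3.5)  # the line x + y = 3.5
candidate_b = ((1.0, 0.0), -1.5)  # the line x = 1.5

print(round(margin(*candidate_a, points), 3))  # 1.768
print(round(margin(*candidate_b, points), 3))  # 0.5
```

Both lines classify every point correctly, but the first keeps more distance from the nearest points, so it is the better of these two candidates. An SVM searches for the hyperplane that maximizes this distance.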

As data gets more complex, it may not be possible to separate the vectors/tags into two groups with a single straight line.

But that’s the great thing about SVM algorithms: they’re “multi-dimensional.” The data can be mapped into a higher-dimensional space where a flat separating hyperplane does exist. Imagine the above in three dimensions, with an added Z-axis: a flat plane in that space can correspond to a circle in the original two dimensions.

Mapped back to two dimensions, the ideal hyperplane looks like this:

Optimal hyperplane in two dimensions with more complex data.

Deep Learning

Deep learning is a set of algorithms and techniques, called neural networks, inspired by how the human brain works. Deep learning architectures offer huge benefits for text classification because they can reach very high accuracy with less manual feature engineering.

The two main deep learning architectures for text classification are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

Deep learning is hierarchical machine learning, using multiple algorithms in a progressive chain of events. It’s similar to how the human brain works when making decisions, using different techniques simultaneously to process huge amounts of data.

Deep learning algorithms do require much more training data than traditional machine learning algorithms (at least millions of tagged examples). However, unlike traditional machine learning algorithms such as SVM and NB, they don’t have a threshold for learning from training data. Deep learning classifiers continue to get better the more data you feed them:

Deep Learning vs Traditional Machine Learning algorithms

Deep learning algorithms like Word2Vec or GloVe are also used to obtain better vector representations for words and improve the accuracy of classifiers trained with traditional machine learning algorithms.

Hybrid Systems

Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further improve the results. These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that haven’t been correctly modeled by the base classifier.
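A hybrid setup can be sketched as a thin layer of rules in front of a trained model. In this illustration the base classifier is a deliberately simplistic stand-in (a hypothetical function, not a real trained model), and the single override rule targets a case the stand-in gets wrong.

```python
def base_classifier(text):
    """Hypothetical stand-in for a trained machine learning model."""
    return "Positive" if "good" in text.lower() else "Negative"


# Hand-written rules for cases the base classifier handles badly: (pattern, tag)
RULES = [
    ("not good", "Negative"),
]


def hybrid_classify(text):
    """Apply the override rules first, then fall back to the base classifier."""
    lowered = text.lower()
    for pattern, tag in RULES:
        if pattern in lowered:
            return tag
    return base_classifier(text)


print(base_classifier("This is not good"))  # Positive (the base model's mistake)
print(hybrid_classify("This is not good"))  # Negative (corrected by the rule)
print(hybrid_classify("This is good"))      # Positive
```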

Metrics and Evaluation

Cross-validation is a common method to evaluate the performance of a text classifier. It works by splitting the training dataset into random, equal-length example sets (e.g., 4 sets with 25% of the data). For each set, a text classifier is trained with the remaining samples (e.g., 75% of the samples). Next, the classifiers make predictions on their respective sets, and the results are compared against the human-annotated tags. This will determine when a prediction was right (true positives and true negatives) and when it made a mistake (false positives, false negatives).
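The splitting step might look like the following sketch. The function name, the fixed random seed, and the use of integers as stand-ins for tagged samples are all assumptions for illustration; training and scoring a classifier on each split is left out.

```python
import random


def k_fold_splits(samples, k=4, seed=0):
    """Yield (train, test) pairs, using each of the k folds once as the test set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


samples = list(range(20))  # stand-ins for 20 tagged texts
for train, test in k_fold_splits(samples, k=4):
    print(len(train), len(test))  # 15 5 on every iteration
```

With 4 folds, each classifier is trained on 75% of the samples and evaluated on the remaining 25%, and every sample is used for testing exactly once.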

With these results, you can build performance metrics that are useful for a quick assessment on how well a classifier works:

Accuracy: the percentage of texts that were categorized with the correct tag.
Precision: the percentage of examples the classifier got right out of the total number of examples that it predicted for a given tag.
Recall: the percentage of examples the classifier predicted for a given tag out of the total number of examples it should have predicted for that given tag.
F1 Score: the harmonic mean of precision and recall.
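These four metrics can be computed directly from the human-annotated tags and the predicted tags. The sketch below uses made-up tags; returning 0.0 when a denominator is zero is a convention chosen here, not something this guide prescribes.

```python
def evaluate(true_tags, predicted_tags, tag):
    """Compute accuracy overall, and precision, recall, and F1 for one tag."""
    pairs = list(zip(true_tags, predicted_tags))
    tp = sum(1 for t, p in pairs if t == tag and p == tag)  # true positives
    fp = sum(1 for t, p in pairs if t != tag and p == tag)  # false positives
    fn = sum(1 for t, p in pairs if t == tag and p != tag)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations and predictions
true_tags = ["Sports", "Sports", "Sports", "Politics", "Politics", "Politics"]
predicted = ["Sports", "Sports", "Politics", "Politics", "Politics", "Politics"]

scores = evaluate(true_tags, predicted, "Sports")
print({name: round(value, 3) for name, value in scores.items()})
# {'accuracy': 0.833, 'precision': 1.0, 'recall': 0.667, 'f1': 0.8}
```

Here every text tagged Sports by the classifier really was Sports (precision 1.0), but it missed one of the three Sports texts (recall 0.667), which is exactly the kind of trade-off that F1 summarizes.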

Text Classification Examples

Text classification can be used in a broad range of contexts such as classifying short texts (e.g., tweets, headlines, chatbot queries, etc.) or organizing much larger documents (e.g., customer reviews, news articles, legal contracts, longform customer surveys, etc.). Some of the most well-known examples of text classification include sentiment analysis, topic labeling, language detection, and intent detection.

Sentiment Analysis

Perhaps the most popular example of text classification is sentiment analysis (or opinion mining): the automated process of reading a text for opinion polarity (positive, negative, neutral, and beyond). Companies use sentiment classifiers on a wide range of applications, like product analytics, brand monitoring, market research, customer support, workforce analytics, and much more.

Sentiment analysis allows you to automatically analyze all forms of text for the feeling and emotion of the writer.

Try out this pre-trained sentiment classifier with your own text to see just how easy it is to do.

If you see an odd result, don’t worry, it’s just because it hasn’t been trained (yet) with similar expressions. For super accurate results trained to the specific language and criteria of your business, follow this quick sentiment analysis tutorial to build a custom sentiment analysis model in just five steps.

Topic Labeling

Another common example of text classification is topic labeling, that is, understanding what a given text is talking about. It’s often used for structuring and organizing data, such as organizing customer feedback by topic or organizing news articles by subject.

Try out this pre-trained model for classifying NPS responses for SaaS products according to their topic. It tags customer feedback by categories: Customer Support, Ease of Use, Features, and Pricing.

Learn more about topic labeling and how to build a custom multi-label text classifier.

Language Detection

Language detection is another great example of text classification, that is, the process of classifying incoming text according to its language. These text classifiers are often used for routing purposes (e.g., route support tickets according to their language to the appropriate team).

One such classifier is trained to detect 49 different languages in text.

Intent Detection

Intent detection or intent classification is another great use case for text classification that analyzes text to understand the reason behind feedback. Maybe it’s a complaint, or maybe a customer is expressing intent to purchase a product. It’s used for customer service, marketing email responses, generating product analytics, and automating business practices. Intent detection with machine learning can read emails and chatbot conversations and automatically route them to the correct department.

Try out this email intent classifier that’s trained to detect the intent of email replies. It classifies with tags: Interested, Not Interested, Unsubscribe, Wrong Person, Email Bounce, and Autoresponder.

Text Classification Applications & Use Cases

Text classification has thousands of use cases and is applied to a wide range of tasks. In some cases, data classification tools work behind the scenes to enhance app features we interact with on a daily basis (like email spam filtering). In some other cases, classifiers are used by marketers, product managers, engineers, and salespeople to automate business processes and save hundreds of hours of manual data processing.

Some of the top applications and use cases of text classification include:

Detecting urgent issues
Automating customer support processes
Listening to Voice of Customer (VoC)

Detecting Urgent Issues

On Twitter alone, users send 500 million tweets every day.

And surveys show that 83% of customers who comment or complain on social media expect a response the same day, with 18% expecting it to come immediately.

With the help of text classification, businesses can make sense of large amounts of data using techniques like aspect-based sentiment analysis to understand what people are talking about and how they’re talking about each aspect. This helps surface urgent issues such as a potential PR crisis, a customer who’s about to churn, or complaints about a bug or downtime affecting more than a handful of customers.

Automating Customer Support Processes

Building a good customer experience is one of the foundations of a sustainable and growing company. According to Hubspot, people are 93% more likely to be repeat customers at companies with excellent customer service. The study also unveiled that 80% of respondents said they had stopped doing business with a company because of a poor customer experience.

Text classification can help support teams provide a stellar experience by automating tasks that are better left to computers, saving precious time that can be spent on more important things.

For instance, text classification is often used for automating ticket routing and triaging. Text classification allows you to automatically route support tickets to a teammate with specific product expertise. If a customer writes in asking about refunds, you can automatically assign the ticket to the teammate with permission to perform refunds. This will ensure the customer gets a quality response more quickly.

Support teams can also use sentiment classification to automatically detect the urgency of a support ticket and prioritize those that contain negative sentiments. This can help you lower customer churn and even turn a bad situation around.

Listening to Voice of Customer (VoC)

Companies leverage surveys such as Net Promoter Score to listen to the voice of their customers at every stage of the journey.

The information gathered is both qualitative and quantitative, and while NPS scores are easy to analyze, open-ended responses require a more in-depth analysis using text classification techniques. Instead of relying on humans to analyze voice of customer data, you can quickly process open-ended customer feedback with machine learning. Classification models can help you analyze survey results to discover patterns and insights like:

What do people like about our product or service?
What should we improve?
What do we need to change?

By combining both quantitative results and qualitative analyses, teams can make more informed decisions without having to spend hours manually analyzing every single open-ended response.

Text Classification Resources

Once you start to automate manual and repetitive tasks using all manner of text classification techniques, you can focus on other areas of your business.

But… how the heck do you get started with text classification? There’s so much information about text analysis, machine learning, and natural language processing that it can be overwhelming.

At MonkeyLearn, we make it easy for you to know where to start. We provide a no-code text classifier builder, so you can build your very own text classifier in a few simple steps.

Building your first text classifier can help you really understand the benefits of text classification, but before we go into more detail about what MonkeyLearn can do, let’s take a look at what you’ll need to create your own text classification model:

1. Datasets

A text classifier is worthless without accurate training data. Machine learning algorithms can only make accurate predictions by learning from previous examples.

You show an algorithm examples of correctly tagged data, and it uses that tagged data to make predictions on unseen text.

Say you want to predict the intent of chat conversations; you’ll need to identify and gather chat conversations that represent the different intents you want to predict. If you train your model with another type of data, the classifier will provide poor results.

So, how do you get training data?

You can use internal data generated from the apps and tools you use every day, like CRMs (e.g. Salesforce, Hubspot), chat apps (e.g. Slack, Drift, Intercom), help desk software (e.g. Zendesk, Freshdesk, Front), survey tools (e.g. SurveyMonkey, Typeform, Google Forms), and customer satisfaction tools (e.g. Promoter.io, Retently, Satismeter). These tools usually provide an option to export data in a CSV file that you can use to train your classifier.
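Once you have such an export, turning it into (text, tag) training pairs takes only a few lines with Python's built-in csv module. The column names "text" and "tag" and the two sample rows are hypothetical; an actual export will have its own column names, and you would pass an opened file instead of the in-memory string used here.

```python
import csv
import io

# Hypothetical CSV export; real exports will use different columns and many more rows.
EXPORT = """text,tag
"I can't log in to my account",Bug
"How much does the pro plan cost?",Pricing
"""


def load_training_data(csv_file, text_column="text", tag_column="tag"):
    """Read (text, tag) pairs from a CSV file object, skipping incomplete rows."""
    reader = csv.DictReader(csv_file)
    return [
        (row[text_column].strip(), row[tag_column].strip())
        for row in reader
        if row[text_column].strip() and row[tag_column].strip()
    ]


samples = load_training_data(io.StringIO(EXPORT))
print(samples)
```

For a file on disk, the same function could be called as `load_training_data(open(path, newline="", encoding="utf-8"))`, where `path` points to your export.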

Another option is using external data from throughout the web, either by using web scraping, APIs, or public datasets.

The following are some publicly available datasets you can use for building your first text classifier and start experimenting right away.

Topic classification:

Reuters news dataset: probably one of the most widely used datasets for text classification; it contains 21,578 news articles from Reuters labeled with 135 categories according to their topic, such as Politics, Economics, Sports, and Business.
20 Newsgroups: another popular dataset that consists of ~20,000 documents across 20 different topics.

Sentiment analysis:

Amazon Product Reviews: a well-known dataset that contains ~143 million reviews and star ratings (1 to 5 stars) spanning May 1996 - July 2014. You can get an alternative dataset for Amazon product reviews here.
IMDB reviews: a much smaller dataset with 25,000 movie reviews labeled as positive and negative from the Internet Movie Database (IMDB).
Twitter Airline Sentiment: this dataset contains around 15,000 tweets about airlines labeled as positive, neutral, and negative.

Other popular datasets:

Spambase: a dataset with 4,601 emails labeled as spam and not spam.
SMS Spam Collection: another dataset for spam detection that consists of 5,574 SMS messages tagged as spam or legitimate.
Hate speech and offensive language: this dataset contains 24,802 labeled tweets organized into three categories: clean, hate speech, and offensive language.

2. Text Classification Tools

Alright. Now that you have training data, it's time to feed it to a machine learning algorithm and create a text classifier.

So, how do we do this?

Luckily, many resources can help you during the different phases of the process, i.e., transforming texts into vectors, training a machine learning algorithm, and using a model to make predictions. Broadly speaking, these tools can be classified into two different categories: open-source libraries and SaaS tools.

It’s an ongoing debate: Build vs. Buy. Open-source libraries can perform among the upper echelon of machine learning text classification tools, but they’re costly and time-consuming to build and require years of data science and computer engineering experience.

SaaS tools, on the other hand, require little to no code, are completely scalable and much less costly, as you only use the tools you need. Best of all, most can be implemented right away and trained (often in just a few minutes) to perform just as fast and accurately.

Open-source libraries for text classification

One of the reasons machine learning has become mainstream is the myriad of open source libraries available to developers interested in applying it. Although they require a hefty data science and machine learning background, these libraries offer a fair level of abstraction and simplification. Python, Java, and R all offer a wide selection of machine learning libraries that are actively developed and provide a diverse set of features, performance, and capabilities.

Text Classification with Python

Python is usually the programming language of choice for developers and data scientists who work with machine learning models. The simple syntax, its massive community, and the scientific-computing friendliness of its mathematical libraries are some of the reasons why Python is so prevalent in the field.

Scikit-learn is one of the go-to libraries for general purpose machine learning. It supports many algorithms and provides simple and efficient features for working with text classification, regression, and clustering models. If you are a beginner in machine learning, scikit-learn is one of the most friendly libraries for getting started with text classification, with dozens of tutorials and step-by-step guides all over the web.

NLTK is a popular library focused on natural language processing (NLP) that has a big community behind it. It's super handy for text classification because it provides all kinds of useful tools for making a machine understand text, such as splitting paragraphs into sentences, splitting up words, and recognizing the part of speech of those words.

A newer NLP library is spaCy, a toolkit with a more minimal and straightforward approach than NLTK. For example, where NLTK offers several different stemmers to choose from, spaCy provides a single lemmatizer instead. spaCy has also integrated word embeddings, which can help boost accuracy in text classification.

Once you are ready to experiment with more complex algorithms, you should check out deep learning libraries like Keras, TensorFlow, and PyTorch. Keras is probably the best starting point as it's designed to simplify the creation of recurrent neural networks (RNNs) and convolutional neural networks (CNNs).

TensorFlow is one of the most popular open source libraries for implementing deep learning algorithms. Developed by Google and used by companies such as Dropbox, eBay, and Intel, this library is optimized for setting up, training, and deploying artificial neural networks with massive datasets. Although it’s harder to master than Keras, it’s a leading choice in the deep learning space. A reliable alternative to TensorFlow is PyTorch, an extensive deep learning library primarily developed by Facebook and backed by Twitter, Nvidia, Salesforce, Stanford University, University of Oxford, and Uber.

Text Classification with Java

Another programming language that is broadly used for implementing machine learning models is Java. Like Python, it has a big community, an extensive ecosystem, and a great selection of open source libraries for machine learning and NLP.

CoreNLP is the most popular framework for NLP in Java. Created by Stanford University, it provides a diverse set of tools for understanding human language such as a text parser, a part-of-speech (POS) tagger, a named entity recognizer (NER), a coreference resolution system, and information extraction tools.

Another popular toolkit for natural language tasks is OpenNLP. Created by The Apache Software Foundation, it provides a bunch of linguistic analysis tools useful for text classification such as tokenization, sentence segmentation, part-of-speech tagging, chunking, and parsing.

Weka is a machine learning library developed by the University of Waikato and contains many tools like classification, regression, clustering, and data visualization. It provides a graphical user interface for applying Weka’s collection of algorithms directly to a dataset, and an API to call these algorithms from your own Java code.

Text Classification with R

The R language is an approachable programming language that is becoming increasingly popular among machine learning enthusiasts. Historically, it has been most widely used among academics and statisticians for statistical analysis, graphics representation, and reporting. According to KDnuggets, it’s currently the second most popular programming language for analytics, data science, and machine learning (while Python is #1).

R is an excellent choice for text classification tasks as it provides an extensive, coherent, and integrated collection of tools for data analysis.

Caret is a comprehensive package for building machine learning models in R. Short for “Classification and Regression Training,” it offers a simple interface for applying different algorithms and contains useful tools for text classification, like pre-processing, feature selection, and model tuning.

Mlr is another R package that provides a standardized interface for using classification and regression algorithms along with their corresponding evaluation and optimization methods.

SaaS Text Classification APIs

Open source tools are great, but they are mostly targeted at people with a background in machine learning. Also, they don’t provide an easy way to deploy and scale machine learning models, clean and curate data, tag training examples, do feature engineering, or bootstrap models.

You might be wondering, is there an easier way?

Well, if you want to avoid these hassles, a great alternative is to use a Software as a Service (SaaS) for text classification which usually solves most of the problems mentioned above. Another advantage is that they don’t require machine learning experience and even people who don’t know how to code can use and consume text classifiers. At the end of the day, leaving the heavy lifting to a SaaS can save you time, money, and resources when implementing your text classification system.

Some of the most remarkable SaaS solutions and APIs for text classification include:

MonkeyLearn
Google Cloud NLP
IBM Watson
Lexalytics
MeaningCloud
Amazon Comprehend
Aylien

Text Classification Tutorial

MonkeyLearn is an all-in-one text data analysis and visualization tool that makes it super easy to categorize your text, whether you’re analyzing surveys, support tickets, reviews, or more. Once you’ve run your data through a series of analysis techniques, you’ll be able to see your results in striking detail.

In this tutorial we’re going to focus on analyzing and categorizing a set of reviews using sentiment and topic analysis. Follow along, then test our tools for yourself.

1. Choose the ‘Reviews Analysis’ template to create your workflow