Text Classification

A gentle introduction to classifying text

Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Unstructured data in the form of text is everywhere: emails, chats, web pages, social media, support tickets, survey responses, and more. Text can be an extremely rich source of information, but extracting insights from it can be hard and time-consuming due to its unstructured nature. Businesses are turning to text classification for structuring text in a fast and cost-efficient way to enhance decision-making and automate processes.

But, what is text classification? How does text classification work? What are the algorithms used for classifying text? What are the most common business applications?

We’ll try to answer those questions in this guide. Some of the topics you’ll find on this guide include the following:

  1. What is Text Classification?
  2. How Does Text Classification Work?

  3. Why is Text Classification Important?

  4. Examples

Let’s dive in!

What is Text Classification?

Text classification (a.k.a. text categorization or text tagging) is the task of assigning a set of predefined categories to free-text. Text classifiers can be used to organize, structure, and categorize pretty much anything. For example, new articles can be organized by topics, support tickets can be organized by urgency, chat conversations can be organized by language, brand mentions can be organized by sentiment, and so on.

As an example, take a look at the following text below:

“The user interface is quite straightforward and easy to use.”

A classifier can take this text as an input, analyze its content, and then and automatically assign relevant tags, such as UI and Easy To Use that represent this text:

What is Text Classification

How Does Text Classification Work?

Text classification can be done in two different ways: manually and automatically classification. In the former, a human annotator interprets the content of text and categorizes it accordingly. This method usually can provide quality results but it’s time-consuming and expensive. The latter applies machine learning, natural language processing, and other techniques to automatically classify text in a faster and more cost-effective way.

There are many approaches to automatic text classification, which can be grouped into three different types of systems:

  • Rule-based systems
  • Machine Learning based systems
  • Hybrid systems

Rule-based Systems

Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content. Each rule consists of an antecedent or pattern and a predicted category.

Say that you want to classify news articles into 2 groups, namely, Sports and Politics. First, you’ll need to define two lists of words that characterize each group (e.g. words related to sports such as football, basketball, LeBron James, etc., and words related to politics such as Donald Trump, Hillary Clinton, Putin, etc.). Next, when you want to classify a new incoming text, you’ll need to count the number of sport-related words that appear in the text and do the same for politics-related words. If the number of sport-related word appearances is greater than the number of politics-related word count, then the text is classified as sports and vice versa.

For example, this rule-based system will classify the headline “When is LeBron James' first game with the Lakers?” as Sports because it counted 1 sport-related term (Lebron James) and it didn’t count any politics-related terms.

Rule-based systems are human comprehensible and can be improved over time. But this approach has some disadvantages. For starters, these systems require deep knowledge of the domain. They are also time-consuming, since generating rules for a complex system can be quite challenging and usually requires a lot of analysis and testing. Rule-based systems are also difficult to maintain and don’t scale well given that adding new rules can affect the results of the pre-existing rules.

Machine Learning Based Systems

Instead of relying on manually crafted rules, text classification with machine learning learns to make classifications based on past observations. By using pre-labeled examples as training data, a machine learning algorithm can learn the different associations between pieces of text and that a particular output (i.e. tags) is expected for a particular input (i.e. text).

The first step towards training a classifier with machine learning is feature extraction: a method is used to transform each text into a numerical representation in the form of a vector. One of the most frequently used approaches is bag of words, where a vector represents the frequency of a word in a predefined dictionary of words.

For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize the text “This is awesome”, we would have the following vector representation of that text: (1, 1, 0, 0, 1, 0, 0).

Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets (vectors for each text example) and tags (e.g. sports, politics) to produce a classification model:

Training process in Text Classification

Once it’s trained with enough training samples, the machine learning model can begin to make accurate predictions. The same feature extractor is used to transform unseen text to feature sets which can be fed into the classification model to get predictions on tags (e.g. sports, politics):

Prediction process in Text Classification

Text classification with machine learning is usually much more accurate than human-crafted rule systems, especially on complex classification tasks. Also, classifiers with machine learning are easier to maintain and you can always tag new examples to learn new tasks.

Text Classification Algorithms

Some of the most popular machine learning algorithms for creating text classification models include the naive bayes family of algorithms, support vector machines, and deep learning.

Naive Bayes

Naive Bayes is a family of statistical algorithms we can make use of when doing text classification. One of the members of that family is Multinomial Naive Bayes (MNB). One of its main advantages is that you can get really good when data available is not much (~ a couple of thousand tagged samples) and computational resources are scarce.

All you need to know is that Naive Bayes is based on Bayes’s Theorem, which helps us compute the conditional probabilities of occurrence of two events based on the probabilities of occurrence of each individual event. This means that any vector that represents a text will have to contain information about the probabilities of appearance of the words of the text within the texts of a given category so that the algorithm can compute the likelihood of that text’s belonging to the category.

Support Vector Machines

Support Vector Machines (SVM) is just one out of many algorithms we can choose from when doing text classification. Like naive bayes, SVM doesn’t need much training data to start providing accurate results. Although it needs more computational resources than naive bayes, SVM can achieve more accurate results.

In short, SVM takes care of drawing a “line” or hyperplane that divides a space into two subspaces: one subspace that contains vectors that belong to a group and another subspace that contains vectors that do not belong to that group. Those vectors are representations of your training texts and a group is a tag you have tagged your texts with.

Deep Learning

Deep learning is a set of algorithms and techniques inspired by how the human brain works. Text classification has benefited from the recent resurgence of deep learning architectures due to their potential to reach high accuracy with less need of engineered features. The two main deep learning architectures used in text classification are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

On the one hand, deep learning algorithms require much more training data than traditional machine learning algorithms, i.e. at least millions of tagged examples. On the other hand, traditional machine learning algorithms such as SVM and NB reach a certain threshold where adding more training data doesn’t improve their accuracy. In contrast, deep learning classifiers continue to get better the more data you feed them with:

Deep Learning vs Traditional Machine Learning algorithms

Deep learning algorithms such as Word2Vec or GloVe are also used in order to obtain better vector representations for words and improve the accuracy of classifiers trained with traditional machine learning algorithms.

Hybrid Systems

Hybrids systems combine a base classifier trained with machine learning and a rule-based system, which is used to further improve the results. These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that haven’t been correctly modeled by the base classifier.

Metrics and Evaluation

Cross-validation is a common method to evaluate the performance of a text classifier. It consists in splitting the training dataset randomly into equal-length sets of examples (e.g. 4 sets with 25% of the data). For each set, a text classifier is trained with the remaining samples (e.g. 75% of the samples). Next, the classifiers make predictions on their respective sets and the results are compared against the human-annotated tags. This allows finding when a prediction was right (true positives and true negatives) and when it made a mistake (false positives, false negatives).

With these results, you can build performance metrics that are useful for a quick assessment on how well a classifier works:

  • Accuracy: the percentage of texts that were predicted with the correct tag.
  • Precision: the percentage of examples the classifier got right out of the total number of examples that it predicted for a given tag.
  • Recall: the percentage of examples the classifier predicted for a given tag out of the total number of examples it should have predicted for that given tag.
  • F1 Score: the harmonic mean of precision and recall.

Why is Text Classification Important?

According to IBM, it is estimated that around 80% of all information is unstructured, with text being one of the most common types of unstructured data. Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data is hard and time-consuming so most companies fail to extract value from that.

This is where text classification with machine learning steps in. By using text classifiers, companies can structure business information such as email, legal documents, web pages, chat conversations, and social media messages in a fast and cost-effective way. This allows companies to save time when analyzing text data, help inform business decisions, and automate business processes.

Some of the reasons why companies are leveraging text classification with machine learning are the following:

  • Scalability

Manually analyzing and organizing text takes time. It’s a slow process where a human needs to read each text and decide how to structure it. Machine learning changes this and enables to easily analyze millions of texts at a fraction of a cost.

  • Real-time analysis

There are critical situations that companies need to identify as soon as possible and take immediate action (e.g. PR crises on social media). Text classifiers with machine learning can make accurate precisions in real-time that enable companies to identify critical information instantly and take action right away.

  • Consistent criteria

Human annotators make mistakes when classifying text data due to distractions, fatigue, and boredom. Other errors are generated due to inconsistent criteria. In contrast, machine learning applies the same lens and criteria to all of the data, thus allowing humans to reduce errors with centralized text classification models.

Text Classification Examples

Text classification can be used in a broad range of contexts such as classifying short texts (e.g. as tweets, headlines or tweets) or organizing much larger documents (e.g. customer reviews, media articles or legal contracts). Some of the most well-known examples of text classification include sentiment analysis, topic labeling, language detection, and intent detection.

Sentiment Analysis

Probably the most common example of text classification is sentiment analysis: the automated process of determining whether a text is positive, negative, or neutral. Companies are using sentiment classifiers on a wide range of applications, such as product analytics, brand monitoring, customer support, market research, workforce analytics, and much more.

This is a pre-trained classifier using MonkeyLearn for classifying text in English according to their sentiment. Feel free to experiment and try different expressions to see the classifier makes the predictions:

If you see an odd result, don’t worry, it’s probably because it hasn’t been trained (yet) with similar expressions. As an alternative, you can build a custom classifier for sentiment analysis and get more appropriate results for your data and criteria.

Topic Labeling

Another common example of text classification is topic labeling, that is, understanding what a given text is talking about. It’s often used for structuring and organizing data such as organizing customer feedback by its topic or organizing news articles according to their subject.

The following is a pre-trained model for classifying NPS responses for SaaS products according to their topic. It can tag customer feedback in categories such as Customer Support, Ease of Use, Features, and Pricing:

Language Detection

Language detection is another great example of text classification, that is, the process of classifying incoming text according to its language. These text classifiers are often used for routing purposes (e.g. route support tickets according to their language to the appropriate team).

The following is a classifier trained for detecting 49 different languages in text:

Intent Detection

Companies are also using text classifiers for automatically detecting the intent from customer conversations, often used for generating product analytics or automating business purposes.

For example, the following classifier was trained for detecting the intent from replies in outbound sales emails. It can classifier answers in tags such as Interested, Not Interested, Unsubscribe, Wrong Person, Email Bounce, and Autoresponder:

Takeaway

Text classification can be your new secret weapon for building cutting-edge systems and organizing business information. Turning your text data into quantitative data is incredibly helpful to get actionable insights that can drive business decisions. You can also automate manual and repetitive tasks and get more done.

Are you interested in creating your first text classifier? You can signup to MonkeyLearn for free and start experimenting right away. You can quickly create text classifiers with machine learning by using our easy-to-use UI (no coding required!) and put them to work by using our API or integrations.

Have questions? Reach out and we’ll help you get started with text classification.

Start using Text Classification today!

Automate business processes and save hours of manual data processing.