Text Classification with Machine Learning & NLP

What is Text Classification?

Unstructured data in the form of text is everywhere: social media, emails, chats, web pages, support tickets, survey responses, and more. Text can be an extremely rich source of information, but extracting insights from it can be hard and time-consuming, due to its unstructured nature.

Enter text classification: a process of assigning tags or categories to text according to its content. Text classification helps businesses automatically structure and analyze text, quickly and cost-effectively, with the aim of automating processes and enhancing data-driven decisions.

Read on to learn the basics of text classification, how it works, and how easy it is to get started with no-code tools like MonkeyLearn.

  1. What is Text Classification?
  2. How Does Text Classification Work?
  3. Text Classification Examples
  4. Resources

What is Text Classification?

Text classification (a.k.a. text categorization or text tagging) is the task of assigning a set of predefined categories to open-ended text. Text classifiers can be used to organize, structure, and categorize pretty much any kind of text, from documents, medical studies, and files to content from all over the web. For example, news articles can be organized by topic; support tickets can be organized by urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment; and so on.

Text classification is one of the fundamental tasks in natural language processing with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Here’s an example of how it works:

“The user interface is quite straightforward and easy to use.”

A text classifier can take this phrase as an input, analyze its content, and then automatically assign relevant tags, such as UI and Easy To Use.

Input text is processed by a text classification model and delivers output tags.

How Does Text Classification Work?

Text classification can be done in two different ways: manually or automatically. In the former, a human annotator interprets the content of text and categorizes it accordingly. This method can provide good results but it’s time-consuming and expensive. The latter applies machine learning, natural language processing (NLP), and other AI-guided techniques to automatically classify text in a faster, more cost-effective, and more accurate manner.

There are many approaches to automatic text classification, but they all fall under three types of systems:

  • Rule-based systems
  • Machine learning-based systems
  • Hybrid systems

Rule-based systems

Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content. Each rule consists of an antecedent or pattern and a predicted category.

Say that you want to classify news articles into two groups: Sports and Politics. First, you’ll need to define two lists of words that characterize each group (e.g., words related to sports such as football, basketball, LeBron James, etc., and words related to politics, such as Donald Trump, Hillary Clinton, Putin, etc.). Next, when you want to classify a new incoming text, you’ll need to count the number of sport-related words that appear in the text and do the same for politics-related words. If the number of sports-related word appearances is greater than the politics-related word count, then the text is classified as Sports and vice versa.

For example, this rule-based system will classify the headline “When is LeBron James' first game with the Lakers?” as Sports because it counted one sports-related term (LeBron James) and it didn’t count any politics-related terms.
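The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a production rule engine: the two keyword lists are hypothetical and hold only the terms named in the example, and the "Unknown" result for ties is an assumption, since the description doesn't say what happens when both counts are equal.

```python
import re

# Hypothetical keyword lists (only the terms named in the example above).
SPORTS_TERMS = ["football", "basketball", "lebron james"]
POLITICS_TERMS = ["donald trump", "hillary clinton", "putin"]


def count_terms(text, terms):
    """Count whole-word, case-insensitive occurrences of the given terms."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in terms
    )


def classify(text):
    """Return the category whose terms appear most often in the text."""
    sports = count_terms(text, SPORTS_TERMS)
    politics = count_terms(text, POLITICS_TERMS)
    if sports > politics:
        return "Sports"
    if politics > sports:
        return "Politics"
    return "Unknown"  # assumption: ties and texts with no matches stay untagged


headline = "When is LeBron James' first game with the Lakers?"
print(classify(headline))  # Sports
```

Matching on word boundaries keeps a term like "putin" from firing inside an unrelated word such as "computing", which hints at why rule sets grow complicated quickly.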

Rule-based systems are human comprehensible and can be improved over time. But this approach has some disadvantages. For starters, these systems require deep knowledge of the domain. They are also time-consuming, since generating rules for a complex system can be quite challenging and usually requires a lot of analysis and testing. Rule-based systems are also difficult to maintain and don’t scale well given that adding new rules can affect the results of the pre-existing rules.

Machine learning-based systems

Instead of relying on manually crafted rules, machine learning text classification learns to make classifications based on past observations. By using pre-labeled examples as training data, machine learning algorithms can learn the different associations between pieces of text, and that a particular output (i.e., tags) is expected for a particular input (i.e., text). A “tag” is the pre-determined classification or category that any given text could fall into.

The first step towards training a machine learning NLP classifier is feature extraction: a method is used to transform each text into a numerical representation in the form of a vector. One of the most frequently used approaches is bag of words, where each element of the vector represents the frequency of one word from a predefined dictionary of words.

For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize the text “This is awesome,” we would have the following vector representation of that text: (1, 1, 0, 0, 1, 0, 0).
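As a rough sketch, the vectorization above can be reproduced with a short function. The lowercasing and the simple regular-expression tokenizer are assumptions made for illustration; real feature extractors usually do more preprocessing.

```python
import re

# The dictionary from the example above, lowercased for matching.
DICTIONARY = ["this", "is", "the", "not", "awesome", "bad", "basketball"]


def vectorize(text, dictionary=DICTIONARY):
    """Return a bag-of-words vector: one word count per dictionary entry."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tokens.count(word) for word in dictionary]


print(vectorize("This is awesome"))  # [1, 1, 0, 0, 1, 0, 0]
```

Words that are not in the dictionary are simply ignored, so the vector always has the same length regardless of the input text.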

Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets (vectors for each text example) and tags (e.g. sports, politics) to produce a classification model:

Training process in Text Classification

Once it’s trained with enough training samples, the machine learning model can begin to make accurate predictions. The same feature extractor is used to transform unseen text to feature sets, which can be fed into the classification model to get predictions on tags (e.g., sports, politics):

Prediction process in Text Classification
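The two phases can be sketched end to end as follows. To keep the example short, the model here is a nearest-centroid classifier, a deliberately simple stand-in and not one of the algorithms discussed in this guide; the training texts, tags, and function names are hypothetical. The point it illustrates is that the same feature extractor is used for both training and prediction.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts):
    """Map every word seen in the training texts to a vector position."""
    words = sorted({word for text in texts for word in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def extract_features(text, vocabulary):
    """Bag-of-words vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


def train(texts, tags):
    """Training: average the feature vectors of each tag into a centroid."""
    vocabulary = build_vocabulary(texts)
    sums, counts = {}, {}
    for text, tag in zip(texts, tags):
        vector = extract_features(text, vocabulary)
        previous = sums.get(tag, [0] * len(vocabulary))
        sums[tag] = [a + b for a, b in zip(previous, vector)]
        counts[tag] = counts.get(tag, 0) + 1
    centroids = {
        tag: [value / counts[tag] for value in total]
        for tag, total in sums.items()
    }
    return vocabulary, centroids


def predict(text, model):
    """Prediction: same extractor, then pick the closest centroid."""
    vocabulary, centroids = model
    vector = extract_features(text, vocabulary)

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda tag: squared_distance(vector, centroids[tag]))


# Hypothetical training data.
texts = [
    "the team won the basketball game",
    "a great football match",
    "the senate passed the bill",
    "the president gave a speech",
]
tags = ["sports", "sports", "politics", "politics"]

model = train(texts, tags)
print(predict("who won the football game", model))  # sports
```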

Text classification with machine learning is usually much more accurate than human-crafted rule systems, especially on complex NLP classification tasks. Also, classifiers with machine learning are easier to maintain and you can always tag new examples to learn new tasks.

Machine Learning Text Classification Algorithms

Some of the most popular text classification algorithms include the Naive Bayes family of algorithms, support vector machines (SVM), and deep learning.

Naive Bayes

The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text analysis, overall.

One member of that family is Multinomial Naive Bayes (MNB), which has a huge advantage: you can get really good results even when your dataset isn’t very large (around a couple of thousand tagged samples) and computational resources are scarce.

Naive Bayes is based on Bayes’s Theorem, which helps us compute the conditional probabilities of the occurrence of two events, based on the probabilities of the occurrence of each individual event. So we’re calculating the probability of each tag for a given text, and then outputting the tag with the highest probability.

The probability of A, if B is true, is equal to the probability of B, if A is true, times the probability of A being true, divided by the probability of B being true.

This means that any vector that represents a text will have to contain information about the probabilities of the appearance of certain words within the texts of a given category, so that the algorithm can compute the likelihood of that text’s belonging to the category.
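A minimal sketch of a multinomial Naive Bayes classifier shows how these probabilities are used. The training texts and function names are hypothetical, and two common implementation choices are assumed that the text above doesn't mention: add-one (Laplace) smoothing, so unseen words don't produce a zero probability, and log-probabilities, so multiplying many small numbers doesn't underflow.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(texts, tags):
    """Count documents per tag and word occurrences per tag."""
    tag_counts = Counter(tags)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, tag in zip(texts, tags):
        tokens = tokenize(text)
        word_counts[tag].update(tokens)
        vocabulary.update(tokens)
    return tag_counts, word_counts, vocabulary


def predict_naive_bayes(text, model):
    """Return the tag with the highest log P(tag) + sum of log P(word | tag)."""
    tag_counts, word_counts, vocabulary = model
    total_docs = sum(tag_counts.values())
    tokens = [token for token in tokenize(text) if token in vocabulary]
    scores = {}
    for tag, doc_count in tag_counts.items():
        score = math.log(doc_count / total_docs)  # prior P(tag)
        total_words = sum(word_counts[tag].values())
        for token in tokens:
            # Add-one smoothing (an assumption, see the note above).
            likelihood = (word_counts[tag][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[tag] = score
    return max(scores, key=scores.get)


# Hypothetical training data.
texts = [
    "the team won the basketball game",
    "a great football match",
    "the senate passed the bill",
    "the president gave a speech",
]
tags = ["sports", "sports", "politics", "politics"]

model = train_naive_bayes(texts, tags)
print(predict_naive_bayes("football game", model))  # sports
```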

Take a look at this blog post to learn more about Naive Bayes.

Support Vector Machines

Support Vector Machines (SVM) is another powerful machine learning algorithm for text classification because, like Naive Bayes, it doesn’t need much training data to start providing accurate results. SVM does, however, require more computational resources than Naive Bayes, but its results are often even more accurate.

In short, SVM draws a line or “hyperplane” that divides a space into two subspaces. One subspace contains vectors (tags) that belong to a group, and another subspace contains vectors that do not belong to that group.

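The optimal hyperplane is the one with the largest distance (margin) between it and the nearest vectors of each group. In two dimensions it looks like this:

Optimal hyperplane separating two groups of vectors in two dimensions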

Those vectors are representations of your training texts, and a group is a tag you have tagged your texts with.

As data gets more complex, it may not be possible to separate the vectors into two groups with a straight line. So, it looks like this:

Vectors of two groups that cannot be separated by a straight line

But that’s the great thing about SVM algorithms: they’re “multi-dimensional.” The vectors can be mapped into a space with more dimensions, where a separating hyperplane may be easier to find. Imagine the above in three dimensions, with an added Z-axis. Mapped back to two dimensions, the ideal hyperplane looks like a circle:

Circular boundary separating the two groups in two dimensions
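The idea of adding a dimension can be illustrated with a toy example. The points below are hypothetical, and the code does not train an SVM: it only shows the geometry. Two groups that no straight line can separate in two dimensions become separable by a flat plane once a Z-axis is added, here assumed to be z = x² + y². The plane is placed by hand halfway between the two groups, which is the position a maximum-margin separator would aim for along that axis.

```python
# Hypothetical 2D points: one group near the origin, the other surrounding it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4), (0.2, 0.2)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.4, 1.6)]


def lift(point):
    """Map a 2D point to 3D by adding z = x^2 + y^2 (an assumed mapping)."""
    x, y = point
    return (x, y, x * x + y * y)


# In 3D, a horizontal plane z = threshold splits the groups. It is placed
# halfway between the highest inner point and the lowest outer point.
highest_inner = max(lift(p)[2] for p in inner)
lowest_outer = min(lift(p)[2] for p in outer)
threshold = (highest_inner + lowest_outer) / 2


def classify(point):
    return "outer" if lift(point)[2] > threshold else "inner"


print([classify(p) for p in inner + outer])
```

Mapped back to two dimensions, the plane z = threshold is the circle x² + y² = threshold, which is why the boundary looks circular.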

Deep Learning

Deep learning is a set of algorithms and techniques, called neural networks, inspired by how the human brain works. Deep learning architectures offer huge benefits for text classification because they can reach very high accuracy with less manual feature engineering.

The two main deep learning architectures for text classification are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

Deep learning is hierarchical machine learning, using multiple algorithms in a progressive chain of events. It’s similar to how the human brain works when making decisions, using different techniques simultaneously to process huge amounts of data.

Deep learning algorithms do require much more training data than traditional machine learning algorithms (often millions of tagged examples). However, unlike traditional machine learning algorithms such as SVM and Naive Bayes, they don’t hit a threshold for learning from training data. Deep learning classifiers continue to get better the more data you feed them:

Deep Learning vs Traditional Machine Learning algorithms

Word embedding algorithms like Word2Vec or GloVe are also used to obtain better vector representations for words and improve the accuracy of classifiers trained with traditional machine learning algorithms.

Hybrid Systems

Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further improve the results. These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that haven’t been correctly modeled by the base classifier.
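One possible way to structure such a hybrid is sketched below. Everything here is hypothetical: the override rule, the tag names, and the toy base classifier, which merely stands in for a trained machine learning model.

```python
import re

# Hypothetical override rules: (pattern, tag). A rule wins over the base model.
OVERRIDE_RULES = [
    (re.compile(r"\brefund\b", re.IGNORECASE), "Billing"),
]


def hybrid_classify(text, base_classifier, rules=OVERRIDE_RULES):
    """Apply handcrafted rules first; fall back to the base classifier."""
    for pattern, tag in rules:
        if pattern.search(text):
            return tag
    return base_classifier(text)


def toy_base_classifier(text):
    """Stand-in for a trained model (an assumption for this sketch)."""
    return "Support" if "help" in text.lower() else "Other"


print(hybrid_classify("Please help me get a refund", toy_base_classifier))  # Billing
print(hybrid_classify("Please help me log in", toy_base_classifier))  # Support
```

Checking the rules before the model is a design choice; a system could equally apply rules only when the model's confidence is low.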

Metrics and Evaluation

Cross-validation is a common method to evaluate the performance of a text classifier. It works by splitting the training dataset into random, equal-length example sets (e.g., 4 sets with 25% of the data). For each set, a text classifier is trained with the remaining samples (e.g., 75% of the samples). Next, the classifiers make predictions on their respective sets, and the results are compared against the human-annotated tags. This will determine when a prediction was right (true positives and true negatives) and when it made a mistake (false positives, false negatives).
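The splitting procedure can be sketched as follows. The function names are hypothetical, and the majority-class "classifier" is only a placeholder to show how any training and prediction functions would plug in.

```python
import random
from collections import Counter


def k_fold_splits(n_samples, k=4, seed=0):
    """Yield (train_indices, test_indices) for k random, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(texts, tags, train_fn, predict_fn, k=4, seed=0):
    """Train on k-1 folds, predict on the held-out fold, return accuracy."""
    correct = 0
    for train_idx, test_idx in k_fold_splits(len(texts), k, seed):
        model = train_fn([texts[i] for i in train_idx], [tags[i] for i in train_idx])
        correct += sum(predict_fn(texts[i], model) == tags[i] for i in test_idx)
    return correct / len(texts)


# Placeholder classifier: always predicts the most common training tag.
def train_majority(texts, tags):
    return Counter(tags).most_common(1)[0][0]


def predict_majority(text, model):
    return model


texts = ["sample text %d" % i for i in range(8)]
tags = ["sports"] * 5 + ["politics"] * 3
print(cross_validate(texts, tags, train_majority, predict_majority))
```

Every example ends up in exactly one test fold, so each prediction is made by a model that never saw that example during training.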

With these results, you can build performance metrics that are useful for a quick assessment on how well a classifier works:

  • Accuracy: the percentage of texts that were categorized with the correct tag.
  • Precision: the percentage of examples the classifier got right out of the total number of examples that it predicted for a given tag.
  • Recall: the percentage of examples the classifier predicted for a given tag out of the total number of examples it should have predicted for that given tag.
  • F1 Score: the harmonic mean of precision and recall.
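These four metrics can be computed directly from the true and predicted tags. The sketch below uses hypothetical data and a hypothetical function name; returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions above specify.

```python
def evaluate(true_tags, predicted_tags, tag):
    """Accuracy over all tags, plus precision, recall and F1 for one tag."""
    pairs = list(zip(true_tags, predicted_tags))
    true_pos = sum(t == tag and p == tag for t, p in pairs)
    false_pos = sum(t != tag and p == tag for t, p in pairs)
    false_neg = sum(t == tag and p != tag for t, p in pairs)

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical human-annotated tags and classifier predictions.
true_tags = ["sports", "sports", "sports", "politics", "politics", "politics"]
predicted = ["sports", "sports", "politics", "sports", "politics", "politics"]

print(evaluate(true_tags, predicted, "sports"))
```

In this data the classifier predicted "sports" three times and was right twice (precision 2/3), and it found two of the three real "sports" examples (recall 2/3).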

Why is Text Classification Important?

It’s estimated that around 80% of all information is unstructured, with text being one of the most common types of unstructured data. Because of the messy nature of text, analyzing, understanding, organizing, and sorting through text data is hard and time-consuming, so most companies fail to use it to its full potential.

This is where text classification with machine learning comes in. Using text classifiers, companies can automatically structure all manner of relevant text, from emails, legal documents, social media, chatbots, surveys, and more in a fast and cost-effective way. This allows companies to save time analyzing text data, automate business processes, and make data-driven business decisions.

Why use machine learning text classification? Some of the top reasons:

  • Scalability

Manually analyzing and organizing text is slow and much less accurate. Machine learning can automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few minutes. Text classification tools are scalable to any business needs, large or small.

  • Real-time analysis

There are critical situations that companies need to identify as soon as possible and take immediate action (e.g., PR crises on social media). Machine learning text classification can follow your brand mentions constantly and in real time, so you'll identify critical information and be able to take action right away.

  • Consistent criteria

Human annotators make mistakes when classifying text data due to distractions, fatigue, and boredom, and human subjectivity creates inconsistent criteria. Machine learning, on the other hand, applies the same lens and criteria to all data and results. Once a text classification model is properly trained, it applies those criteria consistently to every text it processes.

Text Classification Examples

Text classification can be used in a broad range of contexts such as classifying short texts (e.g., tweets, headlines, chatbot queries, etc.) or organizing much larger documents (e.g., customer reviews, news articles, legal contracts, longform customer surveys, etc.). Some of the most well-known examples of text classification include sentiment analysis, topic labeling, language detection, and intent detection.

Sentiment Analysis

Perhaps the most popular example of text classification is sentiment analysis (or opinion mining): the automated process of reading a text for opinion polarity (positive, negative, neutral, and beyond). Companies use sentiment classifiers on a wide range of applications, like product analytics, brand monitoring, market research, customer support, workforce analytics, and much more.

Sentiment analysis allows you to automatically analyze all forms of text for the feeling and emotion of the writer.

Try out this pre-trained sentiment classifier with your own text to see just how easy it is to do.

Results

Tag        Confidence
Positive   99.9%

If you see an odd result, don’t worry, it’s just because it hasn’t been trained (yet) with similar expressions. For super accurate results trained to the specific language and criteria of your business, follow this quick sentiment analysis tutorial to build a custom sentiment analysis model in just five steps.

Topic Labeling

Another common example of text classification is topic labeling, that is, understanding what a given text is talking about. It’s often used for structuring and organizing data, such as organizing customer feedback by topic or organizing news articles by subject.

Try out this pre-trained model for classifying NPS responses for SaaS products according to their topic. It tags customer feedback by categories: Customer Support, Ease of Use, Features, and Pricing:

Results

Tag                Confidence
Customer Support   84.5%

Learn more about topic labeling and how to build a custom multi-label text classifier.

Language Detection

Language detection is another great example of text classification, that is, the process of classifying incoming text according to its language. These text classifiers are often used for routing purposes (e.g., route support tickets according to their language to the appropriate team).

The following is a classifier trained for detecting 49 different languages in text:

Results

Tag          Confidence
English-en   100.0%

Intent Detection

Intent detection or intent classification is another great use case for text classification that analyzes text to understand the reason behind feedback. Maybe it’s a complaint, or maybe a customer is expressing intent to purchase a product. It’s used for customer service, marketing email responses, generating product analytics, and automating business practices. Intent detection with machine learning can read emails and chatbot conversations and automatically route them to the correct department.

Try out this email intent classifier that’s trained to detect the intent of email replies. It classifies with tags: Interested, Not Interested, Unsubscribe, Wrong Person, Email Bounce, and Autoresponder:

Results

Tag          Confidence
Interested   100.0%

Text Classification Applications

Text classification has thousands of use cases and is applied to a wide range of tasks. In some cases, data classification tools work behind the scenes to enhance app features we interact with on a daily basis (like email spam filtering). In some other cases, classifiers are used by marketers, product managers, engineers, and salespeople to automate business processes and save hundreds of hours of manual data processing.

Some of the top applications for text classification:

  • Social media monitoring
  • Brand monitoring
  • Customer service
  • Voice of customer (VoC)

Social Media Monitoring

According to Hootsuite, nearly half of Americans have interacted with companies or institutions on at least one of their social media networks. All of these interactions represent a lot of actionable insights for any business. On Twitter alone, users send 500 million tweets every day. Furthermore, surveys show that 83% of customers who comment or complain on social media expect a response the same day, with 18% expecting it to come immediately. What are people complaining about when they mention a particular brand? What are they praising? How have they reacted to a specific message or campaign?

The answers to these questions can be found within the sea of data available on social media, but without the help of machine learning text analysis, making sense of all this data would be extremely costly, time-consuming, and much less accurate – if even possible. Fortunately, with machine learning you can analyze social media data in a scalable, cost-effective, downright easy way. For example, you can leverage aspect-based sentiment analysis over a period of time to understand what people are talking about on social media and the feelings and emotions behind these conversations, and to track trends over time. Use text classification to get actionable insights, like:

  • Detect potential PR crises about to burst;
  • Keep an eye on the competition and detect sales opportunities when a customer complains about a competitor’s product or service on social media;
  • Detect complaints about downtime or bugs on social media and alert the product team;
  • Identify social media comments seeking help and automatically route them to the support team.

Example: A machine learning analysis of the Brexit result

When the UK voted to leave the European Union, we decided to analyze conversations on Twitter by gathering more than 450,000 tweets containing the #Brexit hashtag.

Using text classification techniques, sentiment analysis and keyword extraction, we were able to confirm that opinions were extremely divided, with slightly more people talking negatively about the results:

Sentiment from tweets with the #Brexit hashtag

On the one hand, those that tweeted ‘positively’ about Brexit were saying it was a “good thing,” and were happy about the “independence of the UK”:

Positive keywords from tweets with the #Brexit hashtag

On the other hand, those that expressed negative sentiments mentioned that this was a "sad day" and focused on the "people".

Negative keywords from tweets with the #Brexit hashtag

The results also uncovered very negative comments about UK politicians, such as David Cameron (48% more negative tweets than positive) and pro-Brexit politician, Nigel Farage (272% more negative tweets than positive). Even Donald Trump was part of the Brexit conversation with a very polarized sentiment, with 2808 positive tweets and 3208 negative tweets.

Brand Monitoring

Online conversations around a brand and its competitors heavily influence consumers. Some blogs, forums, review sites, and influencers are becoming even more important than traditional outlets. According to MineWhat, 81% of buyers conduct online research before making a purchase. Consumers care immensely about what people are saying online about a brand – BrightLocal states that 85% of consumers trust online reviews as much as personal recommendations.

So, online conversation matters – and that's why you need to create and maintain a process that keeps a close watch on your brand mentions, extracts insights to help drive decisions, and allows you to take action when needed.

With the help of automatic data classification tools, you can categorize brand mentions all over the internet to find more about the following topics:

  • Features: Are people talking about a particular aspect of your product or service?
  • Wishes: Are your customers expressing particular desires?
  • Price: Is your brand perceived as good value for money? Is it considered cheap or expensive?
  • Use cases: How do your customers use your product?
  • Competitors: How does your brand compare to competitors? What are your strengths and weaknesses?

You can create custom text classifiers to help you identify these topics every time someone shares something about your brand, 24/7 and in real time. Moreover, you can combine these topic classifiers with sentiment analysis models to get a real-time thermometer about your online presence.

Example: Sentiment analysis of Slack reviews

We scraped more than 4,500 Slack reviews from Capterra and used text classification to understand which aspects users love or hate about Slack. The results revealed that customers are satisfied with Slack overall, with most company reviews containing positive sentiments:

Sentiment breakdown for Slack reviews

We also performed aspect-based sentiment analysis on the reviews to understand which aspects people are praising or complaining about. The results showed that users are very positive about the software’s Ease of use, Integrations, and File Sharing System, but more negative about the Search Tool, Notifications, Pricing, and Performance, Quality and Reliability:

Sentiment breakdown per aspect of Slack reviews

Customer Service

Building a good customer experience is one of the foundations of a sustainable and growing company. According to Hubspot, people are 93% more likely to be repeat customers at companies with excellent customer service. The study also unveiled that 80% of respondents said they had stopped doing business with a company because of a poor customer experience.

Text classification can help support teams provide a stellar experience by automating tasks that are better left to computers, saving precious time that can be spent on more important things.

For instance, text classification is often used for automating ticket routing and triaging. Imagine a global company that provides customer support in several languages; this involves the process of assigning tickets based on the ticket’s language. To do this, a person is needed to manually assign the ticket to the correct team who can understand and reply to the customer in the right language. With text classification, instead of using humans you can use a language detection classifier to do this task for you.

Text classification can also be used for routing support tickets to a teammate with specific product expertise. For instance, if a customer writes in asking about refunds, you can automatically assign the ticket to the teammate with permission to perform refunds. This will ensure the customer gets a quality response more quickly. Without the need to triage every single ticket, support teams can work more efficiently and reduce response times. You’ll always know that tickets have been automatically routed to the most appropriate team.

Support teams can also use text classification to automatically detect the urgency of a support ticket and prioritize accordingly. By using machine learning to set priorities, you can ensure your team is always working on the most urgent tickets, every time.

Companies are also leveraging text classification for getting insights from support conversations, thus improving their reporting and analytics. You can even use customer complaint classification to find complaints wherever they may exist, from all over the web.

Example: Analyzing customer support interactions on Twitter

Knowing how to talk to customers on social media isn’t always easy.

To find out which approach worked best, we used keyword extraction and sentiment analysis to analyze 200,000+ customer support interactions on Twitter from competing mobile carriers: Verizon, T-Mobile, AT&T, and Sprint.

First, we analyzed the most relevant keywords in all these tweets to find out the type of language each carrier uses. For instance, T-Mobile’s customer service is friendlier and more personal, and every support agent signs off each message with their name. Verizon tweets, on the other hand, sound very formal.

Next, we performed sentiment analysis on the Twitter data and discovered that T-Mobile’s customer support team elicits more positive responses, and fewer negative responses, overall:

Percentage of positive tweets per carrier

Percentage of negative tweets per carrier

Voice of Customer (VoC)

Companies leverage surveys such as Net Promoter Score to listen to the voice of their customers at every stage of the journey.

The information gathered is both qualitative and quantitative, and while NPS scores are easy to analyze, open-ended responses require a more in-depth analysis using text classification techniques. Instead of relying on humans to analyze voice of customer data, you can quickly process open-ended customer feedback with machine learning. Classification models can help you analyze survey results to discover patterns and insights like:

  • What do people like about our product or service?
  • What should we improve?
  • What do we need to change?

By combining both quantitative results and qualitative analyses, teams can make more informed decisions without having to spend hours manually analyzing every single open-ended response.

Example: How Retently Automated Customer Feedback Analysis Using MonkeyLearn

Retently wanted to figure out what was driving their NPS score to help them prioritize their product roadmap. But manually sorting through their open-ended feedback was taking far too long, so they turned to text classification to automate this process.

After training a classifier using their own data and criteria, they were able to classify open-ended responses using the following tags:

Tags used by Retently to tag NPS feedback

Excited about the results of the classifier, Retently decided to implement a new reporting system that showed customers’ priorities in their own words:

NPS response tags analysis

This new report system allowed Retently to discover actionable insights that have helped them drive strategic decisions and provide a better customer experience.

Text Classification Resources

Once you start to automate manual and repetitive tasks using all manner of text classification techniques, you can focus on other areas of your business.

But… how the heck do you get started with text classification? There’s so much information about text analysis, machine learning, and natural language processing that it can be overwhelming.

At MonkeyLearn, we make it easy for you to know where to start. We provide a no-code text classifier builder, so you can build your very own text classifier in a few simple steps.

Building your first text classifier can help you really understand the benefits of text classification, but before we go into more detail about what MonkeyLearn can do, let’s take a look at what you’ll need to create your own text classification model:

1. Datasets

A text classifier is worthless without accurate training data. Machine learning algorithms can only make accurate predictions by learning from previous examples.

You show an algorithm examples of correctly tagged data, and it uses that tagged data to make predictions on unseen text.

Say you want to predict the intent of chat conversations; you’ll need to identify and gather chat conversations that represent the different intents you want to predict. If you train your model with another type of data, the classifier will provide poor results.

So, how do you get training data?

You can use internal data generated from the apps and tools you use every day, like CRMs (e.g. Salesforce, Hubspot), chat apps (e.g. Slack, Drift, Intercom), help desk software (e.g. Zendesk, Freshdesk, Front), survey tools (e.g. SurveyMonkey, Typeform, Google Forms), and customer satisfaction tools (e.g. Promoter.io, Retently, Satismeter). These tools usually provide an option to export data in a CSV file that you can use to train your classifier.
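As a rough sketch, an exported CSV file can be turned into training examples with Python's standard csv module. The column names "text" and "tag" and the sample rows are assumptions made for illustration; real exports will use whatever headers the tool produces, so the names are parameters.

```python
import csv
import io

# Hypothetical export; in practice this would be a file opened with open(...).
SAMPLE_CSV = """text,tag
"The app is easy to use",Ease of Use
"Support never replied to my email",Customer Support
"Too expensive for small teams",Pricing
"Row without a tag",
"""


def load_training_data(csv_file, text_column="text", tag_column="tag"):
    """Read (text, tag) pairs from a CSV file, skipping incomplete rows."""
    texts, tags = [], []
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").strip()
        tag = (row.get(tag_column) or "").strip()
        if text and tag:
            texts.append(text)
            tags.append(tag)
    return texts, tags


texts, tags = load_training_data(io.StringIO(SAMPLE_CSV))
print(len(texts), tags)
```

Skipping rows with an empty text or tag is a simple safeguard, since untagged rows can't be used as training examples.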

Another option is using external data from throughout the web, either by using web scraping, APIs, or public datasets.

The following are some publicly available datasets you can use for building your first text classifier and start experimenting right away.

Topic classification:

  • Reuters news dataset: probably one of the most widely used datasets for text classification; it contains 21,578 news articles from Reuters labeled with 135 categories according to their topic, such as Politics, Economics, Sports, and Business.

  • 20 Newsgroups: another popular dataset that consists of ~20,000 documents across 20 different topics.

Sentiment analysis:

  • Amazon Product Reviews: a well-known dataset that contains ~143 million reviews and star ratings (1 to 5 stars) spanning May 1996 - July 2014. You can get an alternative dataset for Amazon product reviews here.

  • IMDB reviews: a much smaller dataset with 25,000 movie reviews labeled as positive and negative from the Internet Movie Database (IMDB).

  • Twitter Airline Sentiment: this dataset contains around 15,000 tweets about airlines labeled as positive, neutral, and negative.

Other popular datasets:

  • Spambase: a dataset with 4,601 emails labeled as spam and not spam.

  • SMS Spam Collection: another dataset for spam detection that consists of 5,574 SMS messages tagged as spam or legitimate.

  • Hate speech and offensive language: this dataset contains 24,802 labeled tweets organized into three categories: clean, hate speech, and offensive language.

2. Text Classification Tools

Alright. Now that you have training data, it's time to feed it to a machine learning algorithm and create a text classifier.

So, how do we do this?

Luckily, many resources can help you during the different phases of the process, i.e. transforming texts into vectors, training a machine learning algorithm, and using a model to make predictions. Broadly speaking, these tools can be classified into two different categories:

  • Open-source libraries
  • SaaS tools

It’s an ongoing debate: Build vs. Buy. Open-source libraries can perform among the upper echelon of machine learning text classification tools, but they’re costly and time-consuming to build and require years of data science and computer engineering experience.

SaaS tools, on the other hand, require little to no code, are completely scalable and much less costly, as you only use the tools you need. Best of all, most can be implemented right away and trained (often in just a few minutes) to perform just as fast and accurately.

Open-source libraries for text classification

One of the reasons machine learning has become mainstream is the myriad of open source libraries available to developers interested in applying it. Although they require a hefty data science and machine learning background, these libraries offer a fair level of abstraction and simplification. Python, Java, and R all offer a wide selection of machine learning libraries that are actively developed and provide a diverse set of features, performance, and capabilities.

Text Classification with Python

Python is usually the programming language of choice for developers and data scientists who work with machine learning models. Its simple syntax, massive community, and the scientific-computing friendliness of its mathematical libraries are some of the reasons why Python is so prevalent in the field.

Scikit-learn is one of the go-to libraries for general purpose machine learning. It supports many algorithms and provides simple and efficient features for working with text classification, regression, and clustering models. If you are a beginner in machine learning, scikit-learn is one of the most friendly libraries for getting started with text classification, with dozens of tutorials and step-by-step guides all over the web.
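To make this concrete, here's a minimal sketch of a scikit-learn text classifier. It assumes scikit-learn is installed. The example texts and the UI and Support tags are made up for illustration, and pairing TF-IDF vectors with logistic regression is just one reasonable combination, not the only way to do it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set; a real project needs far more examples per tag.
train_texts = [
    "The interface is clean and easy to use",
    "Navigation is intuitive and the design looks great",
    "The dashboard layout is simple and friendly",
    "Support never answered my ticket",
    "The help desk took days to reply to my email",
    "Customer service was slow and unhelpful",
]
train_tags = ["UI", "UI", "UI", "Support", "Support", "Support"]

# The pipeline turns texts into TF-IDF vectors, then trains the classifier on them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_tags)

# Use the trained model to tag new, unseen texts.
new_texts = [
    "The design is easy to use",
    "Nobody from support replied to my ticket",
]
predicted_tags = model.predict(new_texts).tolist()
print(predicted_tags)
```

The same three phases described earlier show up here: the vectorizer transforms texts into vectors, `fit` trains the algorithm, and `predict` applies the model to new text.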

NLTK is a popular library focused on natural language processing (NLP) that has a big community behind it. It's super handy for text classification because it provides all kinds of useful tools for making a machine understand text, such as splitting paragraphs into sentences, splitting up words, and recognizing the part of speech of those words.
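As a rough illustration, the sketch below uses NLTK to split a text into words and reduce each word to its stem, a common preprocessing step before classification. It assumes NLTK is installed; `preprocess` is a helper name made up for this example. NLTK's sentence splitter and part-of-speech tagger need extra data packages to be downloaded first, so they are left out here.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stemmer = PorterStemmer()


def preprocess(text):
    """Split text into word tokens and reduce each one to its stem."""
    tokens = wordpunct_tokenize(text.lower())
    # Keep alphabetic tokens only, which drops punctuation and numbers.
    return [stemmer.stem(token) for token in tokens if token.isalpha()]


stems = preprocess("The app keeps crashing, and it crashes on startup too.")
print(stems)
```

Stemming maps variants such as "crashing" and "crashes" to the same token, so the classifier sees them as one feature instead of two.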

A newer NLP library is spaCy, a toolkit with a more minimal and straightforward approach than NLTK. For example, where NLTK offers several stemmers to choose from, spaCy skips stemming altogether and provides a single lemmatizer instead. spaCy also has integrated word embeddings, which can help boost accuracy in text classification.

Once you are ready to experiment with more complex algorithms, you should check out deep learning libraries like Keras, TensorFlow, and PyTorch. Keras is probably the best starting point as it's designed to simplify the creation of recurrent neural networks (RNNs) and convolutional neural networks (CNNs).

TensorFlow is the most popular open source library for implementing deep learning algorithms. Developed by Google and used by companies such as Dropbox, eBay, and Intel, this library is optimized for setting up, training, and deploying artificial neural networks with massive datasets. Although it’s harder to master than Keras, it’s widely regarded as a leader in the deep learning space. A reliable alternative to TensorFlow is PyTorch, an extensive deep learning library primarily developed by Facebook and backed by Twitter, Nvidia, Salesforce, Stanford University, University of Oxford, and Uber.

Text Classification with Java

Another programming language that is broadly used for implementing machine learning models is Java. Like Python, it has a big community, an extensive ecosystem, and a great selection of open source libraries for machine learning and NLP.

CoreNLP is the most popular framework for NLP in Java. Created by Stanford University, it provides a diverse set of tools for understanding human language such as a text parser, a part-of-speech (POS) tagger, a named entity recognizer (NER), a coreference resolution system, and information extraction tools.

Another popular toolkit for natural language tasks is OpenNLP. Created by The Apache Software Foundation, it provides a bunch of linguistic analysis tools useful for text classification such as tokenization, sentence segmentation, part-of-speech tagging, chunking, and parsing.

Weka is a machine learning library developed by the University of Waikato that contains tools for tasks like classification, regression, clustering, and data visualization. It provides a graphical user interface for applying Weka’s collection of algorithms directly to a dataset, and an API to call these algorithms from your own Java code.

Text Classification with R

The R language is an approachable programming language that is becoming increasingly popular among machine learning enthusiasts. Historically, it has been most widely used among academics and statisticians for statistical analysis, graphical representation, and reporting. According to KDnuggets, it’s currently the second most popular programming language for analytics, data science, and machine learning (while Python is #1).

R is an excellent choice for text classification tasks as it provides an extensive, coherent, and integrated collection of tools for data analysis.

Caret is a comprehensive package for building machine learning models in R. Short for “Classification and Regression Training,” it offers a simple interface for applying different algorithms and contains useful tools for text classification, like pre-processing, feature selection, and model tuning.

Mlr is another R package that provides a standardized interface for using classification and regression algorithms along with their corresponding evaluation and optimization methods.

SaaS Text Classification APIs

Open source tools are great, but they are mostly targeted at people with a background in machine learning. Also, they don’t provide an easy way to deploy and scale machine learning models, clean and curate data, tag training examples, do feature engineering, or bootstrap models.

You might be wondering, is there an easier way?

Well, if you want to avoid these hassles, a great alternative is to use a Software as a Service (SaaS) tool for text classification, which usually solves most of the problems mentioned above. Another advantage is that these tools don’t require machine learning experience, and even people who don’t know how to code can use and consume text classifiers. At the end of the day, leaving the heavy lifting to a SaaS can save you time, money, and resources when implementing your text classification system.

Some of the most remarkable SaaS solutions and APIs for text classification include:

  • MonkeyLearn
  • Google Cloud NLP
  • IBM Watson
  • Lexalytics
  • MeaningCloud
  • Amazon Comprehend
  • Aylien

Text Classification Tutorial with MonkeyLearn

The best way to learn about text classification is to get your feet wet and build your first classifier. If you don’t want to invest too much time learning about machine learning or deploying the required infrastructure, you can use MonkeyLearn, a platform that makes it super easy to build, train, and consume text classifiers. And once you’ve built your classifier, you can see your results in striking detail with MonkeyLearn Studio. Sign up for free and build your own classifier following these four simple steps:

1. Create a new text classifier:

Go to the dashboard, then click Create a Model, and choose Classifier:

Create a text classifier and choose the model type

2. Upload training data:

Next, you’ll need to upload the data that you want to use as examples for training your model. You can upload a CSV or Excel file or import your text data directly from a 3rd party app such as Twitter, Gmail, Zendesk, or RSS feeds:

Import data to your classifier

3. Define the tags for your model:

The next step is to define the tags you want to use for your text classifier:

Define the tags for your classifier

Once the classifier has been trained, incoming data will be automatically categorized into the tags you specify in this step. Try to avoid overlapping or ambiguous tags, as they can confuse the classifier and reduce its accuracy.

4. Tag data to train the classifier:

Finally, you’ll need to tag each example with the expected category to start training the machine learning model:

Tag data to train your classifier

As you tag data, the classifier will learn to recognize similar patterns when presented with new text and make an accurate classification. Remember: as a rule, the more data you tag, the more accurate the model will be.

Testing the classifier

Once you’ve finished the creation wizard, you will be able to test the classifier in “Run” > “Demo” and see how the model classifies the texts you write:

Testing the classifier

MonkeyLearn provides some useful tools for understanding how well the model is working, such as classifier stats (e.g. accuracy, F1 score, precision, and recall) and a keyword cloud of n-grams for each category. There are multiple ways to improve the accuracy of your classifier, including tagging more training data, going through the false positives and false negatives and retagging the incorrectly labeled examples, and cleaning your data to disassociate keywords from a specific tag.
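If you're curious how stats like these are calculated, here's a small sketch in plain Python. It isn't MonkeyLearn's implementation, just the standard definitions applied to a made-up list of true and predicted tags; `classifier_stats` is a name chosen for this example.

```python
def classifier_stats(true_tags, predicted_tags, tag):
    """Return overall accuracy plus precision, recall, and F1 for one tag.

    Assumes both lists are non-empty and have the same length.
    """
    pairs = list(zip(true_tags, predicted_tags))
    true_pos = sum(1 for t, p in pairs if t == tag and p == tag)
    false_pos = sum(1 for t, p in pairs if t != tag and p == tag)
    false_neg = sum(1 for t, p in pairs if t == tag and p != tag)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up evaluation data: what a person tagged vs. what the model predicted.
true_tags = ["UI", "UI", "Support", "Support", "UI", "Support"]
predicted = ["UI", "Support", "Support", "Support", "UI", "UI"]

stats = classifier_stats(true_tags, predicted, tag="UI")
print(stats)
```

Here the false positive is the Support text that was tagged as UI, and the false negative is the UI text that was tagged as Support. Those are the examples worth reviewing and retagging first.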

Integrating the classifier

Once the predictions are good enough, the model will be ready to categorize new unseen text. MonkeyLearn provides different ways to achieve this: batch processing, API, or integrations.

You can upload a CSV or Excel file to classify text in a batch in “Run” > “Batch”:

Classify text in a batch with an Excel or CSV file

After uploading the file, the classifier will analyze the data and return a new file with the same data plus the predictions.
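Conceptually, batch processing looks something like the sketch below: read each row, classify its text, and write the row back out with a new prediction column. This is only an illustration of the idea, not how MonkeyLearn does it. The `classify` function is a hypothetical stand-in for a trained model, and the `text` and `prediction` column names are assumptions.

```python
import csv
import io


def classify(text):
    """Hypothetical stand-in for a trained classifier: a trivial keyword rule."""
    return "Support" if "support" in text.lower() else "Other"


def classify_batch(in_file, out_file, text_column="text"):
    """Copy every input row to the output and append a 'prediction' column."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + ["prediction"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row["prediction"] = classify(row[text_column])
        writer.writerow(row)


# In-memory files keep the example self-contained; real files work the same way.
source = io.StringIO("id,text\n1,Support was slow\n2,Love the new design\n")
result = io.StringIO()
classify_batch(source, result)
print(result.getvalue())
```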

Alternatively, you can use the MonkeyLearn API to classify new data programmatically:

Classify new text programmatically via an API
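As a rough sketch, a classification request over HTTP could be prepared like this using only Python's standard library. The endpoint pattern, the `Token` authorization header, and the `data` field in the JSON body are assumptions about the API's shape, and the model ID and API key are placeholders, so check the API documentation for the exact details before relying on any of it. The example only builds the request; it doesn't send it.

```python
import json
import urllib.request

# Assumed values: replace with the endpoint, model ID, and key from your account.
API_BASE = "https://api.monkeylearn.com/v3"  # assumed base URL
MODEL_ID = "cl_XXXXXXXX"                     # hypothetical model ID
API_KEY = "YOUR_API_KEY"                     # placeholder, not a real key


def build_classify_request(texts):
    """Build (but do not send) a POST request asking the model to tag texts."""
    url = f"{API_BASE}/classifiers/{MODEL_ID}/classify/"
    body = json.dumps({"data": texts}).encode("utf-8")
    headers = {
        "Authorization": f"Token {API_KEY}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_classify_request(
    ["The user interface is quite straightforward and easy to use."]
)

# Sending it needs network access and a valid key:
# with urllib.request.urlopen(request) as response:
#     predictions = json.load(response)
```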

Another possibility is to use one of the available integrations to put the classifier to work and automatically categorize incoming text in your favorite apps with zero lines of code:

Classify new text via integrations

Visualize Your Classification Data

Now that you’ve built a classifier, it’s time to make your results shine in vivid visual detail. Business intelligence visualization platforms allow you to see a broad data overview or fine-grained results.

MonkeyLearn Studio is an all-in-one text data analysis and visualization tool. Choose the classification (and other) techniques you need and perform them together – from data collection, to organization, analysis, and visualization. It all works in a single, seamless interface.

Take a look at the MonkeyLearn Studio dashboard. Search by aspect, sentiment, etc. You can add or remove analyses or change data right in the browser dashboard and see the results instantly.

Take a look at the example below, where we performed aspect-based sentiment analysis on customer reviews of Zoom. Each piece of feedback is categorized by Usability, Support, Reliability, etc., then sentiment analyzed to show the opinion of the writer.

MonkeyLearn Studio dashboard showing results for intent classification and sentiment analysis in charts and graphs.

Individual reviews are organized by date and time to follow categories and sentiments as they change over time.

Play around with the MonkeyLearn Studio public dashboard to see just how easy it is to use.

Takeaway

Text classification can be your new secret weapon for building cutting-edge systems and organizing business information. Turning your text data into quantitative data is incredibly helpful to get actionable insights and drive business decisions. Also, automating manual and repetitive tasks will help you get more done.

Are you interested in creating your first text classifier? Visit MonkeyLearn and start experimenting right away. You can quickly create text classifiers with machine learning by using our easy-to-use UI (no coding required!) and put them to work by using our API or integrations.

Visit MonkeyLearn Studio and request a demo to see what text analysis and data visualization can do for your business.

Have questions? Reach out and we’ll help you get started with text classification.

MonkeyLearn Inc. All rights reserved 2020