Text Classification with Python

Classifying texts is a difficult task, especially if your business is dealing with large volumes of data. Did you know that 2.5 quintillion bytes of information are generated every day? That’s a lot of social media posts, emails, surveys, live chats, and so on, with billions of valuable insights that help companies make data-based decisions.

Classifying all this text data manually is tedious work, not to mention time-consuming. You will need a top-notch team who knows your tagging structure inside out, for starters. But, no matter how hard-working and efficient, they’ll probably grow tired of this repetitive task and make mistakes if they have to do it on a daily basis.

So, why not automate your text classification process with Python? Python is one of the most popular programming languages, and when combined with machine learning models for text classification it can do wonders for your business. Contrary to what most people may think, it’s easy to get started with these tools. Below, you’ll find a detailed guide on how to start using text classification with Python and everything else you need to know about these tools.

Let’s get going!

What Is Text Classification? 

Text classification (also known as text tagging or text categorization) is a process in which texts are sorted into categories. For example, you can classify news articles by topic, customer feedback by sentiment, support tickets by urgency, and so on. This process can be done either manually (carried out by a human agent who reads texts and categorizes them) or automatically (which involves machine learning processes and algorithms that will classify your texts in a faster and more cost-effective way). 

Companies receive text data all the time. Be it emails, chats, social media comments, support tickets, or NPS responses, all these texts are very rich sources of information. However, they are not structured, so you have to tag and analyze these texts before you can make sense of them and obtain insights. 

You may have an excellent team to carry out these tasks, but classifying text manually can be slow, boring, and error-prone. That is why many companies today are choosing text classification with machine learning to automate the tagging process in a much more efficient way.

Picture this situation: you launch a new product and ask your customers to fill in a satisfaction survey. Your product ends up being far more popular than you had expected, and you receive 50,000 survey responses. That’s wonderful, but you have a very small team to analyze all these responses. Besides, it would take forever to get through them all and reply to any urgent issues. Should you start hiring new people right away to solve this problem? The simple answer is no. Instead, text classification with Python can help to automatically sort this data, get better insights and automate processes. 

Tools for Using Text Classification with Python

Implementing text classification with Python can be a daunting task, especially when creating a classifier from scratch. 

Luckily, there are many resources that can help you carry out this process, whether you choose to use open-source or SaaS tools.

Open-Source Libraries for Text Classification

Python is the preferred programming language when it comes to text classification with AI because of its simple syntax and the number of open-source libraries available. One of them is Scikit-Learn, a general-purpose machine learning library and one of the most user-friendly available, as it comes with many resources and tutorials. You can also use NLTK, a library focused on Natural Language Processing (NLP). It helps you split your texts into sentences and words and tag parts of speech, so that your model can classify them more easily. As well as Scikit-Learn and NLTK, you can use SpaCy, a library that uses deep learning for building sophisticated models for a variety of NLP problems. It relies on lemmatization rather than stemming, and it ships with word embeddings that can make your model more accurate. Once you are comfortable with more complex algorithms, you may want to try out Keras, a user-friendly deep learning API that puts user experience first.

TensorFlow is another option used by experts to implement text classification with deep learning. You may also want to give PyTorch a try, as its deep integration with popular libraries makes it easy to write neural network layers in Python. 
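To give a feel for what the open-source route involves, here is a minimal sketch using Scikit-Learn. It assumes Scikit-Learn is installed, and the six training reviews and the three tags are hand-written for illustration only. A real classifier would need far more data than this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-written training reviews: (text, tag).
training_reviews = [
    ("The monthly price is too expensive for small teams", "Pricing"),
    ("Great value, the pricing plans are cheap", "Pricing"),
    ("Support replied to my ticket within minutes", "Customer Support"),
    ("The support team never answered my email", "Customer Support"),
    ("The interface is intuitive and easy to use", "Ease of Use"),
    ("Setup was simple and the dashboard is easy to navigate", "Ease of Use"),
]
texts, tags = zip(*training_reviews)

# TF-IDF turns each review into a weighted word vector;
# Naive Bayes then learns which words are typical of each tag.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, tags)

new_reviews = [
    "Their support agents answered my question quickly",
    "The price is expensive",
]
for review, tag in zip(new_reviews, model.predict(new_reviews)):
    print(f"{tag}: {review}")
```

The pipeline keeps vectorization and classification together, so the same object can be fitted once and then reused to tag new texts.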

Open-source tools are great because they offer a lot of flexibility. On the downside, creating a machine learning model for classifying your texts using open-source tools is not easy. You will need time and resources to build the tool, and even the help of data scientists to gather data, train the model, and build the necessary infrastructure for running a text classification system before it is ready to give you reliable, accurate predictions.

SaaS APIs for Text Classification

Alternatively, SaaS APIs such as MonkeyLearn can save you a lot of time, money, and resources when implementing a text classification system. You only need to enter a few lines of code, and you will not have to worry about building the infrastructure or learning the ins and outs of machine learning. Just use MonkeyLearn’s API and Python SDK to start classifying text data with a machine learning model. 

With MonkeyLearn, you can either build a custom text classifier with your own tags and data or use one of the pre-trained models for text classification tasks. If you like the way a model works, you can find information on how to integrate it with Python in the API tab.

For example, this is how you make an API request to this pre-trained model for sentiment analysis:
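A minimal sketch of such a request, assuming the `monkeylearn` Python SDK (installed with `pip install monkeylearn`) and its `MonkeyLearn(api_key)` client with a `classifiers.classify(model_id, data)` method. `YOUR_API_KEY` and `YOUR_MODEL_ID` are placeholders for the values shown in your account, and `classify_texts` is a helper name introduced here for illustration.

```python
def classify_texts(client, model_id, texts):
    """Send a list of texts to a classifier and return the parsed response body."""
    response = client.classifiers.classify(model_id, texts)
    return response.body


def main():
    # Imported here so the helper above can be read and tested without the SDK installed.
    from monkeylearn import MonkeyLearn  # pip install monkeylearn

    ml = MonkeyLearn("YOUR_API_KEY")  # placeholder: use your own API key
    results = classify_texts(ml, "YOUR_MODEL_ID", ["This is a great tool!"])
    print(results)
```

Calling `main()` sends the request, so it needs a valid API key and network access.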

The API response for this request will look like this:
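The exact payload depends on the model. As a rough sketch, assume the response body is a list with one entry per input text, each carrying a `classifications` list of `tag_name` and `confidence` values. Under that assumption, the most confident tag can be pulled out as follows. The sample values are hand-written for illustration and are not real API output, and `top_tags` is a hypothetical helper.

```python
# Hand-written sample that mimics the assumed response shape; not real API output.
sample_response = [
    {
        "text": "This is a great tool!",
        "error": False,
        "classifications": [
            {"tag_name": "Positive", "confidence": 0.99},
        ],
    },
]


def top_tags(response_body):
    """Return (text, tag, confidence) for the most confident tag of each text."""
    results = []
    for item in response_body:
        classifications = item.get("classifications") or []
        if item.get("error") or not classifications:
            # No usable prediction for this text.
            results.append((item.get("text"), None, None))
            continue
        best = max(classifications, key=lambda c: c["confidence"])
        results.append((item["text"], best["tag_name"], best["confidence"]))
    return results


for text, tag, confidence in top_tags(sample_response):
    print(f"{text!r} -> {tag} ({confidence})")
```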

Simple and straightforward, right?

How to Get Started with Text Classification Using Python

The tools you use to create your classification model (SaaS or open-source) will determine how easy or difficult it is to get started with text classification. Besides choosing the right tool for training a text classifier, you’ll also need to make sure your datasets are up to scratch. 

In this section, we’ll cover the step-by-step process for training a text classifier with machine learning from scratch. Then, we’ll show you how you can use this model for classifying text programmatically with Python. Let’s get our hands dirty!

Data for Training a Model

Without clean, high-quality data, your classifier will not be accurate. 

A machine learning model is like a child: if you want to teach it something, you need to show it good examples first. But, how do you get good data to train your model? Well, you can choose to use internal data, that is, the information generated from the tools you use every day (Slack, Zendesk, Salesforce, SurveyMonkey, Retently, and so on). Most of the time, you’ll be able to get this data from APIs. Sometimes you can even download the data that you need in a CSV or Excel file. 
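As an illustration of the CSV route, here is a minimal sketch that reads text and label pairs with Python’s standard `csv` module. The column names `text` and `label`, the helper `load_labeled_rows`, and the inline sample rows are assumptions made for this example. A real export will have its own column names.

```python
import csv
import io


def load_labeled_rows(csv_file, text_column="text", label_column="label"):
    """Read (text, label) pairs from an open CSV file, skipping rows with no text."""
    rows = []
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").strip()
        label = (row.get(label_column) or "").strip()
        if text:
            rows.append((text, label))
    return rows


# Inline stand-in for an exported file. With a real export, pass
# open(path, newline="", encoding="utf-8") instead.
sample_export = io.StringIO(
    "text,label\n"
    "The app keeps crashing on login,Bug\n"
    ",Bug\n"
    '"Love the new dashboard, great work",Praise\n'
)
rows = load_labeled_rows(sample_export)
print(rows)
```

Dropping rows with empty text at load time is one small way to keep the training data clean before it reaches the model.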

Alternatively, you can use external data that is available online for training a text classifier. To gather relevant information, you can scrape the web using BeautifulSoup or Scrapy, use APIs (e.g. Twitter API), or access public datasets. Some examples of the latter are: 

  • Reuters news dataset: Reuters compiled 21,578 news articles categorized into 135 topics (such as Economy, Sports, and so on).
  • Amazon Product Reviews: another useful dataset to train your model, which contains 143+ million reviews and star ratings.
  • Spambase: a dataset with 4,601 emails labeled as spam or not spam.
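For the scraping route mentioned above, the core idea is to pull the visible text out of a page’s HTML. The sketch below uses the standard library’s `html.parser` as a dependency-free stand-in for BeautifulSoup or Scrapy, and it parses a hand-written HTML string rather than a live page. `ParagraphExtractor` is a name introduced for this example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> tags, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._chunks = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._chunks.append(data)


# Hand-written page used in place of a downloaded one.
page = (
    "<html><body><h1>Reviews</h1>"
    "<p>Great <b>value</b> for money.</p>"
    "<p>  Support was slow. </p><p></p>"
    "</body></html>"
)
parser = ParagraphExtractor()
parser.feed(page)
print(parser.paragraphs)
```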

With this data, you can start training your classifier to differentiate texts from one another. Below you’ll find out how to create and train your own classifier!

Creating Your Own Text Classifier

If you are looking for more accuracy and reliability when classifying your texts, then you should build your own model. This way you can teach the algorithm how you expect texts to be classified, using your own data and criteria.

The easiest way to do this is using MonkeyLearn’s platform, as you won’t have to invest endless hours learning about machine learning or gathering all the resources you need to build a model from scratch. 

Sign up for free and let’s get started! Follow this step-by-step tutorial to create a text classifier for topic detection. This model will be able to predict the topic of a product review based on its content.

1. Create Your Classifier

To build a machine learning model using MonkeyLearn, you’ll have to access your dashboard, then click ‘create a model’, and choose your model type – in this case a classifier:

Then, you will have to choose a specific type of classifier. This time, choose topic classification to build your model:

2. Upload Your Dataset

The next step is to upload texts for training your classifier. We are going to upload a CSV file with reviews of a SaaS product (you can download the dataset here). You can also use MonkeyLearn’s integrations to import your data from Google Sheets, Zendesk, Zapier, and more!

3. Define Your Tags

Classifiers will categorize your text data based on the tags that you define. In this example, we’ve defined the tags Pricing, Customer Support, and Ease of Use:

4. Start Tagging Data

Let’s start training the model! You’ll be asked to tag some samples to teach your classifier to categorize the reviews you uploaded. You’ll need around 4 samples of data for each tag before your classifier starts making predictions on its own:
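If you keep your own record of tagged samples, a few lines of Python can show which tags still fall short of that threshold. This is a sketch only: the sample reviews are hand-written, `missing_samples` is a hypothetical helper, and the minimum of 4 is simply the figure quoted above.

```python
from collections import Counter


def missing_samples(tagged_samples, tags, minimum=4):
    """Return how many more samples each tag needs to reach the minimum."""
    counts = Counter(tag for _text, tag in tagged_samples)
    return {tag: minimum - counts[tag] for tag in tags if counts[tag] < minimum}


tags = ["Pricing", "Customer Support", "Ease of Use"]
tagged_samples = [
    ("Too expensive for what it offers", "Pricing"),
    ("The free plan is generous", "Pricing"),
    ("Cheaper than the alternatives", "Pricing"),
    ("We upgraded because the price was fair", "Pricing"),
    ("Support answered within an hour", "Customer Support"),
    ("Very easy to set up", "Ease of Use"),
    ("The UI is confusing at first", "Ease of Use"),
]
print(missing_samples(tagged_samples, tags))
```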

5. Test It!

After tagging a certain number of reviews, your model will be ready to go! Now you need to test it. Just type something in the text box and see how well your model works:

And that’s it! Now you can start using your model whenever you need it. To improve its confidence and accuracy, you just have to keep tagging examples, which gives the model more information about how you expect data to be classified.

6. Calling the Model API with Python

Now, let’s see how to ‘call’ your text classifier using its API with Python. It’s not that different from how we did it before with the pre-trained model:

The API response will return the result of the analysis:
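A minimal sketch of this last step, assuming the `monkeylearn` Python SDK (`MonkeyLearn(api_key)` and `classifiers.classify(model_id, data)`) and a response body that lists `tag_name` and `confidence` values under `classifications` for each text. `YOUR_API_KEY` and `YOUR_CUSTOM_MODEL_ID` are placeholders, and `split_by_confidence` and the 0.7 threshold are illustrative choices rather than anything the API prescribes.

```python
def split_by_confidence(response_body, threshold=0.7):
    """Separate confident predictions from texts that deserve a manual look."""
    confident, needs_review = [], []
    for item in response_body:
        classifications = item.get("classifications") or []
        best = max(classifications, key=lambda c: c["confidence"], default=None)
        if best is not None and best["confidence"] >= threshold:
            confident.append((item["text"], best["tag_name"]))
        else:
            needs_review.append(item["text"])
    return confident, needs_review


def main():
    # Imported here so the helper above can be read and tested without the SDK installed.
    from monkeylearn import MonkeyLearn  # pip install monkeylearn

    ml = MonkeyLearn("YOUR_API_KEY")  # placeholder: use your own API key
    reviews = ["The pricing is fair", "I could not reach anyone in support"]
    response = ml.classifiers.classify("YOUR_CUSTOM_MODEL_ID", reviews)
    confident, needs_review = split_by_confidence(response.body)
    print("Tagged automatically:", confident)
    print("Worth tagging by hand:", needs_review)
```

Calling `main()` requires a valid API key and network access. Texts that land in the second list are good candidates for further manual tagging, which is how the model’s confidence improves over time.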

Wrap-up 

Creating your own text classification model using open-source tools with Python is very useful, but you’ll need machine learning skills, resources, and infrastructure to build and run your classifier. Alternatively, you can use SaaS tools like MonkeyLearn, and you’ll only have to enter a few lines of Python code to connect the machine learning model to various apps using the API.

MonkeyLearn is very intuitive, simple, and scalable. You’ll need basic coding skills to use our API, but you can also create and use machine learning models without machine learning knowledge or background. All you’ll need is time to tag text samples manually but, believe us, it is worth the effort!

Get started today with text classification by signing up to MonkeyLearn for free, or request a demo to get more information. Our team is ready to answer all your questions and help you get started!

Federico Pascual

COO & Co-Founder @MonkeyLearn. Machine Learning. @500startups B14. @Galvanize SoMa. TEDxDurazno Speaker. Wannabe musician and traveler.
