Document Classification

Document classification can be done manually or automatically, and both approaches have their advantages and disadvantages. Classifying documents manually gives humans greater control over the process, including the choice of which categories to use. However, when handling large volumes of documents, this process can be slow and monotonous. Automatic document classification, powered by machine learning, is much faster and more cost-efficient, and often more accurate at scale.

Document Classification or Text Classification?

Even though these terms sound similar, they’re slightly different. Text classification involves applying specific techniques, such as sentiment analysis, topic labeling, and intent detection, to your text-based documents. Plus, texts can be analyzed at different levels.

For example, you can run topic classification on a whole article to get a general picture of what the article talks about, or you can pre-process that text to divide it into paragraphs, sentences, or even opinion units to get more in-depth insights. 

Text analysis can be performed at:

  • Document level: obtains relevant information for a full document.
  • Paragraph level: obtains the most important categories of a single paragraph.
  • Sentence level: obtains relevant information from a single sentence.
  • Sub-sentence level: obtains relevant information from sub-expressions within a sentence (also known as opinion units). This is particularly useful when ambiguous sentences mention multiple topics.

So, which one is better? Should you analyze your documents as a whole or break them into smaller units? Unfortunately, there is no straight answer. Your choice will depend on your data and objectives.
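To make the levels above concrete, here is a minimal sketch of how a document might be split before classification. It uses only Python's standard library; the splitting rules (blank lines for paragraphs, end punctuation for sentences, commas and conjunctions for opinion units) and all function names are simplifying assumptions for illustration, not how any particular tool does it.

```python
import re

def split_paragraphs(document):
    """Split a document on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def split_sentences(paragraph):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def split_opinion_units(sentence):
    """Very rough sub-sentence split on commas, semicolons, 'but' and 'and'.

    This is an assumption for illustration; real opinion-unit extraction
    usually relies on a syntactic parser.
    """
    parts = re.split(r"\s*(?:,|;|\bbut\b|\band\b)\s*", sentence)
    return [p.strip(" .!?") for p in parts if p.strip(" .!?")]

# Made-up example review
document = (
    "The app is fast. The price is too high, but support is great.\n\n"
    "I would recommend it."
)

paragraphs = split_paragraphs(document)       # document -> paragraphs
sentences = split_sentences(paragraphs[0])    # paragraph -> sentences
units = split_opinion_units(sentences[1])     # sentence -> opinion units
```

Each resulting unit could then be classified on its own, so that a sentence mentioning both price and support receives a separate tag for each part.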

How Does Automatic Document Classification Work?

Classifying large volumes of documents is essential to make them more manageable and, ultimately, obtain valuable insights. But human agents might find the incoming volume of data very hard to manage, not to mention tedious and inefficient. 

That is why automatic document classification comes in handy. This is a process fueled by Natural Language Processing (NLP), by which algorithms automatically assign one or more categories to your text-based documents, such as articles, emails, or survey responses. Using machine learning models is faster, more scalable, and more consistent than manual classification because machines never get tired or bored, and they don’t change their criteria over time.

Let’s take a look at three different approaches to document classification you can adopt: 

  • Supervised: In this method, you need to manually tag a number of texts before the machine learning model can start making predictions on its own. First, you define a set of tags (let’s say, Customer Service, Usability, Pricing) and use them to classify some of your documents by hand. From these examples, the model learns to make associations between the texts and the expected tags. For example, a customer review that says “the software is quite expensive” should be tagged as Pricing. The number of texts you tag also influences how confident the model is in its predictions.
  • Unsupervised: With this method, documents containing similar words or sentences are grouped together (clustered) without any prior training. For example, the words RAM, SSD, or Printer in customer reviews would be recognized as sharing similar qualities and grouped within the same cluster.
  • Rules-based: As its name indicates, this method is based on linguistic rules that give instructions to the model, which will automatically tag your texts following these patterns. These rules can be based on morphology, lexis, syntax, semantics, and phonology.

For example: 

(Update|OS|Bugs) → Software

Following the rule above, the model will tag any text that mentions these terms as Software.

The main advantage of this method is that the rules can be continually refined to improve the performance of the model, so it can provide higher quality, more accurate insights. On the negative side, creating this type of system is complex, time-consuming, and hard to scale. You would have to add new rules or change existing ones every time you need to analyze a new type of text.
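A rule like the Software example above could be sketched with regular expressions as follows. This is only an illustration: the Pricing rule, the helper names, and the choice of case-insensitive whole-word matching are assumptions, not part of any specific product.

```python
import re

# Tag -> list of trigger terms. Only the Software rule comes from the
# example above; the Pricing rule is a hypothetical addition.
RULES = {
    "Software": ["Update", "OS", "Bugs"],
    "Pricing": ["expensive", "price", "cost"],
}

def compile_rules(rules):
    """Turn each term list into one case-insensitive whole-word pattern."""
    compiled = {}
    for tag, terms in rules.items():
        pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
        compiled[tag] = re.compile(pattern, re.IGNORECASE)
    return compiled

def tag_text(text, compiled):
    """Return every tag whose pattern matches somewhere in the text."""
    return [tag for tag, pattern in compiled.items() if pattern.search(text)]

COMPILED = compile_rules(RULES)
tags = tag_text("The latest update fixed two bugs.", COMPILED)
```

The sketch also shows why this approach is hard to scale: whole-word matching means "cost" does not match "costs", so every new inflection or new type of text needs another rule.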

Why Use Document Classification?

Today, businesses are overwhelmed with the amount of information they receive, such as articles, survey responses, or support tickets. These texts are not structured, so it’s hard to understand the insights they contain. This is where automatic document classification can help: 

  • Email management: automated software can filter spam or route emails to a specific inbox, for example, based on the words included in the message.
  • Sentiment analysis: you can classify your documents by determining their overall sentiment (Positive, Negative, or Neutral).
  • Triaging: document classification comes in handy to automatically sort articles or texts and route them to a relevant team. For example, let’s say you work for a software company and that you use document classification to tag incoming support tickets. You can specify that new tickets labeled as Bug should be automatically routed to the technical team.
  • Identification: automated classification can help identify the language, genre, or topic of a text, for example to find texts that are suitable for different age groups or interests.
  • Analytics: automated classification can be used for monitoring important information, for example comments related to public health on social media or problems with your service or product. 

Getting Started with Document Classification powered by AI

For automated document classification, there are two steps you’ll need to go through: preparing the dataset and training the algorithm. Let’s take a look at them in detail:

1. The dataset

This is the most important element you’ll need to gather for training your classifier. The dataset needs to contain enough documents or examples for each category so that the algorithm can learn how to differentiate between them.

For example, if you want to classify documents into five categories, you would need roughly 100 to 300 documents per category to train a classifier with decent predictive capabilities. So, the dataset for training this classifier would need at least 500 documents in total (5 categories × 100 documents). Keep in mind that, in general, the more data you use, the more accurate the classifier will be.

Moreover, the quality of the data is critical when training a classifier with machine learning. If most of the examples that you feed the classifier are incorrectly tagged, the model will learn from these mistakes and make similar errors in its predictions.
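Before training, it can help to check that every category has enough examples. The following is a minimal sketch of such a check; the dataset is made up, and the threshold simply reuses the lower bound of the 100 to 300 range mentioned above as an assumption.

```python
from collections import Counter

# Lower bound of the per-category range mentioned above (an assumption,
# not a hard rule).
MIN_PER_CATEGORY = 100

def category_counts(dataset):
    """Count examples per tag; dataset is an iterable of (text, tag) pairs."""
    return Counter(tag for _, tag in dataset)

def underrepresented(dataset, categories, minimum=MIN_PER_CATEGORY):
    """Return {category: count} for categories with fewer than `minimum` examples."""
    counts = category_counts(dataset)
    return {c: counts.get(c, 0) for c in categories if counts.get(c, 0) < minimum}

# Tiny made-up dataset: repeated texts stand in for real documents.
dataset = (
    [("the software is quite expensive", "Pricing")] * 120
    + [("the app keeps crashing", "Usability")] * 40
)

missing = underrepresented(dataset, ["Pricing", "Usability", "Customer Service"])
```

Here `missing` would flag Usability and Customer Service as needing more tagged examples. A count check like this says nothing about whether the tags themselves are correct, so reviewing a sample of tagged documents by hand is still worthwhile.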

2. Training the Algorithm 

Once you have the data to train your model, the next step is to use that data to train a classification algorithm. There are many algorithms you can use if you’re creating a classifier from scratch, for example Naive Bayes and Support Vector Machines.
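To give a feel for what training involves, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The class name, the whitespace tokenizer, and the six training texts are assumptions made for illustration; a real classifier would need far more data and better text preprocessing.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, tags):
        self.tag_counts = Counter(tags)           # documents per tag
        self.word_counts = defaultdict(Counter)   # word frequencies per tag
        for text, tag in zip(texts, tags):
            self.word_counts[tag].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.tag_counts.values())
        best_tag, best_score = None, -math.inf
        for tag, doc_count in self.tag_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[tag].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen during training
                count = self.word_counts[tag][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag

# Made-up training examples using the tags from earlier in the article.
train_texts = [
    "the software is quite expensive",
    "the price is too high",
    "the interface is easy to use",
    "the menus are confusing to use",
    "support answered my email quickly",
    "the support agent was helpful",
]
train_tags = [
    "Pricing", "Pricing",
    "Usability", "Usability",
    "Customer Service", "Customer Service",
]

model = NaiveBayes().fit(train_texts, train_tags)
prediction = model.predict("the price is expensive")
```

With this toy data, the words "price" and "expensive" only appear in Pricing examples, so the model assigns that tag to the new text.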

If you know how to code, you can use open source tools such as scikit-learn, spaCy, or TensorFlow to train these algorithms to classify your documents, but you’ll need to have some basic knowledge of machine learning and build the necessary infrastructure from scratch.

On the other hand, there are some platforms like MonkeyLearn that make it a lot easier to train your classifier with machine learning. You just need to upload your data (in the form of an Excel or CSV file), define your tags, and classify some documents by hand using a simple user interface to train your classifier. And that’s it! After tagging a certain number of texts, your classifier will be ready for production.

You can use a trained model in MonkeyLearn to classify new documents by uploading data in a batch, using one of the available integrations with third-party tools (such as Google Sheets or Zapier) or via the API.

Watch this tutorial to get to know more about how to build your own document classifier in a very simple way. 

Wrap Up

Documents are some of the richest sources of information for any business. Be it articles, customer surveys, or support tickets, all of them contain valuable insights. The best way to get to these insights is by classifying all the data you receive so you can start making sense of them.

Manual classification of documents can be a nightmare, especially if the volume of information is high. In this scenario, labeling documents becomes repetitive and human agents are likely to make mistakes. 

That’s when machine learning comes to the rescue. Document classification is much more efficient, cost-effective, and accurate when done by machines. Save yourself the hassle of manual analysis and start using machine learning for effective document classification! There are many classification tools available that make it super easy to start using AI for document classification; some of these tools don’t even require you to write a single line of code.

MonkeyLearn, for example, can help you achieve your goals with its easy-to-use interface and customizability. Additionally, you can integrate it with applications you use on a daily basis to efficiently classify your documents in seconds. Sign up for free to MonkeyLearn and get started with document classification right away!

Federico Pascual


COO & Co-Founder @MonkeyLearn. Machine Learning. @500startups B14. @Galvanize SoMa. TEDxDurazno Speaker. Wannabe musician and traveler.

