Document Classification

Document classification is the act of labeling – or tagging – documents using categories, depending on their content. Document classification can be manual (as it is in library science) or automated (within the field of computer science), and is used to easily sort and manage texts, images or videos.

Both types of document classification have their advantages and disadvantages. On the one hand, classifying documents manually gives humans greater control over the process of classification, and they can make decisions as to which categories to use. However, when handling large volumes of documents, this process can be slow and monotonous.

By contrast, automatic document classification powered by machine learning is much faster, more cost-efficient, and often more accurate when handling large volumes of documents.

Document Classification Vs Text Classification

Even though these terms sound similar, they’re slightly different. Document classification can cover any kind of document, including images and videos, whereas text classification applies text analysis techniques specifically to your text-based documents.

With text classification, you can also analyze texts at different levels:

  • Document level: obtains relevant information for a full document.
  • Paragraph level: obtains the most important categories of a single paragraph.
  • Sentence level: obtains relevant information from a single sentence.
  • Sub-sentence level: obtains relevant information from sub-expressions within sentences (also known as opinion units). This is particularly useful when ambiguous sentences mention multiple topics.

For example, you can run topic classification on a whole article to get a general picture of what the article talks about, or you can pre-process that text to divide it into paragraphs, sentences, or even opinion units to get more in-depth insights.
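As a rough illustration of that pre-processing step, here is a minimal Python sketch that splits a document into paragraphs and sentences. The function name split_levels and the sample text are made up for this example, and the sentence splitter is deliberately naive (it breaks after ".", "!" or "?"), so treat it as a starting point rather than a robust tokenizer. Splitting into opinion units needs more linguistic analysis and is not attempted here.

```python
import re

def split_levels(document: str) -> dict:
    """Split a document into document-, paragraph- and sentence-level units."""
    # Paragraphs: blocks of text separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    # Sentences: naive split on whitespace that follows ".", "!" or "?".
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    return {
        "document": [document.strip()],
        "paragraphs": paragraphs,
        "sentences": sentences,
    }

# Hypothetical sample text, used only to show the three levels.
article = (
    "The app is fast. The price is too high!\n\n"
    "Support answered quickly."
)
units = split_levels(article)
```

Each list in units could then be sent to a classifier separately, depending on how fine-grained you need the results to be.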

So, which one is better? Should you analyze your documents as a whole or break them into smaller units? Unfortunately, there is no straight answer. Your choice will depend on your data and objectives.

How Does Automatic Document Classification Work?

Classifying large volumes of documents is essential to make them more manageable and, ultimately, obtain valuable insights. But human agents might find the incoming volume of data very hard to manage, and the work tedious and inefficient.

That is why automatic document classification is a great option. Using Natural Language Processing (NLP) and machine learning algorithms, you can automatically assign one or more categories to huge amounts of text. Machine learning tools are faster, more scalable, and less biased than manual classification because machines never get tired or bored, and they don’t change their criteria over time.

Let’s take a look at three different approaches to document classification you can adopt: 

  • Supervised: In this method, you’ll need to define a set of tags (let’s say, Customer Service, Usability, Pricing) and manually tag a number of texts before machine learning models can start making predictions on their own. For example, a customer review that says “the software is quite expensive” needs to be tagged as Pricing. The more texts you tag, the more confident the model’s predictions become.

  • Unsupervised: In this method, documents containing similar words or sentences are grouped together by an algorithm without any prior training. For example, the words RAM, SSD, or Printer in customer reviews would be recognized as sharing similar qualities and grouped within the same cluster.

  • Rules-based: This method is based on linguistic rules that give instructions to models. Following these rules and patterns, which are based on morphology, lexis, syntax, semantics, and phonology, models will automatically tag your texts.

For example: 

(Update|OS|Bugs) → Software

Following the rule above, the model will tag any text that mentions these terms as Software.
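Here is a minimal sketch of how a rule like this could be expressed in Python with regular expressions. The Software pattern mirrors the rule above; the Pricing rule, the RULES table and the tag_text function are hypothetical additions for illustration only, not part of any particular tool.

```python
import re
from typing import List

# Each tag maps to a pattern of trigger terms.
# "Software" follows the rule in the text; "Pricing" is a made-up extra rule.
RULES = {
    "Software": re.compile(r"\b(update|os|bugs)\b", re.IGNORECASE),
    "Pricing": re.compile(r"\b(price|expensive|cheap)\b", re.IGNORECASE),
}

def tag_text(text: str) -> List[str]:
    """Return every tag whose pattern matches somewhere in the text."""
    return [tag for tag, pattern in RULES.items() if pattern.search(text)]

tags = tag_text("The latest update introduced bugs")
```

The word boundaries (\b) keep short terms such as OS from matching inside longer words like "most". Every new type of text would need its own entry in the rule table, which is where the maintenance cost of this approach comes from.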

The main advantage of this method is that the performance of the model can be improved over time by refining its rules, providing higher quality and more accurate insights. On the negative side, creating this type of system is complex, time-consuming, and hard to scale. You would have to add new rules or change existing ones every time you need to analyze a new type of text.

Why Use Document Classification?

Businesses are drowning in unstructured data, and the only way to make sense of it all is by using AI tools. This is where automatic document classification can help:

  • Triaging: document classification comes in handy to automatically sort articles or texts and route them to the relevant team. For example, let’s say you work for a software company and you use document classification to tag incoming support tickets. You can specify that new tickets labeled as Bug should be automatically routed to the technical team.
  • Identification: automated classification can help identify the language, genre or topic of a text – for example, texts that are suitable for different age groups or interests.
  • Analytics: automated classification can be used for monitoring important information, for example comments related to public health on social media or problems with your service or product.
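To make the triaging case more concrete, here is a small Python sketch of the routing step, assuming each ticket has already been labeled by a classifier. The Bug route comes from the example above; the queue names, the default queue and the route_ticket function are hypothetical.

```python
# Routing table: classifier label -> team queue.
# Only the "Bug" route comes from the example; queue names are made up.
ROUTES = {"Bug": "technical-team"}
DEFAULT_QUEUE = "general-support"  # hypothetical fallback for other labels

def route_ticket(ticket: dict) -> str:
    """Pick a queue for a ticket based on the label a classifier assigned."""
    return ROUTES.get(ticket["label"], DEFAULT_QUEUE)

ticket = {"id": 1, "text": "The app crashes on login", "label": "Bug"}
queue = route_ticket(ticket)
```

Keeping the routes in a plain table means new labels can be sent to new teams without touching the routing logic.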

Getting Started with Document Classification powered by AI

For automated document classification, there are two steps you’ll need to go through: preparing the dataset and training the algorithm. Let’s take a look at them in detail:

1. Gather your dataset

The dataset is the most important element you’ll need for training your classifier. It needs to contain enough documents or examples for each category so that the algorithm can learn how to differentiate between them.

For example, if you want to classify documents into five categories, you would need roughly 100 to 300 documents per category to achieve decent predictive capabilities. So, the dataset for training this classifier would contain at least 500 documents in total (5 categories × 100 documents). Keep in mind that the more data you use, the more accurate the classifier will tend to be.

Moreover, the quality of the data is critical when training a classifier with machine learning. If most of the examples that you feed the classifier are incorrectly tagged, the model will learn from these mistakes and will make similar errors in its predictions.
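Before training, a quick sanity check on the dataset can catch categories with too few examples. The sketch below counts the documents per tag and flags any category that falls under a minimum. The threshold of 100 is the lower end of the range mentioned above, and the underrepresented function and the label counts are invented for illustration.

```python
from collections import Counter

MIN_PER_CATEGORY = 100  # lower end of the 100 to 300 range mentioned above

def underrepresented(labels, minimum=MIN_PER_CATEGORY):
    """Return the categories that have fewer examples than the minimum."""
    counts = Counter(labels)
    return {tag: n for tag, n in counts.items() if n < minimum}

# Hypothetical label column from a tagged dataset.
labels = ["Pricing"] * 120 + ["Usability"] * 40 + ["Customer Service"] * 100
too_small = underrepresented(labels)
```

A check like this only looks at quantity. Spotting incorrectly tagged examples still requires reviewing a sample of the data by hand.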

2. Train the algorithm

Once you have the data to train your model, the next step is to use that data to train a classification algorithm. There are many complex algorithms you can use if you are creating a classifier from scratch, for example Naive Bayes and Support Vector Machines.
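To give a feel for what training from scratch involves, here is a compact sketch of a multinomial Naive Bayes classifier written with only the Python standard library. It assumes whitespace tokenization and Laplace (add-one) smoothing, and the training texts are a tiny made-up sample using the tags from earlier in this article. A real project would need far more data and would normally rely on a tested library implementation.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberate simplification)."""
    return text.lower().split()

class NaiveBayes:
    def fit(self, texts, labels):
        self.label_counts = Counter(labels)       # documents per tag
        self.word_counts = defaultdict(Counter)   # word frequencies per tag
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            # Start from the log prior of the tag.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen during training
                # Laplace smoothing: add one to every word count.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Tiny hypothetical training set; real datasets need far more examples.
train_texts = [
    "the software is quite expensive",
    "the price is too high",
    "the interface is easy to use",
    "navigation is confusing and hard to use",
    "support answered my question quickly",
    "the support agent was very helpful",
]
train_labels = [
    "Pricing", "Pricing",
    "Usability", "Usability",
    "Customer Service", "Customer Service",
]
model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("this plan is expensive")
```

The model simply picks the tag whose training vocabulary best explains the words in the new text, which is why both the amount and the quality of tagged examples matter so much.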

If you know how to code, you can use open source tools such as scikit-learn, spaCy, or TensorFlow to train these algorithms to classify your documents, but you’ll need to have some basic knowledge of machine learning and build the necessary infrastructure from scratch.

On the other hand, there are platforms like MonkeyLearn that make it a lot easier to train your classifier with machine learning. You just need to upload your data (in the form of an Excel or CSV file), define your tags, and classify some documents by hand using a simple user interface to train your classifier. And that’s it! After tagging a certain number of texts, your classifier will be ready for production.

You can use a trained model in MonkeyLearn to classify new documents by uploading data in a batch, using one of the available integrations with third-party tools (such as Google Sheets or Zapier) or via the API.

Watch this tutorial to get to know more about how to build your own document classifier in a very simple way. 

Wrap Up

Documents are some of the richest sources of information for any business. Be it articles, customer surveys, or support tickets, all of them contain valuable insights. The best way to get to these insights is by classifying all the data you receive so you can start making sense of it.

Manual classification of documents can be a nightmare, especially if the volume of information is high. In this scenario, labeling documents becomes repetitive and human agents are likely to make mistakes.

Document classification is much more efficient, cost-effective, and accurate when done by machines.

Save yourself the hassle of manual analysis and start using machine learning for effective document classification. There are many classification tools available that make it super easy to start using AI for document classification; some of these tools don’t even need you to write a single line of code.

MonkeyLearn, for example, provides pre-trained classification models that you can get started with right away in an easy-to-use interface. Additionally, you can integrate it with applications you use on a daily basis to efficiently classify your documents in seconds. Sign up to MonkeyLearn for free and get started with document classification right away!

Federico Pascual

November 21st, 2019