Text classification can be used for a wide variety of tasks such as sentiment analysis, topic detection, intent identification, and much more. A common question is whether it’s better to analyze documents as a whole or to preprocess them and divide them into smaller units before doing the analysis. Unfortunately, there is no one-size-fits-all answer. Which approach is more appropriate depends on your data and on your goals for the analysis.
Broadly speaking, there are four different levels of scope that can be applied in text classification:
- Document level obtains the relevant categories of a full document.
- Paragraph level obtains the relevant categories of a single paragraph.
- Sentence level obtains the relevant categories of a single sentence.
- Sub-sentence level obtains the relevant categories of sub-expressions within a sentence (also known as opinion units).
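As a rough illustration of the first three levels, the sketch below splits a document into paragraphs and sentences using only Python's standard library. The function names and the naive regular expressions are assumptions made for this example; real text usually needs a proper sentence tokenizer, and the sub-sentence level needs a dedicated model such as an opinion unit extractor.

```python
import re


def split_paragraphs(document: str) -> list[str]:
    """Split on blank lines; a naive paragraph heuristic."""
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace.

    This ignores abbreviations and other edge cases.
    """
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in parts if s]


document = (
    "The app is fast. Setup took two minutes.\n\n"
    "Support never replied to my ticket."
)

# Document level: the whole string is one unit.
# Paragraph level: one unit per block of text.
paragraphs = split_paragraphs(document)
# Sentence level: one unit per sentence across all paragraphs.
sentences = [s for p in paragraphs for s in split_sentences(p)]
```

Each list can then be passed to the same classifier, one unit at a time, to compare the results at different scopes.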
Splitting text into smaller chunks can be useful to provide granular results. Let’s say that you want to do sentiment analysis on customer feedback, for example:
"The user interface is quite straightforward and easy to use, but your documentation is super confusing"
From a text classification perspective, this expression is quite complex, as it contains multiple opinions. Not only will a sentiment analysis classifier struggle with this utterance, even humans will disagree on its classification: some will say it's a positive statement, others will say it's mostly negative, a few will probably say it's neutral (evening out both polarities), and some will even say it should be classified as both positive and negative.
In this case, a sub-sentence level scope can reduce the ambiguity of the utterance and increase agreement on the classification result. You can easily split a text into smaller units by using the Opinion Unit Extractor available on MonkeyLearn. In this example, the extractor returns two separate opinion units:
- Opinion 1: The user interface is quite straightforward and easy to use,
- Opinion 2: but your documentation is super confusing
Then, you can classify each opinion unit separately to get more granular results. Following our example above, this pre-trained model for sentiment analysis returns the following results:
- "The user interface is quite straightforward and easy to use," --> Tag: Positive. Confidence: 99.6%.
- "but your documentation is super confusing" --> Tag: Negative. Confidence: 75%.
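The split-then-classify flow can be sketched in a few lines of Python. This is a toy illustration only, not MonkeyLearn's extractor or sentiment model: the conjunction-based splitter, the tiny word lists, and the function names are all assumptions made for the example, and it produces no confidence scores.

```python
import re

# Hypothetical toy lexicon; a real classifier is trained on labelled data.
POSITIVE = {"straightforward", "easy", "great", "fast"}
NEGATIVE = {"confusing", "slow", "broken", "bad"}


def split_opinion_units(sentence: str) -> list[str]:
    """Naively split before contrastive conjunctions such as 'but'."""
    parts = re.split(r"\s+(?=(?:but|however|although)\b)", sentence)
    return [p.strip() for p in parts if p.strip()]


def classify_sentiment(text: str) -> str:
    """Count lexicon hits and return Positive, Negative or Neutral."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


feedback = (
    "The user interface is quite straightforward and easy to use, "
    "but your documentation is super confusing"
)

# Sub-sentence scope: classify each opinion unit on its own.
units = split_opinion_units(feedback)
results = [(unit, classify_sentiment(unit)) for unit in units]

# Sentence scope: classify the whole utterance at once.
whole = classify_sentiment(feedback)
```

With this toy scorer, the whole sentence comes out as Positive and the complaint about the documentation is lost, while the two opinion units are tagged Positive and Negative respectively.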
Splitting feedback into opinion units also makes it easier to map the results of different classifiers onto each other. For example, you might also be interested in knowing what people like or dislike about your brand or product. You can achieve this with aspect-based sentiment analysis, which combines the results of an aspect classifier with the results of a sentiment classifier. In the example above, the first opinion unit would be tagged as UX and Positive, and the second opinion unit would be tagged as Documentation and Negative.
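Combining the two classifiers amounts to joining their outputs on the opinion unit they share. The sketch below shows that join with keyword lists standing in for the two trained models; the `tag` helper, the keyword sets, and the default labels are hypothetical and exist only for this example.

```python
import re

# Hypothetical keyword lists standing in for two trained classifiers.
ASPECT_KEYWORDS = {
    "UX": {"interface", "design", "navigation"},
    "Documentation": {"documentation", "docs", "tutorial"},
}
SENTIMENT_KEYWORDS = {
    "Positive": {"straightforward", "easy", "great"},
    "Negative": {"confusing", "slow", "broken"},
}


def tag(text: str, keywords: dict[str, set[str]], default: str) -> str:
    """Return the label whose keyword set overlaps the text most."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    best = max(keywords, key=lambda label: len(words & keywords[label]))
    # Fall back to the default label when no keyword matched at all.
    return best if words & keywords[best] else default


opinion_units = [
    "The user interface is quite straightforward and easy to use,",
    "but your documentation is super confusing",
]

# One record per opinion unit, carrying both an aspect and a sentiment.
aspect_sentiment = [
    {
        "text": unit,
        "aspect": tag(unit, ASPECT_KEYWORDS, "Other"),
        "sentiment": tag(unit, SENTIMENT_KEYWORDS, "Neutral"),
    }
    for unit in opinion_units
]
```

Because each record covers a single opinion, the aspect and the sentiment can be read as a pair, which would not be safe if both labels were assigned to the full sentence.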
Alternatively, the paragraph level scope can also be quite useful. For example, let’s imagine that you want to analyze Statement of Work (SOW) contracts and that your goal is to identify the pieces of text that talk about different topics such as Requirements, Scope & Schedule, Cost Structure, and Terms and Conditions. Here, you’ll get better results if you analyze the document at a paragraph level, since content in legal documents is properly structured and well defined (unlike customer feedback, which can be quite messy). The granularity provided by the sentence or sub-sentence levels is not needed.
There are some cases where you should analyze the whole document without any trimming or splitting. For example, when classifying documents such as publications, news reports, and other media articles into topics such as Sports, Politics or Technology, you’ll get better results using the full-text version of the documents since this approach provides more information and more context to the classification algorithm. It also provides better word co-occurrence for finding discriminative features which help the algorithm to find relevant categories for the content.
Experiment with Different Scopes
Experimentation is an important part of finding out which scope is the most appropriate approach for a text classification task. You can leverage MonkeyLearn to quickly train text classifiers with different scopes and find out which one is better for your use case and data.
Then, import the text data you want to use for training your classifier:
Next, you’ll need to define the tags or categories you want to use in your classifier:
Finally, you’ll need to tag data with the appropriate categories to start training the model:
Once you have finished tagging data, you will be able to test the classifier by using the UI, or by uploading a CSV or Excel file to test data in a batch.
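If you prefer to score a batch yourself, the idea can be sketched as follows. This is a minimal example under stated assumptions: the `classify` function is a hypothetical stand-in for a trained classifier, and the `text` and `expected` column names are invented for the illustration rather than a format required by MonkeyLearn.

```python
import csv
import io


def classify(text: str) -> str:
    """Hypothetical stand-in for a trained classifier."""
    return "Negative" if "confusing" in text.lower() else "Positive"


# In practice this would be a real file opened with newline="";
# an in-memory file keeps the sketch self-contained.
csv_file = io.StringIO(
    "text,expected\n"
    "The interface is easy to use,Positive\n"
    "The documentation is super confusing,Negative\n"
    "Setup was confusing at first but fine later,Positive\n"
)

rows = list(csv.DictReader(csv_file))
predictions = [classify(row["text"]) for row in rows]

# Share of rows where the prediction matches the expected tag.
correct = sum(pred == row["expected"] for pred, row in zip(predictions, rows))
accuracy = correct / len(rows)
```

Running the same batch at different scopes (full text, paragraphs, sentences, or opinion units) and comparing the accuracy is one simple way to see which scope suits your data.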