Building a quality machine learning model for text classification can be a challenging process. You need to define the tags that you will use, gather data for training the classifier, tag your samples, among other things.
In this post, we describe how you can successfully train text classifiers with machine learning using MonkeyLearn. The process is divided into five steps:
What are the tags that you want to assign to your texts? This is the first question you need to answer when you start working on your text classifier.
Let's take a simple example: you want to classify daily deals from different websites. In this example, your tags for the various deals might be:
Now let's imagine that you are interested in sentiment analysis. You might want to have the following tags:
In contrast, if you are interested in classifying support tickets for an e-commerce site, you might want to define these tags:
Sometimes you know which tags you want to work with (for example, if you are interested in sentiment analysis), but sometimes you don't. In those cases, you first need to explore and understand your data to determine which tags are appropriate for your model.
A crucial part of this process is giving your tags a proper structure and clear criteria. When you want to be more specific and use subtags, you will need to define a hierarchical tree that organizes your tags and subtags. Keep in mind that each set of subtags needs to be implemented in a separate classifier.
Going back to the example of classifying daily deals, you can organize your tags in the following way:
In this example, you will need to create a classifier for the first level of tags (Entertainment, Food & Drinks, Health & Beauty, Retail, Travel & Vacations and Miscellaneous) and a separate classifier for each particular subset of tags (e.g. Concerts, Movies, and Nightclubs).
When you define your tags, these are some of the things you need to take into consideration:
Use disjoint tags and avoid defining tags that are ambiguous or overlapping: there should be no doubt about which tag a text belongs to. Overlap between your tags will confuse your model and hurt the accuracy of its predictions.
Use a single classification criterion per model. Imagine that you want to tag companies based on their description. Your tags could be things like B2B, B2C, Enterprise, Finance, Media, Construction, etc. In this case, you should build two separate models: a) one to classify a company according to who its customers are (B2C, B2B, Enterprise) and b) another to classify a company according to the industry vertical it operates in (Finance, Media, Construction). Each model has its own criterion and purpose.
Organize your tags according to their semantic relations. For example, Basketball and Baseball should be subtags of Sports because they are specific types of sports. Likewise, Clothing and Electronics should be subtags of Retail. Therefore, if we want to implement a classification process that uses these tags, we'll need to create three classifiers: one that classifies between Sports and Retail, another that classifies between the Sports subtags (Basketball and Baseball), and a third that classifies between the Retail subtags (Clothing and Electronics). A classification process with a clear structure can make a significant difference and will go a long way toward accurate predictions.
If it's your first time training a text classifier, we recommend starting with a simple model. Complex models take more effort to get working well enough to make accurate predictions. Start with a small number of tags (fewer than 10).
When you get this simple model to work as expected, try adding a few more tags and work on your model until the new tags are accurate enough. You can keep iterating, adding more tags as you need them.
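To make the "one classifier per set of sibling tags" rule concrete, here is a minimal Python sketch that walks a tag tree and lists the classifiers you would need. The tree is the Sports/Retail example from above; the function name and the "root" label are our own, purely for illustration, and are not part of MonkeyLearn.

```python
# Hypothetical tag tree, taken from the Sports/Retail example in the text.
TAG_TREE = {
    "Sports": {"Basketball": {}, "Baseball": {}},
    "Retail": {"Clothing": {}, "Electronics": {}},
}


def classifiers_needed(tree, parent="root"):
    """Return {parent: [sibling tags]}, one entry per classifier to build."""
    if not tree:
        # A tag without subtags does not need a classifier of its own.
        return {}
    classifiers = {parent: list(tree)}
    for tag, subtree in tree.items():
        classifiers.update(classifiers_needed(subtree, parent=tag))
    return classifiers


plan = classifiers_needed(TAG_TREE)
for parent, tags in plan.items():
    print(f"classifier under {parent!r}: {tags}")
```

For this tree the plan has three entries: one top-level classifier and one for each set of subtags.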
Once you have defined your tags, the next step is to obtain text data, that is, the texts that you want to use as training samples and that are representative of future texts that you would want to classify automatically with your model.
We suggest the following sources for gathering data:
You can use internal data, like files, documents, spreadsheets, emails, support tickets, chat conversations and more. You may already have this data in your databases or tools that you use every day:
Customer Support / Interaction:
CRMs:
Chat:
NPS:
Databases:
Data Analytics:
You can usually export this data, either to CSV files with an export function or through an API.
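Once you have a CSV export, pulling out the text column takes only a few lines. The sketch below uses Python's standard csv module. The column names ("id", "message") and the sample rows are made up for illustration, and an in-memory string stands in for the exported file.

```python
import csv
import io

# Stand-in for an exported file; in practice, open the real CSV instead.
# The column names and rows here are hypothetical.
exported = io.StringIO(
    "id,message\n"
    "1,My order arrived damaged\n"
    "2,\n"
    "3,How do I reset my password?\n"
)


def load_texts(handle, text_column):
    """Return the non-empty values of one column from a CSV file object."""
    reader = csv.DictReader(handle)
    texts = []
    for row in reader:
        text = (row.get(text_column) or "").strip()
        if text:  # skip rows where the text column is empty
            texts.append(text)
    return texts


texts = load_texts(exported, "message")
print(texts)
```

Dropping empty rows at this stage saves you from tagging blank samples later on.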
Data is everywhere, and you can automatically get data from the web by using web scraping tools, APIs and open data sets.
If you have coding experience, you can use a web scraping framework to build your scraper to get data from the web. These are some of the most used tools for web scraping:
Python:
Ruby:
Javascript:
PHP:
If you don't have coding experience, as an alternative you can use some of these visual tools where you can build a web scraper with just a few clicks:
Besides scraping, you can use APIs to connect with some websites or social media platforms to get the valuable data you need to train your machine learning classifier. For example, these are some useful APIs to obtain text data:
You can use open data from sites like Kaggle, Quandl, and Data.gov.
Tools like Zapier or IFTTT can be helpful for getting your text data, especially if you don't have coding experience. You can use them to connect to the tools that you use every day through the API but without coding :)
These are just some examples; you can find data everywhere, so consider using any particular tool or technique that you are familiar with.
After getting the data, you'll be ready to train a text classifier using MonkeyLearn. For this, you should follow these steps:
1. Create a new model and then click Classifier:
2. Import the text data using a CSV/Excel file with the data that you gathered:
3. Select the columns with the texts that you want to use for training the model:
4. Create the tags you will use for the classifier. You'll need at least two tags to get started, but you can add more later:
5. Tag each text that appears with the appropriate tag or tags. By doing this, you will be teaching the machine learning algorithm that for a particular input (text), you expect a specific output (tag):
You will need to label at least four texts per tag to continue to the next step.
6. Name your model:
And ta-da! MonkeyLearn will train the classifier with the text data and tags you provided:
Now that the classification model is trained you can use it right away to classify new text. Under the "Run" tab you can test the model directly from the user interface:
You can also upload a CSV or Excel file with new data to process text in a batch all at once:
Or you can integrate it using our API or any of our integrations:
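As a rough idea of what an API integration looks like, the sketch below builds (but does not send) a classification request using only the Python standard library. The endpoint shape and the "Token" authorization header are our assumption of a v3-style REST call, and the API key and model ID are placeholders; check the API tab of your own model for the exact values before relying on this.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"    # placeholder, not a real key
MODEL_ID = "YOUR_MODEL_ID"  # placeholder, not a real model ID


def build_classify_request(texts, model_id=MODEL_ID, api_key=API_KEY):
    """Build a POST request that asks a classifier to tag a list of texts."""
    # Assumed v3-style endpoint; confirm it in your model's API tab.
    url = f"https://api.monkeylearn.com/v3/classifiers/{model_id}/classify/"
    body = json.dumps({"data": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_classify_request(["This deal is great!"])
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request), then parse the JSON body.
```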
Go to the Build tab to see options to further improve the classifier. You can go to the Train section to tag more texts:
The amount of data that you need for your classifier strongly depends on your particular use case, that is, the complexity of the problem and the number of tags you want to use within your classifier.
For example, training a classifier for sentiment analysis of tweets is not the same as training a model to identify the topics of product reviews. Sentiment analysis is a much harder problem to solve and needs much more text data. Analyzing tweets is also far more challenging than analyzing well-written reviews.
In short, the more text data you have, the better. We suggest starting by tagging at least 20 samples per tag and building from there. Depending on how accurate your classifier turns out, add more data. For topic detection, we have seen some accurate models with 200-500 training samples in total. Sentiment analysis models usually need at least 3,000 training samples to start seeing acceptable accuracy.
It's much better to start with fewer samples that you are 100% sure are representative of each tag and correctly tagged than to add tons of data with lots of errors.
Some of our users add thousands of training samples at once (when they are creating a custom classifier for the first time), thinking that a high volume of data is great for the machine learning algorithm. But in doing so, they don't pay attention to the data they use as training samples, and most of the time many of those samples are incorrectly tagged.
It's like teaching history to a kid with a history book full of facts that are plain wrong. The kid will learn from this data, but what he learns will be wrong. He won't know history, no matter how much he reads and learns from this book.
So, it's much better to start with few but high-quality training samples that are correctly tagged and take it from there. Afterward, you can work on improving the accuracy of your classifier by adding more quality data.
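A quick way to check whether every tag has enough samples is to count them before uploading. The sketch below flags tags that fall short of the 20-samples-per-tag starting point suggested above. The dataset and the function name are made up for illustration.

```python
from collections import Counter

MIN_SAMPLES_PER_TAG = 20  # the starting point suggested in the text


def tags_needing_data(samples, all_tags, minimum=MIN_SAMPLES_PER_TAG):
    """Map each under-represented tag to the number of samples it still needs."""
    counts = Counter(tag for _text, tag in samples)
    return {
        tag: minimum - counts[tag]
        for tag in all_tags
        if counts[tag] < minimum
    }


# Hypothetical dataset of (text, tag) pairs.
samples = (
    [("text about food", "Food & Drinks")] * 25
    + [("text about travel", "Travel & Vacations")] * 12
)
tags = ["Food & Drinks", "Travel & Vacations", "Retail"]

missing = tags_needing_data(samples, tags)
print(missing)
```

A count like this only tells you about quantity; you still need to review the samples themselves to make sure they are correctly tagged.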
Once you have tagged enough text data, you can begin to see a series of metrics in the Stats area that show how well the classifier would predict new data. These metrics are key to understanding your model and how you can improve it.
The accuracy is the percentage of samples that were predicted with the correct tag:
It's a metric that shows how well a classifier distinguishes between its tags. In the previous example, the classifier has an accuracy of 85% when distinguishing between its six tags (Entertainment & Recreation, Food & Drinks, Health & Beauty, Miscellaneous, Retail and Travel & Vacations).
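If you want to reproduce this number outside the tool, accuracy is simply the share of samples whose predicted tag matches the tag you assigned. Here is a minimal sketch; the two lists are made-up data chosen to give 17 correct predictions out of 20, and are not taken from the model above.

```python
def accuracy(true_tags, predicted_tags):
    """Fraction of samples whose predicted tag equals the true tag."""
    if len(true_tags) != len(predicted_tags):
        raise ValueError("both lists must have the same length")
    if not true_tags:
        raise ValueError("need at least one sample")
    correct = sum(t == p for t, p in zip(true_tags, predicted_tags))
    return correct / len(true_tags)


# Hypothetical data: 10 Retail and 10 Food & Drinks samples.
true_tags = ["Retail"] * 10 + ["Food & Drinks"] * 10
predicted_tags = (
    ["Retail"] * 9 + ["Food & Drinks"] * 1   # predictions for the Retail samples
    + ["Food & Drinks"] * 8 + ["Retail"] * 2  # predictions for the Food & Drinks samples
)

acc = accuracy(true_tags, predicted_tags)
print(f"accuracy: {acc:.0%}")
```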
Tips for improving the Accuracy:
Accuracy on its own is not a good metric; you also have to look at precision and recall. You can have a classifier with outstanding accuracy but still have tags with bad precision and recall.
Precision and Recall are useful metrics to check the accuracy of each tag:
If a tag has low precision, it means that samples from other sibling tags were predicted as this tag, also known as false positives.
If a tag has low recall, it means that samples from this tag were predicted as other sibling tags, also known as false negatives.
Usually, there's a trade-off between precision and recall for a particular tag: if you try to increase precision, you may end up lowering recall, and vice versa.
By using the confusion matrix, you can see the false positives and false negatives of your model.
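The definitions above can be sketched in a few lines of Python: count how often each (true tag, predicted tag) pair occurs, which is the confusion matrix, and derive per-tag precision and recall from it. The data is made up for illustration, and the function is our own sketch, not MonkeyLearn's implementation.

```python
from collections import Counter


def per_tag_metrics(true_tags, predicted_tags):
    """Compute TP, FP, FN, precision and recall for every tag."""
    # Confusion matrix as counts of (true tag, predicted tag) pairs.
    confusion = Counter(zip(true_tags, predicted_tags))
    tags = sorted(set(true_tags) | set(predicted_tags))
    metrics = {}
    for tag in tags:
        tp = confusion[(tag, tag)]
        # False positives: other tags' samples predicted as this tag.
        fp = sum(n for (t, p), n in confusion.items() if p == tag and t != tag)
        # False negatives: this tag's samples predicted as another tag.
        fn = sum(n for (t, p), n in confusion.items() if t == tag and p != tag)
        metrics[tag] = {
            "tp": tp,
            "fp": fp,
            "fn": fn,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics


# Hypothetical data: 10 Retail and 10 Food & Drinks samples.
true_tags = ["Retail"] * 10 + ["Food & Drinks"] * 10
predicted_tags = (
    ["Retail"] * 9 + ["Food & Drinks"] * 1
    + ["Food & Drinks"] * 8 + ["Retail"] * 2
)

metrics = per_tag_metrics(true_tags, predicted_tags)
for tag, m in metrics.items():
    print(f"{tag}: precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

In this made-up data, Retail has two false positives and one false negative, so its recall (9 of 10) is higher than its precision (9 of 11), while Food & Drinks shows the opposite pattern.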
Tips for improving Precision and Recall:
After selecting a tag, you can see its true positives, true negatives, false positives, and false negatives:
In the previous example, we can see four samples that weren't originally tagged as Food & Drinks but were predicted to belong to this tag (false positives). On the other hand, 24 samples that were originally tagged as Food & Drinks were predicted as other tags (false negatives).
You can click these numbers to see the corresponding samples in the Samples section:
Here you can fix the problem as we described in the previous sections when the solution is to tag or retag samples:
You can see the keywords correlated to each tag by selecting the corresponding tag in the Stats tab:
Tips on improving the Keywords:
You can set special parameters in the classifier that affect its behavior and can improve the prediction accuracy considerably.
Tips on improving Parameters:
Machine learning is a powerful technology, but to have an accurate model you may need to iterate until you achieve the results you are looking for.
To achieve the minimum accuracy, precision and recall required, you will need to iterate the process from step 1 to 5, that is:
Refine your tags.
Gather more data.
Tag more data.
Upload the new data and retrain the classifier.
Test and improve:
You can do this in two ways:
Besides adding data, you can also improve your model by:
Text data is key in this process; if you train your algorithm with bad examples, the model will make plenty of mistakes. But if you can build a quality dataset, your model will be accurate and you will be able to automate the analysis of text data with machines.
January 31st, 2017