Building a quality machine learning model for text classification can be a challenging process. You need to build a training dataset, test different parameters for your model, and fix its confusions, among other things.
In this post, we describe how you can successfully train text classifiers with machine learning using MonkeyLearn. The process is divided into 5 major steps:
- Defining your category tree,
- Data Gathering,
- Data Tagging,
- Training the Classifier,
- Testing & Improving the Classifier.
1- Define your Category Tree
What are the categories or tags that you want to assign to your texts? This is the first question you need to answer when you start working on your text classifier.
Choosing your categories
Let’s take a simple example: you want to classify daily deals from different websites. In this example, your categories for the different deals might be:
- Food & Drinks
- Health & Beauty
- Travel & Vacations
Now let’s imagine that you are interested in sentiment analysis. You might want to have the following categories:
In contrast, if you are interested in classifying support tickets for an e-commerce site, you might want to design a category tree that includes:
- Shipping issue
- Billing issue
- Product availability
- Discounts, Promo Codes and Gift Cards
Sometimes you already know which categories you want to work with (for example, if you are interested in sentiment analysis), but sometimes you don’t. In these cases, you first need to explore and understand your data in order to determine which categories are appropriate for your model.
Structure your categories
A key part of this process is giving a proper structure to your categories. When you want to be more specific and use sub-categories, you will need to define a hierarchical tree that organizes your categories and sub-categories.
Going back to the example of classifying daily deals, you can organize the category tree in the following way:
- Food & Drinks
- Health & Beauty
- Hair & Skin
- Spa & Massage
- TV & Video
- Travel & Vacations
- Flight Tickets
Tips for your category tree
When you design your category tree, these are some of the things you need to take into consideration:
Organize your categories according to their semantic relations. For example, Basketball and Baseball should be sub-categories of Sports because they are specific types of sports. A well-structured category tree can make a great difference and will be a huge help in making accurate predictions with your classifier.
- Avoid overlapping
Use disjoint categories and avoid defining categories that are ambiguous or overlapping: there should be no doubt about which category a text belongs in. Overlap between your categories will confuse your model and negatively affect the accuracy of its predictions.
- Don’t mix classification criteria
Use a single classification criterion per model. Imagine that you want to categorize companies based on their description. Your categories could be things like B2B, B2C, Enterprise, Finance, Media, Construction, etc. These mix two criteria, so you should build two separate models: a) one to classify a company according to who its customers are (B2C, B2B, Enterprise) and b) another to classify a company according to the industry vertical it operates in (Finance, Media, Construction). Each model has its own criterion and purpose.
- Start small and then go big
If it’s your first time training a machine learning classifier, we recommend starting with a simple model. Complex models can take more effort to get working well enough to make accurate predictions. Start with a small number of categories (<10) and up to 2 levels of categories.
Once this simple model works as expected, try adding a few more categories or a third level of categories. From there, you can keep iterating, adding more categories as you need them.
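As a rough illustration of the "start small" tip, the sketch below represents a category tree as nested dictionaries and checks it against the suggested limits (fewer than 10 categories, at most 2 levels). The tree contents and the helper names `count_categories`, `depth` and `is_simple_enough` are assumptions made for this example; they are not part of MonkeyLearn.

```python
# Hypothetical category tree: each key is a category, each value holds its
# sub-categories (an empty dict means the category has no children).
category_tree = {
    "Food & Drinks": {},
    "Health & Beauty": {"Hair & Skin": {}, "Spa & Massage": {}},
    "Travel & Vacations": {"Flight Tickets": {}},
}


def count_categories(tree):
    """Count every category and sub-category in the tree."""
    return sum(1 + count_categories(children) for children in tree.values())


def depth(tree):
    """Return the number of category levels (0 for an empty tree)."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())


def is_simple_enough(tree, max_categories=9, max_levels=2):
    """Check the tree against the 'fewer than 10 categories, 2 levels' tip."""
    return count_categories(tree) <= max_categories and depth(tree) <= max_levels


print(count_categories(category_tree), depth(category_tree))
print(is_simple_enough(category_tree))
```

A check like this is only a convenience; whether a tree is well designed still depends on the semantic relations between its categories.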
2- Data Gathering
Once you have defined your category tree, the next step is to obtain training data, that is, the texts that you want to use as training samples and that are representative of future texts that you would want to classify automatically with your model.
The following sources are suggested to perform the data gathering:
You can use internal data, like files, documents, spreadsheets, emails, support tickets, chat conversations and more. You may already have this data in your own databases or tools that you use every day:
Customer Support / Interaction:
- Hubspot CRM
You usually have ways to export this data either by using an export function into CSV files or by using an API.
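For instance, once you have a CSV export, pulling out the texts can be sketched with Python's standard `csv` module. The column names (`ticket_id`, `body`) and the sample rows are made up for this example; a real export will have its own columns.

```python
import csv
import io

# Stand-in for an exported file. With a real export you would instead use
# something like: open("export.csv", newline="", encoding="utf-8")
exported_csv = io.StringIO(
    "ticket_id,body\n"
    "101,My order has not arrived yet\n"
    "102,\n"
    "103,I was charged twice for the same item\n"
)


def load_texts(file_obj, text_column):
    """Read one column of a CSV export, skipping rows with empty text."""
    texts = []
    for row in csv.DictReader(file_obj):
        text = (row.get(text_column) or "").strip()
        if text:
            texts.append(text)
    return texts


texts = load_texts(exported_csv, "body")
print(texts)
```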
Data is everywhere and you can automatically get data from the web, by using web scraping tools, APIs and open data sets.
Web scraping frameworks
If you have coding experience, you can use a web scraping framework to build your own scraper to get data from the web. These are some of the most used tools for web scraping:
Visual web scraping tools
If you don’t have coding experience, as an alternative you can use some of these visual tools where you can build a web scraper with just a few clicks:
Besides scraping, you can use APIs to connect with some websites or social media platforms to get the valuable data you need to train your machine learning classifier. For example, these are some useful APIs to get training data:
- New York Times
- The Guardian
Tools like Zapier or IFTTT can be very helpful for getting your training data, especially if you don’t have coding experience. You can use them to connect to the tools that you use every day through their APIs, without writing code 🙂
These are just some examples; you can find data everywhere, so consider using any particular tool or technique that you are familiar with.
3- Data Tagging
After getting the data, you’ll have to tag the data into the corresponding categories to create a training dataset for teaching your classifier.
This can be a manual process, but it’s key for training your model. By tagging the data, you will be teaching the machine learning algorithm that for a particular input (text), you expect a particular output (category).
The accuracy of a machine learning classifier depends on how accurate this initial tagging of your training dataset is.
Data Tagging Tools
The following tools are suggested to perform data tagging:
- Using MonkeyLearn’s GUI in the Samples section after creating the classifier and uploading the data (see next section for details).
- Excel / LibreOffice / Google Spreadsheets.
- OpenRefine.
- Any particular tool or technique that you are familiar with.
How much training data do we need?
The number of training samples that you need for your model strongly depends on your particular use case, that is, the complexity of the problem and the number of categories you want to use within your model.
For example, training a model for sentiment analysis of tweets is not the same as training a model to identify the topics of product reviews. Sentiment analysis is a much harder problem to solve and it needs much more training data. Analyzing tweets is also far more challenging than analyzing well-written reviews.
In short, the more training data you have, the better. We suggest starting by tagging at least 20 samples per category and taking it from there. Depending on how accurate your classifier turns out, add more data. For topic detection, we have seen some accurate models with 200 to 500 training samples in total. Sentiment analysis models usually need at least 3,000 training samples before you start seeing acceptable accuracy.
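To keep an eye on the "at least 20 samples per category" starting point, you could count the tagged samples per category and list the ones that fall short. This is a minimal sketch; the function name `categories_needing_data` and the sample data are assumptions for illustration.

```python
from collections import Counter

MIN_SAMPLES_PER_CATEGORY = 20  # starting point suggested above


def categories_needing_data(tagged_samples, categories, minimum=MIN_SAMPLES_PER_CATEGORY):
    """Return {category: current count} for categories below the minimum."""
    counts = Counter(category for _text, category in tagged_samples)
    return {c: counts[c] for c in categories if counts[c] < minimum}


# Made-up dataset: (text, category) pairs.
tagged_samples = (
    [("sample text", "Shipping issue")] * 25
    + [("sample text", "Billing issue")] * 8
)
categories = ["Shipping issue", "Billing issue", "Product availability"]

shortfall = categories_needing_data(tagged_samples, categories)
print(shortfall)
```

Passing the full list of categories matters here: a category with no samples at all would otherwise never show up in the counts.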
Quality over quantity
It’s much better to start with fewer samples that you are 100% sure are representative of each of your categories and correctly tagged than to add tons of data with lots of errors.
Some of our users add thousands of training samples at once (when they are creating a custom classifier for the first time), thinking that a high volume of data is great for the machine learning algorithm. But by doing that, they don’t really pay attention to the data they use as training samples, and most of the time many of those samples are incorrectly tagged.
It’s like teaching history to a kid with a history book that has many facts that are plain wrong. The kid will learn from this data, but what he learns will be wrong. He won’t really know history, no matter how much he reads and learns from this book.
So, it’s much better to start with a small amount of high-quality training data that is correctly tagged and take it from there. Afterwards, you can work on improving the accuracy of your classifier by adding more quality data.
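One cheap quality check you can run before uploading anything is to look for the same text tagged with different categories, which usually points to a tagging mistake. The sketch below is only one possible check, and the helper name `find_conflicting_tags` and the sample data are hypothetical.

```python
from collections import defaultdict


def find_conflicting_tags(tagged_samples):
    """Return texts that appear with more than one category.

    Texts are compared after lowercasing and collapsing whitespace, so
    trivially different copies of the same text are treated as one.
    """
    tags_by_text = defaultdict(set)
    for text, category in tagged_samples:
        normalized = " ".join(text.lower().split())
        tags_by_text[normalized].add(category)
    return {
        text: sorted(tags)
        for text, tags in tags_by_text.items()
        if len(tags) > 1
    }


# Made-up samples: the first two are the same text with different tags.
samples = [
    ("Where is my package?", "Shipping issue"),
    ("where is my  package?", "Billing issue"),
    ("I was charged twice", "Billing issue"),
]

conflicts = find_conflicting_tags(samples)
print(conflicts)
```

This does not catch every wrong tag, of course; it only surfaces the contradictions that can be detected automatically.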
Saving the tagged data
In order to use the training samples within MonkeyLearn, the data must be saved in a CSV or Excel file with 2 columns:
- 1 column for the text (input),
- 1 column for the category (expected output).
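If your tagged samples live in code rather than in a spreadsheet, a two-column CSV like the one described above can be written with Python's `csv` module, which takes care of quoting commas and quotation marks inside the text. This is a sketch: the sample rows are invented, and whether a header row is expected is not covered here, so none is written.

```python
import csv
import io


def write_training_csv(tagged_samples, file_obj):
    """Write (text, category) pairs as a two-column CSV without a header."""
    writer = csv.writer(file_obj)
    for text, category in tagged_samples:
        writer.writerow([text, category])


# Made-up tagged samples; note the comma and the quotes inside the texts.
samples = [
    ("Two nights in Rome, flights included", "Travel & Vacations"),
    ('50% off "deep tissue" massage', "Health & Beauty"),
]

# An in-memory buffer keeps the example self-contained. For a real file use:
# open("training_data.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_training_csv(samples, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```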
Finally, you will need to upload the CSV / Excel file to MonkeyLearn, so you can train your model. You can learn more about CSV / Excel format here.
Alternatively, if you have coding skills and feel more comfortable doing so, you can use MonkeyLearn’s API to upload your data after you have created your classifier.
4- Training the Classifier
After gathering and tagging the data, you will be ready to train a classifier using MonkeyLearn. For this, you should follow these steps:
- Create a new classifier using the MonkeyLearn UI. In this step, a creation wizard will ask you for some information related to your project.
- Upload the training data using the CSV/Excel file with the data that you gathered and tagged.
- Train your classifier: it will take a few seconds or minutes depending on the complexity of the category tree and amount of data uploaded. You’ll see a progress bar on the top.
5- Testing & Improving the Classifier
After the model is trained, you will see a series of metrics in the Statistics area that show how well the classifier would predict new data. These metrics are key to understanding your model and how you can improve it.
You can see examples of these metrics in this public Deals Classifier module.
The accuracy is the percentage of samples that were predicted in the correct category:
It’s a metric that shows how well a parent category distinguishes between its children categories. In the previous example, the Root category has an accuracy of 80% when distinguishing between its 6 children (Entertainment & Recreation, Food & Drinks, Health & Beauty, Miscellaneous, Retail and Travel & Vacations).
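To make the definition concrete, accuracy can be computed from two parallel lists: the categories you tagged and the categories the model predicted. This is a minimal sketch with invented data, not the way MonkeyLearn computes its statistics internally.

```python
def accuracy(actual, predicted):
    """Fraction of samples whose predicted category matches the tagged one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up example: 4 of the 5 predictions match the tagged category.
actual = ["Retail", "Retail", "Food & Drinks", "Food & Drinks", "Health & Beauty"]
predicted = ["Retail", "Retail", "Food & Drinks", "Retail", "Health & Beauty"]

print(accuracy(actual, predicted))  # 0.8
```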
Tips on improving the Accuracy:
- Add more training samples to its children categories.
- Retag samples that might be incorrectly tagged into the children categories (see confusion matrix section below).
- Sometimes sibling categories could be too ambiguous. If possible, we recommend merging those categories.
Accuracy on its own is not a good metric; you also have to look at precision and recall. You can have a classifier with very good accuracy but still have categories with bad precision and recall.
Precision and Recall
Precision and Recall are useful metrics to check the accuracy on each child category:
If a child category has low precision, it means that samples from other sibling categories were predicted as this child category, also known as false positives.
If a child category has low recall, it means that samples from this child category were predicted as other sibling categories, also known as false negatives.
Usually, there’s a trade-off between precision and recall in a particular category: if you try to increase precision, you could end up doing so at the cost of lowering recall, and vice versa.
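The two definitions above translate directly into counts of true positives, false positives and false negatives for a given category. The following sketch uses invented data and a hypothetical helper name, `precision_recall`.

```python
def precision_recall(actual, predicted, category):
    """Return (precision, recall) for one category.

    precision = TP / (TP + FP): of the samples predicted as the category,
    how many were really tagged with it.
    recall = TP / (TP + FN): of the samples tagged with the category,
    how many were predicted as it.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == category and p == category for a, p in pairs)
    fp = sum(a != category and p == category for a, p in pairs)
    fn = sum(a == category and p != category for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: two Travel samples were wrongly predicted as Food.
actual = ["Travel", "Travel", "Travel", "Food", "Food", "Retail"]
predicted = ["Travel", "Food", "Food", "Food", "Food", "Retail"]

print(precision_recall(actual, predicted, "Food"))
print(precision_recall(actual, predicted, "Travel"))
```

In this example Food has perfect recall but low precision (it absorbs samples from Travel), while Travel has perfect precision but low recall.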
By using the confusion matrix, you can see the false positives and false negatives of your model.
Tips on improving Precision and Recall:
- By using the confusion matrix, you can explore the false positives and false negatives of your model.
- If a sample was initially tagged as child category X but was correctly predicted as child category Y, move that sample to child category Y.
- If the sample was incorrectly predicted as child category Y, try to make the classifier learn more about the difference by adding more samples to both category X and category Y.
- Check that the keywords associated with child categories X and Y are correct (see Keyword Cloud section to see how to fix that).
After selecting a parent category in the category tree, you can see the confusion matrix which shows the confusion between the actual category and the predicted category for its children categories:
In the previous example, we can see that 4 samples that were tagged as Travel & Vacations were incorrectly predicted as Entertainment (the red 4 in the bottom-left corner of the matrix).
You get perfect results if you obtain a confusion matrix that has non-zero numbers only in its diagonal.
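A confusion matrix is simply a count of (actual category, predicted category) pairs, so it can be sketched in a few lines. In the example below only the 4 Travel & Vacations samples predicted as Entertainment come from the text above; all other counts, and the helper names, are made up for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of categories occurs."""
    return Counter(zip(actual, predicted))


def confusions(matrix):
    """Return the off-diagonal cells, largest first: the errors to review."""
    errors = [(pair, n) for pair, n in matrix.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: item[1], reverse=True)


def is_perfect(matrix):
    """True when all non-zero counts lie on the diagonal."""
    return not confusions(matrix)


actual = ["Travel & Vacations"] * 10 + ["Entertainment"] * 5
predicted = (
    ["Travel & Vacations"] * 6   # tagged Travel, predicted correctly
    + ["Entertainment"] * 4      # tagged Travel, predicted Entertainment
    + ["Entertainment"] * 5      # tagged Entertainment, predicted correctly
)

matrix = confusion_matrix(actual, predicted)
print(confusions(matrix))
print(is_perfect(matrix))
```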
You can click that particular number to see the corresponding samples in the Samples section:
Here you can fix the problem as we described in the previous sections when the solution is to tag or retag samples:
- You can select samples in the left checkbox or use the shortcut X or Space keys.
- You can paginate by using the left and right arrow keys.
- You can delete or move samples to categories by using the Actions menu after selecting the corresponding samples.
- See shortcuts by hitting Ctrl + h.
You can see the keywords correlated to each category by selecting the corresponding category in the Tree section:
Tips on improving the Keywords:
- Check whether the keywords (the dictionary used to represent samples) correlated with each category make sense.
- Discover keywords that should not be in the dictionary or should not be correlated with that particular category.
- You can see a more detailed list of keywords and their relevance by clicking the Keyword List link below the keyword cloud.
- You can click a particular keyword (either in the cloud or the list) to filter the samples that match with that particular keyword.
- Filter undesired keywords by adding the particular string into the stopwords list (see Parameters section below).
- If a keyword that is useful to represent your category is missing from your list of keywords, try adding more data to your model that uses that specific term.
You can set special parameters in the classifier that affect its behavior and can considerably improve the prediction accuracy.
Tips on improving Parameters:
- Add keywords to the stopword list if you don’t want the classifier to use them as keywords.
- Use Multinomial Naive Bayes while developing the classifier, as it gives you more insight into the predictions and more debugging information. Switch to Support Vector Machines when you finish developing the classifier to get some extra accuracy.
- Enable stemming (to transform words into their roots) when useful for your particular case.
- Try increasing the max features parameter, up to a maximum of 20,000.
- Don’t filter default stopwords if you’re working with sentiment analysis.
- Enable Preprocess social media when working with social media texts like tweets or Facebook comments.
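To give an intuition for what a stopword list and a max features limit do, here is a simplified sketch that builds a keyword vocabulary from raw texts. It illustrates the general idea only and is not MonkeyLearn's implementation; the tokenization rule, the function name `build_vocabulary` and the sample texts are assumptions.

```python
import re
from collections import Counter


def build_vocabulary(texts, stopwords=frozenset(), max_features=20000):
    """Return the most frequent terms, ignoring stopwords.

    Only the `max_features` most frequent terms are kept, so rare terms
    are dropped first when the limit is low.
    """
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return [term for term, _count in counts.most_common(max_features)]


texts = [
    "Free shipping on the spa deal",
    "The spa deal ends today",
]

# With stopwords, "the" and "on" can never become keywords.
vocabulary = build_vocabulary(texts, stopwords={"the", "on"})
top_terms = build_vocabulary(texts, stopwords={"the", "on"}, max_features=2)
print(vocabulary)
print(top_terms)
```

The same sketch hints at why filtering default stopwords can hurt sentiment analysis: words such as "not" are typically on stopword lists, yet they change the sentiment of a sentence.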
Machine learning is a really powerful technology but in order to have an accurate model, you may need to iterate until you achieve the results you are looking for.
In order to achieve the minimum accuracy, precision and recall you require, you will need to iterate over steps 1 to 5, that is:
- Refine your Category tree.
- Gather more data.
- Tag more data.
- Upload the new data and retrain the classifier.
- Test and improve:
  - Metrics (accuracy, precision and recall).
  - False positives & false negatives.
  - Confusion matrix.
  - Keyword cloud and keyword list.
The additional data can be tagged in two ways:
- Manually tagging the additional data.
- Bootstrapping, that is, use the currently trained model to classify untagged samples and then verify that the prediction is correct. Usually verifying the tags is easier (and faster) than manually tagging them from scratch.
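The bootstrapping option can be sketched as a small review loop: the current model suggests a category for each untagged text, a person confirms or corrects the suggestion, and the verified pairs become new training data. In this sketch `predict` is a hypothetical callable that returns a (category, confidence) pair; `toy_predict` is a stand-in for a trained model, not a MonkeyLearn function.

```python
def bootstrap_queue(untagged_texts, predict):
    """Pre-tag texts with the current model, least confident first.

    `predict` is assumed to take a text and return (category, confidence).
    """
    suggestions = [(text, *predict(text)) for text in untagged_texts]
    return sorted(suggestions, key=lambda s: s[2])


def apply_review(queue, corrections):
    """Keep the suggested category unless the reviewer corrected it."""
    return [(text, corrections.get(text, category)) for text, category, _conf in queue]


def toy_predict(text):
    """Stand-in for a trained classifier (keyword rule, made-up confidences)."""
    if "flight" in text.lower():
        return "Travel & Vacations", 0.9
    return "Food & Drinks", 0.55


untagged = ["Cheap flight to Lisbon", "Spa day for two"]
queue = bootstrap_queue(untagged, toy_predict)

# The reviewer only has to fix the suggestions that are wrong.
corrections = {"Spa day for two": "Health & Beauty"}
verified = apply_review(queue, corrections)
print(verified)
```

Sorting by confidence is a design choice made for this example: reviewing the least confident predictions first tends to surface the mistakes sooner.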
Besides adding data, you can also improve your model by:
- Fixing the confusions of your model,
- Improving the keywords of your categories,
- Finding the best parameters for your use case.
At the end of the day, training data is key in this process; if you train your algorithm with bad examples, the model will make plenty of mistakes. But if you are able to build a quality dataset, your model will be accurate and you will be able to automate the analysis of text data with machines.