Is there any way to get higher accuracy?
Andreas August 5, 2015 at 1:09 pm
Hey guys! MonkeyLearn looks great! Loving the platform so far!
I’m building a classifier for my app and was wondering, is there any way to get higher accuracy? The results so far look very good, but I would appreciate any tips!
These are some of the things you could do to improve the accuracy of your classifier:
Training samples should be representative of the category
You should only use training samples that are really representative of the category and that provide learning value to the prediction model.
If a human reads the training sample and cannot directly associate the text to its category, you shouldn’t use it as a training sample.
For example, let’s imagine that we are building a classifier that does sentiment analysis of restaurant reviews with just 2 categories: ‘Good reviews’ and ‘Bad reviews’.
It’s useless to add a training sample like ‘does the menu include a vegetarian option?’ because, although it’s related to restaurants, a) it’s not a restaurant review and b) it doesn’t express a ‘positive’ or ‘negative’ sentiment.
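To make this concrete, here is a tiny hypothetical training set for that restaurant-review classifier (the samples are made up for illustration). Each kept sample clearly expresses a sentiment a human could associate with its category, while the off-topic question is left out:

```python
# Hypothetical training set for a restaurant-review sentiment classifier.
# Each sample clearly expresses a positive or negative opinion, so a human
# (and therefore the model) can directly associate it with its category.
training_samples = [
    ("The food was amazing and the staff were friendly.", "Good reviews"),
    ("Best pasta I've had in years, will come back!", "Good reviews"),
    ("Cold food, rude waiter, and we waited an hour.", "Bad reviews"),
    ("Terrible experience, I would not recommend it.", "Bad reviews"),
]

# Off-topic text like this carries no sentiment signal, so it is excluded:
excluded = "Does the menu include a vegetarian option?"
```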
Quality of training samples
As mentioned in the last point, each sample that we upload to MonkeyLearn has to be representative of its category. It’s much better to start with fewer samples, being 100% sure that they are really representative of each of your categories, than to add tons of data along with tons of noise.
Some of our users add thousands of training samples at once (when creating a custom classifier for the first time), thinking that a high volume of data is great for the machine learning algorithm. But by doing that, they don’t really pay attention to the data they upload as training samples, and most of the time many of those samples are incorrectly tagged or categorized.
It’s like teaching history to a kid with a history book full of facts that are plain wrong. The kid will ‘learn’ from this, but he will be learning from wrong information; he will never really know history, no matter how much he reads and studies that book. A machine learning classifier is in the same situation.
So it’s much better to start with just a few quality training samples that are really representative of each category, and take it from there. Afterwards, you can improve the accuracy of the classifier by adding more quality data (the more the better).
Add more training data
Quantity of data, alongside quality of data, is the most important factor in improving the accuracy of your classifier. The more training samples we add to our classifier, the better.
Basically, when we add more training samples to the classifier, we give the algorithm more information to learn from for each category.
Overlapping of samples
Try to avoid training samples that are ambiguous or that overlap with samples in other categories. There should be no doubt about which category a training sample belongs to; each one has to be really representative of its category.
There are some training samples in your classifier that are really short and don’t really help a machine learning algorithm learn to categorize new text.
For example, in the ‘feedback’ category I have seen samples like “No!” or “Yeah!” that don’t help an algorithm (or even a human) learn to identify a new text as feedback.
We recommend using only the longer training samples (usually 4 or more words).
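A simple way to apply that recommendation is to filter out short samples before uploading them. A minimal sketch (the 4-word threshold comes from the suggestion above; the samples are made up):

```python
# Drop training samples shorter than 4 words, the minimum length
# suggested above (the exact threshold is a tunable assumption).
MIN_WORDS = 4

samples = [
    ("Yeah!", "feedback"),
    ("No!", "feedback"),
    ("The new dashboard layout is confusing to navigate", "feedback"),
    ("Please add an option to export reports as CSV", "feedback"),
]

filtered = [(text, label) for text, label in samples
            if len(text.split()) >= MIN_WORDS]
```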
The algorithm you should use depends on the classifier you are trying to build. Sometimes Support Vector Machines are the way to go, but in some cases Multinomial Naive Bayes may perform better.
We recommend trying out both algorithms and see which one works better for your particular use case.
Check out the performance metrics to see which categories may need more work on their training samples. You shouldn’t look only at ‘accuracy’ but also at ‘precision’ and ‘recall’ to really understand how well a category is working.
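Per-category precision and recall can be computed the same way MonkeyLearn reports them; a sketch with scikit-learn and made-up predictions:

```python
# Precision: of the samples predicted as a category, how many were right.
# Recall: of the samples truly in a category, how many were found.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["good", "good", "bad", "bad", "good", "bad"]
y_pred = ["good", "bad",  "bad", "bad", "good", "good"]

precision, recall, _, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["good", "bad"])
# precision[0] / recall[0] refer to "good"; index 1 refers to "bad".
```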
Keyword cloud and keyword list
After you train your model, you should check the keyword cloud and the keyword lists to see which terms and words the algorithm has learned to associate with the different Futurama characters. Do those words make sense? Are there any words that should or shouldn’t be there? This is a great way to understand your machine learning model and to see what changes you should make to your categories.
A confusion matrix is a table that shows counts of how the test samples were classified against the category they actually belong to. It gives you a quick idea of which categories need more work; it’s an excellent tool when you are in the process of cleaning up your samples.
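A minimal sketch of reading a confusion matrix, using scikit-learn and made-up predictions:

```python
# Rows are true categories, columns are predicted categories:
# cm[0] shows where the true "good" samples ended up.
from sklearn.metrics import confusion_matrix

y_true = ["good", "good", "good", "bad", "bad", "bad"]
y_pred = ["good", "good", "bad",  "bad", "bad", "good"]

cm = confusion_matrix(y_true, y_pred, labels=["good", "bad"])
# Off-diagonal counts are the misclassifications worth investigating.
```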
Doing manual tests of your classifier, to detect what was accurate and what was inaccurate, is one of the best ways to understand the strengths and weaknesses of your classifier and to see which areas you should focus on.
Hope this helps. Please let me know how it goes :)
Does the number of samples for each category need to be also representative in number when using Bayes?
For example; I have two categories and one is 10x more frequent than the other…does it make a difference to have 10x samples to train Bayes well (thinking about the probability of the prior)?
Thanks – Luis
Hi Luis, good question.
If you are not using the ‘normalize weights’ parameter within your module, it does make a difference in the probabilities of NB when you have two categories and one is 10x more frequent than the other.
Also, we usually suggest balancing the number of samples you use for each category. A disproportionate number of samples is only recommended when it reflects reality (for example, when category A is genuinely much more common than category B).
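The prior effect Luis asks about can be sketched with scikit-learn's MultinomialNB (a stand-in; `fit_prior=False` plays roughly the role of ‘normalize weights’ here, and the data is artificial so the features carry no signal):

```python
# With identical, uninformative features, the prediction is driven entirely
# by the class prior: a 10x imbalance skews the probabilities toward the
# frequent class unless the prior is forced to be uniform.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.ones((11, 1))            # identical features for every sample
y = ["A"] * 10 + ["B"]          # category A is 10x more frequent

with_prior = MultinomialNB(fit_prior=True).fit(X, y)
uniform = MultinomialNB(fit_prior=False).fit(X, y)

print(with_prior.predict_proba([[1.0]]))  # skewed toward "A"
print(uniform.predict_proba([[1.0]]))     # roughly 50/50
```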