In this article, we explore the advantages of using support vector machines for text classification and help you get started with SVM-based models in MonkeyLearn.
A support vector machine (SVM) is an algorithm that determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not belong to it.
It can be applied to any kind of vector, whatever kind of data the vectors encode. This means that, in order to leverage the power of SVM text classification, texts have to be transformed into vectors.
Now, what are vectors?
Vectors are (sometimes huge) lists of numbers which represent a set of coordinates in some space.
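To make the idea of "texts as lists of numbers" concrete, here is a minimal sketch of one common way to vectorize text, a bag-of-words count. The article does not say which representation MonkeyLearn uses, so the technique, the sample sentences, and the helper names `build_vocabulary` and `vectorize` are all assumptions chosen for illustration.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map every distinct lowercase word in the texts to a fixed position."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}

def vectorize(text, vocabulary):
    """Turn one text into a list of word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    # Dicts keep insertion order, so entries follow the vocabulary positions.
    return [counts.get(word, 0) for word in vocabulary]

# Hypothetical sample texts, not taken from any real dataset.
texts = ["the price is too high", "the staff is friendly"]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(text, vocabulary) for text in texts]

for text, vector in zip(texts, vectors):
    print(text, "->", vector)
```

Each text becomes a point in a space with one dimension per vocabulary word, which is why real vectors can get huge: a realistic vocabulary has thousands of entries.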
So, when SVM determines the decision boundary we mentioned above, it decides where to draw the best “line” (or the best hyperplane) that divides the space into two subspaces: one for the vectors which belong to the given category and one for the vectors which do not belong to it.
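A hyperplane can be written as a list of weights plus a bias, and the side a vector falls on is given by the sign of a weighted sum. The sketch below illustrates this with hand-picked numbers; the weights, bias, and function names are assumptions for illustration, not values learned by any real model.

```python
def decision_value(weights, bias, vector):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(w * x for w, x in zip(weights, vector)) + bias

def belongs_to_category(weights, bias, vector):
    """A vector is assigned to the category when it falls on the positive side."""
    return decision_value(weights, bias, vector) >= 0

# Hypothetical boundary in two dimensions: the line x - y = 0.
weights = [1.0, -1.0]
bias = 0.0

print(belongs_to_category(weights, bias, [3.0, 1.0]))  # below the line
print(belongs_to_category(weights, bias, [1.0, 3.0]))  # above the line
```

In two dimensions the boundary is a line; with more dimensions the same formula describes a plane or a hyperplane, and nothing else in the code needs to change.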
So, provided we can find vector representations which encode as much information from our texts as possible, we will be able to apply the SVM algorithm to text classification problems and obtain very good results.
Say, for example, the blue circles in the graph below are representations of training texts which talk about the Pricing of a SaaS Product and the red triangles are representations of training texts which do not talk about that. What would the decision boundary for the Pricing category look like?
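The best decision boundary is the one that leaves the widest possible margin between itself and the closest points of each group. It would look like this: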
Now that the algorithm has determined the decision boundary for the category you want to analyze, you only have to obtain the representations of all of the texts you would like to classify and check which side of the boundary those representations fall on.
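The two steps just described, finding a boundary from training points and then checking which side new points fall on, can be sketched in a few lines. This is a simplified linear SVM trained by subgradient descent on the hinge loss, not MonkeyLearn's implementation; the toy points, the hyperparameter values, and the function names are assumptions made for illustration.

```python
def train_linear_svm(vectors, labels, learning_rate=0.01, regularization=0.01, epochs=200):
    """Learn weights and a bias with subgradient descent on the hinge loss."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            score = sum(w * x for w, x in zip(weights, vector)) + bias
            # Regularization step: smaller weights correspond to a wider margin.
            weights = [w * (1 - learning_rate * regularization) for w in weights]
            if label * score < 1:
                # The point is misclassified or inside the margin: move the boundary.
                weights = [w + learning_rate * label * x for w, x in zip(weights, vector)]
                bias += learning_rate * label
    return weights, bias

def predict(weights, bias, vector):
    """Return +1 if the vector falls on the category's side, otherwise -1."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if score >= 0 else -1

# Hypothetical 2D training points: +1 = talks about Pricing, -1 = does not.
training_vectors = [[2, 2], [3, 2], [2, 3], [-2, -2], [-3, -2], [-2, -3]]
training_labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(training_vectors, training_labels)
print(predict(weights, bias, [4, 3]))    # a new point near the Pricing group
print(predict(weights, bias, [-3, -4]))  # a new point near the other group
```

The toy points are centered around the origin to keep the example short. Real text vectors have many more dimensions, but training and prediction work the same way.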
Creating a text classifier using SVM is easy and straightforward with MonkeyLearn, a no-code text analysis solution.
Sign up for free and get started.
Click on create a model. You will be prompted to choose the model type you would like to create. Let’s choose Classifier:
Now, you will have to choose the type of classification task you would like to perform. In this mini tutorial, we are going to show you how to create a model that classifies the topics discussed in hotel reviews, so let’s choose Topic Classification. However, bear in mind that text classification using SVM works just as well for other tasks, such as sentiment analysis or intent classification:
Now it’s time to import your data:
Once we’ve chosen our CSV file with the sample dataset, a screen like the one below will appear with a preview of the data. Let’s click Continue:
The next step is to define the tags we want to use in our classifier. Let’s define a few tags like Location, Comfort & Facilities, and Staff:
Now, it’s time to tag data to train our classifier. By tagging some examples, SVM will learn that for a particular input (text), we expect a particular output:
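Since an SVM separates the vectors that belong to one category from those that do not, a set of tagged examples can be viewed as one yes/no training set per tag. The sketch below shows that view; the article does not describe how MonkeyLearn handles tags internally, so this one-versus-rest layout, the sample reviews, and the helper name `binary_labels` are assumptions for illustration.

```python
# Hypothetical tagged hotel reviews: (input text, expected output tag).
tagged_examples = [
    ("The hotel is a short walk from the beach", "Location"),
    ("The bed was comfortable and the pool was clean", "Comfort & Facilities"),
    ("The receptionist was rude", "Staff"),
    ("Great spot right next to the train station", "Location"),
]

def binary_labels(examples, category):
    """Return +1 for texts tagged with `category` and -1 for all the others."""
    return [1 if tag == category else -1 for _, tag in examples]

tags = sorted({tag for _, tag in tagged_examples})
labels_per_tag = {tag: binary_labels(tagged_examples, tag) for tag in tags}

for tag, labels in labels_per_tag.items():
    print(tag, labels)
```

Each list of +1 and -1 labels, paired with the vectors of the same texts, is what a binary SVM for that tag would learn from.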
Once you have finished preparing your training data, you will have to name your classifier before you can keep training it, start using it, or change its settings. Type a descriptive name in the text box and click Finish.
Since MonkeyLearn uses SVM as the default classification algorithm, you won’t need to change your classifier’s advanced settings at this point unless you would like to make some other adjustments. Just go to Run and try it out. Here’s an SVM text classification example from a hotel review:
Chances are that some results are not as good as you expect, especially if you have not uploaded a lot of training data. Don’t worry! If this happens, go to Build > Data and try uploading more data, tagging it, and trying the classifier again until you get the results you expect.
Using SVM classifiers for text classification tasks can be a really good idea, especially when little training data is available (around a couple of thousand tagged samples).
MonkeyLearn makes it really simple and straightforward to create text classifiers. Within minutes, you'll get great new insights from your data.