Data is everywhere, and in massive quantities. We are living in an era of data explosion, where millions of tweets, articles, comments, and reviews are published every day.

Developers are taking advantage of this abundance, using techniques like web scraping to do all kinds of cool things. But sometimes web scraping is not enough; digging deeper and analyzing the data is often needed to unlock its true meaning and discover valuable insights.

In this tutorial we will cover how to use MonkeyLearn and Scrapy to build a machine learning model that helps you analyze vast amounts of scraped web data in a cost-effective way.

Getting started

We will use Scrapy to extract hotel reviews from TripAdvisor and use those reviews as training samples to create a machine learning model with MonkeyLearn. This model will learn to detect if a hotel review is positive or negative and will be able to understand the sentiment of new and unseen hotel reviews.

1. Create a Scrapy spider

The first step is to scrape hotel reviews from TripAdvisor by creating a spider:

New to Scrapy?
If you have never used Scrapy before, visit this article. It’s very powerful yet easy to use, and will allow you to start building web scrapers in no time.

Choose the data you want to scrape with Scrapy
In this tutorial we will use New York City hotel reviews to create our hotel sentiment analysis classifier. For each review we will extract the title, the content and the star rating.


Why the stars? In order to train MonkeyLearn modules, we need data that is already tagged, so the algorithm knows what a positive or a negative review actually looks like. Luckily, the reviewers were kind enough to provide us with this information in the form of stars.

To save the data, we will define a Scrapy item with three fields: “title”, “content” and “stars”:

We also create a spider for filling in these items. We give it the start URL of the New York Hotels page.

Then, we define a function for parsing a single review and saving its data:

Afterwards, we define a function for parsing a page of reviews and then moving on to the next page. You’ll notice that on the reviews page we can’t see the whole review content, just the beginning. We will work around this by following the link to the full review and scraping the data from that page using parse_review:

Finally, we define the main parse function, which will start at the New York hotels main page, and for each hotel it will parse all its reviews:

So, to review: we told our spider to start at the New York hotels main page, follow the links to each hotel, follow the links to each review, and scrape the data. After it is done with each page it will get the next one, so it will be able to crawl as many reviews as we need.

You can view the full code for the spider here.

2. Getting the data

Now that our Scrapy spider is created, we are ready to start crawling and gathering the data.

We tell it to crawl with `scrapy crawl tripadvisor -o scrapyData.csv -s CLOSESPIDER_ITEMCOUNT=10000`.

This will scrape 10,000 TripAdvisor New York City hotel reviews and save them in a CSV file named scrapyData.csv. With that many reviews, it may take a while to finish; feel free to change the amount if you need to.

3. Preparing the data

Now that we have generated our scrapyData.csv file, it’s time to preprocess the data. We’ll do that with Python and the pandas library.

First we import the CSV file into a data frame, remove duplicates, and drop the reviews that are neutral (3 out of 5 stars).

Then we create a new column, full_content, that concatenates the title and the content.

Next we create the column we want to predict, true_category: reviews with more than 3 stars become Good, and reviews with fewer than 3 stars become Bad.

Finally, we keep only the full_content and true_category columns:
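The preprocessing steps above can be sketched together with pandas. A tiny inline sample stands in for scrapyData.csv (which you would load with pd.read_csv) so the snippet is runnable on its own:

```python
import pandas as pd

# In the tutorial you would start from the spider's output:
#   df = pd.read_csv("scrapyData.csv")
# A small inline sample keeps this sketch self-contained.
df = pd.DataFrame({
    "title": ["Great stay", "Never again", "It was ok", "Great stay"],
    "content": ["Loved the room.", "Dirty and loud.", "Average hotel.", "Loved the room."],
    "stars": [5, 1, 3, 5],
})

df = df.drop_duplicates()                   # remove repeated reviews
df = df[df["stars"] != 3]                   # drop neutral (3-star) reviews
df["full_content"] = df["title"] + ". " + df["content"]
df["true_category"] = df["stars"].apply(lambda s: "Good" if s > 3 else "Bad")
df = df[["full_content", "true_category"]]  # keep only what we need
```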

If we take a quick look at the data frame we created, we have 4,913 Good reviews and 4,501 Bad reviews:
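These counts come from a simple value_counts call, shown here on a small stand-in frame with the same columns:

```python
import pandas as pd

# Stand-in for the preprocessed data frame (the real one holds ~9,400 rows)
df = pd.DataFrame({
    "full_content": ["Great stay. Loved the room.", "Never again. Dirty and loud."],
    "true_category": ["Good", "Bad"],
})

# On the full dataset this prints roughly: Good 4913, Bad 4501
print(df["true_category"].value_counts())
```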

This looks about right: the categories are fairly balanced. If you have too few reviews of one category (for instance, 9,000 Good and only 1,000 Bad), it could have a negative impact on the training of your module. To fix this, scrape more bad reviews: run the spider again for a longer time, keep only the bad reviews, and mix them with the data you already have. Alternatively, you could find hotels with mostly bad reviews and scrape those.

Finally, we have to save our dataset as a CSV or Excel file so we can upload it to MonkeyLearn and train our classifier. To train the model we only need the content of the reviews and their categories, so we remove the headers and the index column. We also encode the file in UTF-8:
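Saving with pandas, dropping the header and index (the output filename here is our own choice):

```python
import pandas as pd

df = pd.DataFrame({
    "full_content": ["Great stay. Loved the room.", "Never again. Dirty and loud."],
    "true_category": ["Good", "Bad"],
})

# MonkeyLearn only needs the text and its category, so strip headers and index;
# encoding="utf-8" keeps non-ASCII characters in reviews intact
df.to_csv("hotel-reviews.csv", header=False, index=False, encoding="utf-8")
```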

4. Creating a text classifier

Ok, now it’s time to move to MonkeyLearn. We want to create a text classifier that classifies reviews into two possible categories, ‘Good’ or ‘Bad’, so we can programmatically detect whether a review talks about a hotel in a positive or a negative way. This process is known as sentiment analysis: extracting the mood from a text.

First you have to sign up for MonkeyLearn. After you log in, you will see the main dashboard. MonkeyLearn has public modules created by the community and trained for specific tasks, but it also allows you to create custom modules to fit your needs. In our case, we will build a custom text classifier, so click the Create Module button.


Now we will be prompted to fill in the initial settings. First we name and describe our module. Then we set the permissions to Public and the Module Type to Classifier, since we are building a text classifier. When you are finished, click Next.


Next we will be asked for some information about our module so it can be fitted to our needs. In ‘What are you working on?’ choose Web scraping, and in ‘What are you going to do?’ choose Sentiment analysis.


Finally, we have to give some information about the content itself. In ‘Kind of text’ select Comments or reviews, and in ‘Text language’ select English, since that’s the language of the reviews we are working with.


After clicking Create, we will be in the module detail page.

5. Feeding the classifier with our data

Now it’s time to feed the model with the data we just scraped. In the Sandbox tab, go to the Samples menu and select Upload.


You’ll be asked for the data source type; select CSV and click Next.


Then select the CSV file we created with the scraped data.


Finally, you are prompted to provide some information about the training data you are about to upload. For the content column, select Use as text, and for the category column, select Use as category. Afterwards, click Upload.


After the upload finishes, MonkeyLearn will create the corresponding category tree. Go to the Sandbox tab; to the left you will see the category tree with three nodes: Root (the starting point) and our two sentiment categories, ‘Good’ and ‘Bad’.


6. Training the classifier

Ok, now an easy step: training the machine learning algorithm. This just involves hitting the Train button at the top right of the screen. You will see a progress bar while the machine learning algorithms train the model in MonkeyLearn’s cloud. It will take anywhere from a few seconds to a few minutes, depending on the complexity and size of your category tree and samples.

After training finishes, the module state will change to TRAINED, and you’ll get some statistics that show how well the module predicts the correct category (in our case, the sentiment).


If you select one of the categories on the tree, you will get stats about that category in particular.


The metrics are Accuracy, Precision and Recall; these measurements are commonly used in machine learning to evaluate the performance of classification algorithms.

You can also see a keyword cloud at the bottom that shows some of the terms used to characterize the samples and predict the sentiment of a text. As you can see, they are terms semantically associated with positive and negative expressions about hotel features. These terms are obtained automatically with statistical algorithms within MonkeyLearn.

You can learn more about the stats here.

Finally, if you select Parameters, you can see the parameters the module is trained with. These were set automatically according to our choices in step 4 and should be a good fit for our classifier, but you can try changing them to see if you get better results.


Here you can set the language, N-gram range, algorithm used, and more. If you wish to learn more, check out this article.

If you want to look at the finished classifier, we created a public version of the hotel sentiment analysis classifier.

7. Testing our classifier

And voilà: we have our sentiment analysis classifier, built with just a few lines of code. You can check it out here.

We can test the model directly from the GUI within MonkeyLearn: go to the Classify tab, write or paste a text, submit it, and you’ll get the prediction.


The result mirrors what a call to the classification endpoint of MonkeyLearn’s API would return. The important part is the “result” entry, which shows the predicted label, in this case ‘Good’, and the corresponding probability: 0.98. In our case the label will always be ‘Good’ or ‘Bad’, and the probability is a real number between 0 and 1, where 1 means the classifier is 100% sure about the prediction.

Your classifier may still make some errors, that is, classify good reviews as bad and vice versa. The good news is that you can keep improving it: if you gather more training samples with tools like Scrapy (in our example, by getting reviews from more cities), you can upload them to the classifier, retrain, and improve the results. You can also try different parameter configurations and retrain the algorithm.

8. Integrating the module with MonkeyLearn’s API

You can do the same programmatically, which makes it easy to integrate any MonkeyLearn module into your projects in any programming language. For example, if you are working with Python, go to the API libraries section, select the corresponding language, and copy and paste the code snippet:
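As a hedged sketch of what that snippet does under the hood, here is how a classify request could be assembled with Python's standard library. The endpoint shape, module id and API key below are placeholders, so check the current API reference; the official client library wraps all of this for you:

```python
import json
from urllib import request

API_KEY = "YOUR_API_KEY"   # placeholder: your MonkeyLearn API token
MODULE_ID = "cl_XXXXXXXX"  # placeholder: the classifier's module id

def build_classify_request(texts):
    """Build (but do not send) a classify request for the module.

    The /classifiers/<id>/classify/ route mirrors MonkeyLearn's REST API
    at the time of writing; verify it against the current reference.
    """
    url = "https://api.monkeylearn.com/v2/classifiers/%s/classify/" % MODULE_ID
    payload = json.dumps({"text_list": texts}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={
            "Authorization": "Token " + API_KEY,
            "Content-Type": "application/json",
        },
    )

req = build_classify_request(["The room was spotless and the staff were lovely."])
# request.urlopen(req) would return the JSON predictions once a real
# API key and module id are in place.
```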



Conclusion

In this tutorial we learned how to use Scrapy and MonkeyLearn to train a machine learning model that can analyze millions of reviews and predict their sentiment. With just a few lines of code, we can easily understand how customers feel about hotels in New York. Do they like the rooms? Do they hate the service? How do they compare to hotels in San Francisco?

Got any cool ideas on how to use Scrapy and MonkeyLearn? Share them with us in the comments.