Training samples give the classifier the information it needs to learn to associate texts with their corresponding categories. This is how we train, or “teach”, our classifiers. From these examples, the machine learning model automatically learns general “rules” that it can use to classify new, unlabeled texts.
In MonkeyLearn, training samples are simply a set of plain text examples that can be extracted from webpages, articles, news, tweets, reviews, emails, chat conversations, etc. These texts should be representative of the texts you will want to classify in the future. Training data usually comes in one of the following states:
- Tagged data: for each text you have the corresponding tag/category.
- Untagged data: you just have a bunch of texts without any tag or category.
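To make the two states concrete, here is a minimal Python sketch of how such samples might be held in memory before uploading them. The `Sample` type, the tag names, and the example texts are hypothetical illustrations, not part of MonkeyLearn's API.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    """One training sample: a plain text plus an optional tag (hypothetical type)."""
    text: str
    tag: Optional[str] = None  # None means the text has not been tagged yet


# Illustrative texts and tags only; real samples come from your own sources.
samples = [
    Sample("The battery lasts two full days.", "Positive"),
    Sample("Support never answered my email.", "Negative"),
    Sample("Arrived on Tuesday."),  # untagged
]

# Split the set into the two states described above.
tagged = [s for s in samples if s.tag is not None]
untagged = [s for s in samples if s.tag is None]
```

Keeping both kinds of sample in one structure makes it easy to see how much tagging work is still pending.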
With this in mind, you’ll usually be in one of the following situations:
- You have tagged data. That’s great! After getting the data, we recommend taking a look to check that the tagging is correct. If it is, you are ready to add these samples to MonkeyLearn and train your classifier!
- You have untagged data. In that case, you should use tools to curate and tag the data. See tagging data for more information.
- You don’t have data at all. In that case, you have to create the training dataset from scratch (gather and tag a training set). The Tips to Gather Data section below covers the first step, gathering the data.
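For the first situation, checking that existing tagging is correct can be partly automated. The sketch below assumes each sample is a `(text, tag)` pair with exactly one tag per text. The `review_tagging` helper is hypothetical and only flags candidates for a manual look; it does not decide which tag is right.

```python
from collections import Counter, defaultdict


def review_tagging(samples, allowed_tags):
    """Summarize a list of (text, tag) pairs to help spot tagging problems.

    Returns the number of samples per tag, the samples whose tag is not in
    allowed_tags (often typos), and texts that appear with more than one tag.
    """
    counts = Counter(tag for _, tag in samples)
    unknown = [(text, tag) for text, tag in samples if tag not in allowed_tags]

    # Group tags by normalized text so near-identical duplicates are compared.
    tags_by_text = defaultdict(set)
    for text, tag in samples:
        tags_by_text[text.strip().lower()].add(tag)
    conflicts = {
        text: sorted(tags) for text, tags in tags_by_text.items() if len(tags) > 1
    }
    return counts, unknown, conflicts
```

A very uneven count per tag, an unknown tag, or a conflicting duplicate is a hint that the data deserves a closer look before training.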
Tips to Gather Data
- Use data that you already have in your own database.
- Automatically or programmatically get data from the web, using:
  - Web scraping tools.
  - APIs provided by sites and companies (some free, some commercial).
- Manually or semi-automatically:
  - Do manual searches in the corresponding website’s search bar using keywords related to each category, then copy and paste the title and content.
  - Do a manual search in Google with related keywords, restricting the results to a particular domain.
- Any particular tool or technique that you are familiar with.
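Two of these tips can be sketched with Python's standard library alone. The first helper builds a Google search URL restricted to one domain using the `site:` operator. The second pulls the title and paragraph text out of a page's HTML, which is the programmatic version of copying and pasting the title and content. The function and class names are hypothetical, the code assumes the content lives in `<title>` and `<p>` elements, and it assumes you have already downloaded the HTML in a way the site's terms allow.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def site_search_url(keywords, domain):
    """Build a Google search URL limited to one domain via the site: operator."""
    query = f"{keywords} site:{domain}"
    return "https://www.google.com/search?" + urlencode({"q": query})


class TitleAndParagraphs(HTMLParser):
    """Collect the text of the <title> element and of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # "title" or "p" while inside one of them

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Text inside inline tags such as <b> is still attributed to the
        # enclosing <p>, because _current only changes on title/p tags.
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract_sample(html):
    """Return (title, body_text) for one page of HTML."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    body = " ".join(p.strip() for p in parser.paragraphs if p.strip())
    return parser.title.strip(), body
```

For example, `site_search_url("laptop review", "example.com")` gives a URL you can open in a browser, and each page you keep can be passed through `extract_sample` to get a plain text ready for tagging. Real pages are often messier than this assumes, so a dedicated scraping tool may still be the better choice.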
We have looked at some tips for gathering a training set. If the data is not tagged, the next step is to get a brief idea of how to tag the training set.