Training samples, also known as datasets, give the classifier the information it needs to learn to associate texts with their corresponding categories. This is how we train or “teach” our classifiers. From the samples, the machine learning model automatically learns to generalize “rules” to classify new unlabeled texts.
In MonkeyLearn, training samples are simply a set of plain text files that can be extracted from webpages, books, articles, news, tweets, reviews, etc. Usually, the data is in one of the following states:
- Tagged data: for each text you have the corresponding tag/category.
- Untagged data: you just have a bunch of texts without any tag or category.
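As a rough illustration of the two states above, the following Python sketch represents each sample as a (text, tag) pair and separates tagged from untagged data. The sample texts, the category names, and the `split_by_state` helper are hypothetical and only for illustration; they are not part of MonkeyLearn.

```python
from collections import Counter

# Hypothetical samples: each one is a (text, tag) pair.
# The tag is None when the text has not been tagged yet.
samples = [
    ("The battery lasts two days on a single charge.", "Electronics"),
    ("Great pasta and friendly staff.", "Restaurants"),
    ("Arrived late and the box was damaged.", None),
    ("The screen is sharp and bright.", "Electronics"),
]


def split_by_state(samples):
    """Separate samples into a tagged list and an untagged list."""
    tagged = [(text, tag) for text, tag in samples if tag is not None]
    untagged = [text for text, tag in samples if tag is None]
    return tagged, untagged


tagged, untagged = split_by_state(samples)

# Count how many tagged samples each category has.
tag_counts = Counter(tag for _, tag in tagged)

print(len(tagged), "tagged,", len(untagged), "untagged")
print(dict(tag_counts))
```

Counting samples per category is a quick way to spot categories that have too few examples before training.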
Your training data should be representative of the future texts that you want to classify.
The first thing to ask is whether you already have data that you could use as a training set. Usually you’ll be in one of the following situations:
- You have tagged data. That’s great; after getting the data, take a look at it and do the necessary curation and cleaning. See data curation and cleaning tools.
- You have untagged data. In that case, you should use tools to curate and tag the data. See tagging data for more information.
- You don’t have data at all. In that case, you have to create the training dataset from scratch (gather and tag a training set). The following are some tips for gathering the data first.
Tips to Gather Data
- Use data that you already have in your own database.
- Automatically / programmatically get data from the web, by using tools like:
  - Web scraping tools
  - APIs provided by sites and companies (some free and some commercial)
- Manually / semi-automatically:
  - Doing manual searches within the corresponding website’s search bar using keywords related to each category, and manually copying and pasting the title and content.
  - Doing a manual search in Google with related keywords, restricted to a particular domain.
- Any particular tool or technique that you are familiar with.
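To make the programmatic route more concrete, here is a minimal sketch that extracts a title and content from an HTML page using only Python's standard library `html.parser` module. It assumes the page's HTML has already been downloaded into a string and that the content of interest lives in `<p>` elements. The `TitleAndTextParser` class, the `extract_sample` helper, and the example page are hypothetical; a real site will likely need its own extraction rules or a dedicated scraping tool.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buffer = []

    def _flush_paragraph(self):
        # Join the collected pieces and normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            if self._in_p:
                # Previous <p> was never closed; keep its text anyway.
                self._flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            self._flush_paragraph()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buffer.append(data)


def extract_sample(html):
    """Return the title and paragraph text of an HTML page as a dict."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "content": "\n".join(parser.paragraphs),
    }


# Hypothetical page, standing in for HTML fetched from a website.
page = """
<html><head><title>Phone review</title></head>
<body><p>The battery lasts <b>two days</b>.</p><p>The camera is average.</p></body></html>
"""

sample = extract_sample(page)
print(sample["title"])
print(sample["content"])
```

Each extracted sample can then be saved as a plain text file, one file per page, ready to be curated and tagged.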
We have looked at some tips for gathering a training set. If the data is not tagged, the next step is to get a brief idea of how to tag the training set.