Companies can use data from nearly endless sources – internal information, customer service interactions, and all over the internet – to help inform their choices and improve their business.
But you can’t simply take raw data and run it through machine learning and analytics programs right away. You first need to preprocess your data, so it can be successfully “read” or understood by machines.
In this guide, learn what data preprocessing is, why it’s an essential step in data mining, and how to go about it.
Let’s get started.
Data preprocessing is a step in the data mining and data analysis process that takes raw data and transforms it into a format that can be understood and analyzed by computers and machine learning models.
Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it contain errors and inconsistencies, but it is often incomplete, and doesn’t have a regular, uniform design.
Machines like to process nice and tidy information – they read data as 1s and 0s. So calculating structured data, like whole numbers and percentages, is easy. However, unstructured data, in the form of text and images, must first be cleaned and formatted before analysis.
When using data sets to train machine learning models, you’ll often hear the phrase “garbage in, garbage out.” This means that if you use bad or “dirty” data to train your model, you’ll end up with a bad, improperly trained model that won’t actually be relevant to your analysis.
Good, preprocessed data is even more important than the most powerful algorithms: machine learning models trained with bad data can actively harm the analysis you’re trying to do, giving you “garbage” results.
Depending on your data gathering techniques and sources, you may end up with data that’s out of range or includes an incorrect feature, like household income below zero or an image from a set of “zoo animals” that is actually a tree. Your set could have missing values or fields. Or text data, for example, will often have misspelled words and irrelevant symbols, URLs, etc.
When you properly preprocess and clean your data, you’ll set yourself up for much more accurate downstream processes. We often hear about the importance of “data-driven decision making,” but if these decisions are driven by bad data, they’re simply bad decisions.
Data sets can be described by the “features” that make them up, such as size, location, age, time, or color. Features appear as columns in datasets and are also known as attributes, variables, fields, or characteristics.
Wikipedia describes a machine learning data feature as “an individual measurable property or characteristic of a phenomenon being observed.”
It’s important to understand what “features” are when preprocessing your data because you’ll need to choose which ones to focus on depending on what your business goals are. Later, we’ll explain how you can improve the quality of your dataset’s features and the insights you gain with processes like feature selection.
First, let’s go over the two different types of features that are used to describe data: categorical and numerical. Categorical features are qualitative and take one of a limited set of values (like color, country, or product category), while numerical features are quantitative measurements on a scale (like age, income, or word count).
The diagram below shows how features are used to train machine learning text analysis models. Text is run through a feature extractor (to pull out or highlight words or phrases) and these pieces of text are classified or tagged by their features. Once the model is properly trained, text can be run through it, and it will make predictions on the features of the text or “tag” the text itself.
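As a minimal sketch of what a feature extractor does (assuming a simple bag-of-words approach, which is just one of many possibilities), here is a toy version in plain Python:

```python
import re
from collections import Counter

def extract_features(text):
    """Toy feature extractor: bag-of-words counts from raw text."""
    tokens = re.findall(r"[a-z']+", text.lower())  # lowercase word tokens
    return Counter(tokens)                         # word -> count features

features = extract_features("Great product, great support!")
# Each word becomes a feature; its count becomes the feature's value.
```

Real text analysis models use far richer features (n-grams, embeddings, etc.), but the idea is the same: raw text becomes measurable properties a model can learn from.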
Let’s take a look at the established steps you’ll need to go through to make sure your data is successfully preprocessed.
Take a good look at your data and get an idea of its overall quality, relevance to your project, and consistency. There are a number of data anomalies and inherent problems to look out for in almost any data set: for example, missing values, mismatched data types, out-of-range values or outliers, duplicate records, and inconsistent formatting.
Data cleaning is the process of adding missing data and correcting, repairing, or removing incorrect or irrelevant data from a data set. Data cleaning is the most important step of preprocessing because it will ensure that your data is ready to go for your downstream needs.
Data cleaning will correct all of the inconsistent data you uncovered in your data quality assessment. Depending on the kind of data you’re working with, there are a number of cleaning techniques you’ll need to run your data through.
There are a number of ways to correct for missing data, but the two most common are dropping the affected records entirely, or imputing the missing values, for example with the mean, median, or most frequent value of that feature.
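Both approaches can be sketched in plain Python (a simplified illustration with a single made-up column of ages; libraries like pandas offer the same operations on full tables):

```python
ages = [57, None, 49, None, 60]  # None marks a missing value

# Option 1: drop records with missing values entirely
dropped = [a for a in ages if a is not None]

# Option 2: impute missing values, here with the mean of the observed values
mean_age = sum(dropped) / len(dropped)
imputed = [a if a is not None else mean_age for a in ages]
```

Dropping is safest when only a few records are affected; imputing preserves your sample size but introduces estimated values, so use it with care.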
Data cleaning also includes fixing “noisy” data. This is data that includes unnecessary data points, irrelevant data, and data that’s more difficult to group together.
If you’re working with text data, for example, some things you should consider when cleaning your data are removing URLs and irrelevant symbols, correcting misspelled words, normalizing case, and stripping out words that add no meaning to your analysis.
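A minimal text-cleaning pass along those lines might look like this (a sketch using simple regular expressions; production pipelines typically do much more, like spell correction and stop-word removal):

```python
import re

def clean_text(text):
    text = text.lower()                        # normalize case
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # strip symbols and digits
    return " ".join(text.split())              # collapse extra whitespace

cleaned = clean_text("Check out https://example.com!!! GREAT deal #1")
# -> "check out great deal"
```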
After data cleaning, you may realize you have insufficient data for the task at hand. At this point you can also perform data wrangling or data enrichment to add new data sets and run them through quality assessment and cleaning again before adding them to your original data.
With data cleaning, we’ve already begun to modify our data, but data transformation will begin the process of turning the data into the proper format(s) you’ll need for analysis and other downstream processes.
This generally happens through one or more of the following techniques: aggregation (combining values into a summary), normalization (scaling values to a common range), and discretization (grouping continuous values into intervals).
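One common transformation, min-max normalization, rescales a numerical feature into the range [0, 1] so that features measured on very different scales can be compared. A sketch on a made-up age column:

```python
ages = [49, 57, 57, 60]

# Min-max normalization: (value - min) / (max - min)
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]
# The smallest age maps to 0.0, the largest to 1.0
```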
The more data you’re working with, the harder it will be to analyze, even after cleaning and transforming it. Depending on your task at hand, you may actually have more data than you need. Especially when working with text analysis, much of regular human speech is superfluous or irrelevant to the needs of the researcher. Data reduction not only makes the analysis easier and more accurate, but cuts down on data storage.
It will also help identify the most important features to the process at hand.
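One simple reduction technique is to drop features whose values barely vary, since they carry little information for the model. A sketch, using hypothetical feature columns (the real selection criteria depend on your task):

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical feature columns; "country" is constant and carries no signal
features = {
    "age": [49, 57, 57, 60],
    "country": [1, 1, 1, 1],
}

# Keep only features whose values actually vary
selected = {name: col for name, col in features.items() if variance(col) > 0.0}
```

More sophisticated approaches (correlation analysis, dimensionality reduction like PCA) follow the same logic: keep the features that matter, discard the rest.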
Take a look at the table below to see how preprocessing works. In this example, we have three variables: name, age, and company. In the first table we can tell that rows 2 and 3 (Elon Musk and Jeff Bezos) have been assigned each other’s companies.
| Name | Age | Company |
| --- | --- | --- |
| Karen Lynch | 57 | CVS Health |
| Elon Musk | 49 | Amazon |
| Jeff Bezos | 57 | Tesla |
| Tim Cook | 60 | Apple |
We can use data cleaning to simply remove these rows, as we know the data was improperly entered or is otherwise corrupted.
| Name | Age | Company |
| --- | --- | --- |
| Karen Lynch | 57 | CVS Health |
| Tim Cook | 60 | Apple |
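In code, that removal step might look like this (a sketch with the table held as a list of dictionaries; `known_bad` stands in for whatever your quality assessment flagged):

```python
records = [
    {"name": "Karen Lynch", "age": 57, "company": "CVS Health"},
    {"name": "Elon Musk",   "age": 49, "company": "Amazon"},
    {"name": "Jeff Bezos",  "age": 57, "company": "Tesla"},
    {"name": "Tim Cook",    "age": 60, "company": "Apple"},
]

# Rows our quality assessment identified as mis-entered
known_bad = {"Elon Musk", "Jeff Bezos"}
cleaned = [r for r in records if r["name"] not in known_bad]
```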
Or, we can perform data transformation (in this case, manually) to fix the problem:
| Name | Age | Company |
| --- | --- | --- |
| Karen Lynch | 57 | CVS Health |
| Elon Musk | 49 | Tesla |
| Jeff Bezos | 57 | Amazon |
| Tim Cook | 60 | Apple |
Once the issue is fixed, we can perform data reduction, in this case by sorting in descending order of age, to choose which age range we want to focus on:
| Name | Age | Company |
| --- | --- | --- |
| Tim Cook | 60 | Apple |
| Karen Lynch | 57 | CVS Health |
| Jeff Bezos | 57 | Amazon |
| Elon Musk | 49 | Tesla |
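The transformation and reduction steps together can be sketched in code (again holding the table as a list of dictionaries; the `corrections` mapping is just this example's known fix):

```python
records = [
    {"name": "Karen Lynch", "age": 57, "company": "CVS Health"},
    {"name": "Elon Musk",   "age": 49, "company": "Amazon"},
    {"name": "Jeff Bezos",  "age": 57, "company": "Tesla"},
    {"name": "Tim Cook",    "age": 60, "company": "Apple"},
]

# Transformation: fix the swapped company values
corrections = {"Elon Musk": "Tesla", "Jeff Bezos": "Amazon"}
for r in records:
    r["company"] = corrections.get(r["name"], r["company"])

# Reduction: sort by descending age to focus on an age range
reduced = sorted(records, key=lambda r: r["age"], reverse=True)
```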
Good data-driven decision making requires good, prepared data. Once you’ve decided on the analysis you need to do and where to find the data you need, just follow the steps above and your data will be all set for any number of downstream processes.
Data preprocessing can be a tedious task, for sure, but once you have your methods and procedures set up, you’ll reap the benefits down the line.
Once your data has been processed, you can plug it into tools like MonkeyLearn – a SaaS machine learning platform with text analysis techniques like sentiment analysis (to automatically read text for opinion polarity), keyword extraction (to find the most used and most important words in a text) and intent classification (to read emails and other texts for the intent of the writer).
Take a look at MonkeyLearn to see what these powerful tools (and more) can do for customer feedback and other relevant text from internal systems and all over the web.
May 24th, 2021