In this article, we’ll explore what data preprocessing is, why it’s important, and how to clean, transform, integrate and reduce our data.
Why Is Data Preprocessing Needed?
Data preprocessing is a fundamental step in data analysis and machine learning. It’s an intricate process that sets the stage for the success of any data-driven endeavor.
At its core, data preprocessing encompasses an array of techniques to transform raw, unrefined data into a structured and coherent format ripe for insightful analysis and modeling.
This vital preparatory phase is the backbone for extracting valuable knowledge and wisdom from data, empowering decision-making and predictive modeling across diverse domains.
The need for data preprocessing arises from real-world data’s inherent imperfections and complexities. Often acquired from different sources, raw data tends to be riddled with missing values, outliers, inconsistencies, and noise. These flaws can obstruct the analytical process, endangering the reliability and accuracy of the conclusions drawn. Moreover, data collected from various channels may vary in scales, units, and formats, making direct comparisons arduous and potentially misleading.
Data preprocessing typically involves several steps, including data cleaning, data transformation, data integration, and data reduction. We’ll explore each of these in turn below.
Data Cleaning
Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in the data. Some standard techniques used in data cleaning include:
- handling missing values
- handling duplicates
- handling outliers
Let’s discuss each of these data-cleaning techniques in turn.
Handling missing values
Handling missing values is an essential part of data preprocessing. We’ll discuss three standard approaches: removing observations (rows) with missing values, imputing missing values with statistical measures, and imputing missing values with machine learning algorithms.
We’ll demonstrate each technique with a custom dataset and explain its output.
Dropping observations with missing values
The simplest way to deal with missing values is to drop the rows that contain them. This method usually isn’t recommended, as it can shrink our dataset and remove rows containing essential data.
Let’s understand this method with the help of an example. We create a custom dataset with age, income, and education data, and introduce missing values by setting some entries to NaN (not a number). NaN is a special floating-point value that indicates an invalid or undefined result. The observations containing NaN will be dropped with the help of the dropna() function from the Pandas library:
import pandas as pd
import numpy as np

# custom dataset with some missing (NaN) values
data = pd.DataFrame({'age': [20, 25, np.nan, 35, 40, np.nan],
                     'income': [50000, np.nan, 70000, np.nan, 90000, 100000],
                     'education': ['Bachelor', np.nan, 'PhD', 'Bachelor', 'Master', np.nan]})

# drop every row that contains at least one missing value
data_cleaned = data.dropna(axis=0)

print("Original dataset:")
print(data)
print("\nCleaned dataset:")
print(data_cleaned)
The output of the above code is given below. Note that the console output isn’t actually produced in a bordered table format; we present it as a table here to make it easier to read.
Original dataset
age | income | education |
---|---|---|
20 | 50000 | Bachelor |
25 | NaN | NaN |
NaN | 70000 | PhD |
35 | NaN | Bachelor |
40 | 90000 | Master |
NaN | 100000 | NaN |
Cleaned dataset
age | income | education |
---|---|---|
20 | 50000 | Bachelor |
40 | 90000 | Master |
The observations with missing values are removed, so only the rows without missing values are kept. You’ll find that only rows 0 and 4 remain in the cleaned dataset.
Dropping rows or columns with missing values can significantly reduce the number of observations in our dataset. This may affect the accuracy and generalization of our machine-learning model. Therefore, we should use this approach cautiously and only when we have a large enough dataset or when the missing values aren’t essential for analysis.
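If only some of the columns are essential to the analysis, a middle ground is to drop a row only when those particular columns are missing. Here’s a minimal sketch using the subset parameter of dropna(); the choice of essential columns is an assumption made purely for illustration:
import pandas as pd
import numpy as np

data = pd.DataFrame({'age': [20, 25, np.nan, 35],
                     'income': [50000, np.nan, 70000, np.nan],
                     'education': ['Bachelor', np.nan, 'PhD', 'Bachelor']})

# drop a row only when age or education (the columns we treat as essential) is missing
data_cleaned = data.dropna(subset=['age', 'education'])
print(data_cleaned)
This keeps rows whose only gaps are in columns we don’t plan to use.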
Imputing missing values with statistical measures
This is a more sophisticated approach than simply dropping rows: it replaces missing values with a statistic such as the mean, median, mode, or a constant value.
This time, we create a custom dataset with age, income, gender, and marital_status data containing some missing (NaN) values. We then impute the numeric columns with their medians and the categorical columns with their modes, using the fillna() function from the Pandas library:
import pandas as pd
import numpy as np

# custom dataset with numeric and categorical columns
data = pd.DataFrame({'age': [20, 25, 30, 35, np.nan, 45],
                     'income': [50000, np.nan, 70000, np.nan, 90000, 100000],
                     'gender': ['M', 'F', 'F', 'M', 'M', np.nan],
                     'marital_status': ['Single', 'Married', np.nan, 'Married', 'Single', 'Single']})

# impute numeric columns with their median, then categorical columns with their mode
data_imputed = data.fillna(data.median(numeric_only=True))
data_imputed = data_imputed.fillna(data.mode().iloc[0])

print("Original dataset:")
print(data)
print("\nImputed dataset:")
print(data_imputed)
The output of the above code in table form is shown below.
Original dataset
age | income | gender | marital_status |
---|---|---|---|
20 | 50000 | M | Single |
25 | NaN | F | Married |
30 | 70000 | F | NaN |
35 | NaN | M | Married |
NaN | 90000 | M | Single |
45 | 100000 | NaN | Single |
Imputed dataset
age | income | gender | marital_status |
---|---|---|---|
20 | 50000 | M | Single |
25 | 80000 | F | Married |
30 | 70000 | F | Single |
35 | 80000 | M | Married |
30 | 90000 | M | Single |
45 | 100000 | M | Single |
In the imputed dataset, the missing values in the numeric age and income columns are replaced with their column medians, while the missing values in the categorical gender and marital_status columns are replaced with their most frequent values (modes).
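As an alternative, Scikit-learn’s SimpleImputer class offers the same statistical strategies (mean, median, most frequent, or a constant) as a reusable object that can later be applied to new data. Here’s a minimal sketch, with the split between numeric and categorical columns chosen purely for illustration:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

data = pd.DataFrame({'age': [20, 25, 30, 35, np.nan, 45],
                     'income': [50000, np.nan, 70000, np.nan, 90000, 100000],
                     'gender': ['M', 'F', 'F', 'M', 'M', np.nan]})

# median for the numeric columns, most frequent value for the categorical column
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

data[['age', 'income']] = num_imputer.fit_transform(data[['age', 'income']])
data[['gender']] = cat_imputer.fit_transform(data[['gender']])
print(data)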
Imputing missing values with machine learning algorithms
Machine-learning algorithms provide a sophisticated way to deal with missing values, basing the imputation on the other features of our data. For example, the KNNImputer class from the Scikit-learn library is a powerful way to impute missing values. Let’s understand this with the help of a code example:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'age': [25, 30, np.nan, 40, 45],
                   'gender': ['F', 'M', 'M', np.nan, 'F'],
                   'salary': [5000, 6000, 7000, 8000, np.nan]})

print('Original Dataset')
print(df)

# KNNImputer works on numeric data, so encode gender as 0/1 first
df['gender'] = df['gender'].map({'F': 0, 'M': 1})

# impute each missing value from that row's nearest neighbors
imputer = KNNImputer()
df_imputed = imputer.fit_transform(df[['age', 'gender', 'salary']])
df_imputed = pd.DataFrame(df_imputed, columns=['age', 'gender', 'salary'])
df_imputed['name'] = df['name']

print('Dataset after imputing with KNNImputer')
print(df_imputed)
The output of this code is shown below.
Original Dataset
name | age | gender | salary |
---|---|---|---|
Alice | 25.0 | F | 5000.0 |
Bob | 30.0 | M | 6000.0 |
Charlie | NaN | M | 7000.0 |
David | 40.0 | NaN | 8000.0 |
Eve | 45.0 | F | NaN |
Dataset after imputing with KNNImputer
age | gender | salary | name |
---|---|---|---|
25.0 | 0.0 | 5000.0 | Alice |
30.0 | 1.0 | 6000.0 | Bob |
35.0 | 1.0 | 7000.0 | Charlie |
40.0 | 0.5 | 8000.0 | David |
45.0 | 0.0 | 6500.0 | Eve |
The above example demonstrates that imputing missing values with machine learning can produce more realistic values than imputing with simple statistics, as it considers the relationships between the features and the missing values. Note that David’s imputed gender is 0.5, the average of his neighbors’ encoded genders; for a categorical feature like this, we’d typically round the result back to a valid category. This approach can also be more computationally expensive and complex than imputing with statistics, as it requires choosing and tuning a suitable machine learning algorithm and its parameters. Therefore, we should use this approach when we have sufficient data and the missing values aren’t random or trivial for our analysis.
It’s important to note that many machine-learning algorithms can handle missing values internally. XGBoost, LightGBM, and CatBoost are good examples of algorithms that support missing values, for example by learning which branch of a tree split missing values should follow rather than requiring imputation up front. However, this approach doesn’t work well on all types of data, and it can introduce bias and noise into our model.
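As a rough sketch of this behavior (assuming the xgboost package is installed, and using a made-up binary target purely for illustration), XGBoost can be trained directly on data that still contains NaN values:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# small dataset with missing values left as NaN, plus a made-up binary target
X = pd.DataFrame({'age': [25, 30, np.nan, 40, 45, 50],
                  'income': [50000, np.nan, 70000, 80000, np.nan, 100000]})
y = [0, 0, 1, 1, 1, 0]

# no imputation step: the model decides how to route missing values at each split
model = XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))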
Handling duplicates
We often have to deal with duplicate rows, such as rows with the same values in every column. Handling duplicates involves identifying and removing these repeated rows from the dataset.
Here, the duplicated() and drop_duplicates() functions can help us. The duplicated() function finds the duplicated rows in the data, while the drop_duplicates() function removes them. This technique can also lead to the removal of important data, so it’s important to analyze the data before applying it:
import pandas as pd

data = pd.DataFrame({'name': ['John', 'Emily', 'Peter', 'John', 'Emily'],
                     'age': [20, 25, 30, 20, 25],
                     'income': [50000, 60000, 70000, 50000, 60000]})

# flag rows that repeat an earlier row, then drop them
duplicates = data[data.duplicated()]
data_deduplicated = data.drop_duplicates()

print("Original dataset:")
print(data)
print("\nDuplicate rows:")
print(duplicates)
print("\nDeduplicated dataset:")
print(data_deduplicated)
The output of the above code is shown below.
Original dataset
name | age | income |
---|---|---|
John | 20 | 50000 |
Emily | 25 | 60000 |
Peter | 30 | 70000 |
John | 20 | 50000 |
Emily | 25 | 60000 |
Duplicate rows
name | age | income |
---|---|---|
John | 20 | 50000 |
Emily | 25 | 60000 |
Deduplicated dataset
name | age | income |
---|---|---|
John | 20 | 50000 |
Emily | 25 | 60000 |
Peter | 30 | 70000 |
Rows that repeat the same name, age, and income values are flagged as duplicates and removed, so the deduplicated dataset contains only the unique rows.
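If duplicates should be judged on only some columns rather than entire rows, drop_duplicates() also accepts subset and keep parameters. A minimal sketch, with the column choice made purely for illustration:
import pandas as pd

data = pd.DataFrame({'name': ['John', 'Emily', 'John'],
                     'age': [20, 25, 21],
                     'income': [50000, 60000, 52000]})

# treat rows as duplicates when the name repeats, keeping the last occurrence
data_deduplicated = data.drop_duplicates(subset=['name'], keep='last')
print(data_deduplicated)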
Handling outliers
In real-world data analysis, we often come across data with outliers. Outliers are unusually small or large values that deviate significantly from the other observations in a dataset. Such outliers are first identified and then either removed or handled by transforming the data. Let’s look at each of these steps.
Identifying outliers
As we’ve already seen, the first step is to identify the outliers in our dataset. Various statistical techniques can be used for this, such as the interquartile range (IQR), z-score, or Tukey methods.
We’ll mainly look at z-score. It’s a common technique for the identification of outliers in the dataset.
The z-score measures how many standard deviations an observation is from the mean of the dataset. The formula for calculating the z-score of an observation is this:
z = (observation - mean) / standard deviation
The threshold for the z-score method is typically chosen based on the level of significance or the desired level of confidence in identifying outliers. A commonly used threshold is a z-score of 3, meaning any observation with a z-score greater than 3 or less than -3 is considered an outlier.
Removing outliers
Once the outliers are identified, they can be removed from the dataset by trimming, which means dropping the observations with extreme values. However, it’s important to carefully analyze the dataset and determine the appropriate technique for handling outliers.
Transforming the data
Alternatively, the data can be transformed using mathematical functions such as logarithmic, square root, or inverse functions to reduce the impact of outliers on the analysis. The example below puts the identification and removal steps together using the z-score method:
import pandas as pd
import numpy as np

# custom dataset with an extreme value (200) in the age column
data = pd.DataFrame({'age': [20, 25, 30, 35, 40, 200],
                     'income': [50000, 60000, 70000, 80000, 90000, 100000]})

mean = data.mean()
std_dev = data.std()

# with only six rows, the extreme value inflates the standard deviation,
# so we use a lower cut-off than the usual 3
threshold = 2

z_scores = ((data - mean) / std_dev).abs()

# flag a row as an outlier if any of its columns exceeds the threshold
outlier_mask = (z_scores > threshold).any(axis=1)
outliers = data[outlier_mask]
data_without_outliers = data[~outlier_mask]

print("Original dataset:")
print(data)
print("\nOutliers:")
print(outliers)
print("\nDataset without outliers:")
print(data_without_outliers)
In this example, we’ve created a custom dataset with an outlier in the age column, and we apply the outlier-handling technique to identify and remove it. We first calculate the mean and standard deviation of the data, then identify the outliers using the z-score method: the z-score is calculated for each observation, and any observation with a z-score greater than the threshold is considered an outlier. Because this small dataset has only six rows, the extreme age value inflates the standard deviation, so we use a threshold of 2 rather than the usual 3. Finally, we remove the rows flagged as outliers from the dataset.
The output of the above code in table form is shown below.
Original dataset
age | income |
---|---|
20 | 50000 |
25 | 60000 |
30 | 70000 |
35 | 80000 |
40 | 90000 |
200 | 100000 |
Outliers
age | income |
---|---|
200 | 100000 |
Dataset without outliers
age | income |
---|---|
20 | 50000 |
25 | 60000 |
30 | 70000 |
35 | 80000 |
40 | 90000 |
The row containing the outlier (age 200) is removed from the original dataset, leaving the dataset without outliers shown above.
Data Transformation
Data transformation is another data preprocessing method used to improve data quality by modifying the data. It involves converting the raw data into a format more suitable for analysis by adjusting the data’s scale, distribution, or format. Two common transformation techniques are described below.
- Log transformation is used to reduce the impact of outliers and to make skewed data (data whose distribution is asymmetric, with a long tail on one side) closer to a normal distribution. It’s a widely used transformation technique that involves taking the natural logarithm of the data.
- Square root transformation is another technique to transform skewed data into a normal distribution. It involves taking the square root of the data, which can help reduce the impact of outliers and improve the data distribution.
Let’s look at an example:
import pandas as pd
import numpy as np

data = pd.DataFrame({'age': [20, 25, 30, 35, 40, 45],
                     'income': [50000, 60000, 70000, 80000, 90000, 100000],
                     'spending': [1, 4, 9, 16, 25, 36]})

# take the square root of the skewed spending column
data['sqrt_spending'] = np.sqrt(data['spending'])

print("Original dataset:")
print(data)
print("\nTransformed dataset:")
print(data[['age', 'income', 'sqrt_spending']])
In this example, our custom dataset has a variable called spending whose values grow much faster than the other variables, giving it a right-skewed spread. The square root transformation compresses the larger values, and the transformed values are stored in a new variable called sqrt_spending. After the transformation, the values range evenly from 1.0 to 6.0, making the variable more suitable for analysis.
The output of the above code in table form is shown below.
Original dataset
age | income | spending |
---|---|---|
20 | 50000 | 1 |
25 | 60000 | 4 |
30 | 70000 | 9 |
35 | 80000 | 16 |
40 | 90000 | 25 |
45 | 100000 | 36 |
Transformed dataset
age | income | sqrt_spending |
---|---|---|
20 | 50000 | 1.00000 |
25 | 60000 | 2.00000 |
30 | 70000 | 3.00000 |
35 | 80000 | 4.00000 |
40 | 90000 | 5.00000 |
45 | 100000 | 6.00000 |
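A log transformation works in the same spirit. Here’s a minimal sketch using NumPy’s log1p function, which computes log(1 + x) and so also copes with zero values; the spending column is reused from the example above:
import pandas as pd
import numpy as np

data = pd.DataFrame({'spending': [1, 4, 9, 16, 25, 36]})

# log1p compresses large values more strongly than small ones, reducing skewness
data['log_spending'] = np.log1p(data['spending'])
print(data)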
Data Integration
The data integration technique combines data from various sources into a single, unified view. This helps to increase the completeness and diversity of the data, as well as resolve any inconsistencies or conflicts that may exist between the different sources. Data integration is especially helpful for data mining, as it enables the analysis of data spread across multiple systems or platforms.
Let’s suppose we have two datasets. One contains customer IDs and their purchases, while the other dataset contains information on customer IDs and demographics, as given below. We intend to combine these two datasets for a more comprehensive customer behavior analysis.
Customer Purchase Dataset
Customer ID | Purchase Amount |
---|---|
1 | $50 |
2 | $100 |
3 | $75 |
4 | $200 |
Customer Demographics Dataset
Customer ID | Age | Gender |
---|---|---|
1 | 25 | Male |
2 | 35 | Female |
3 | 30 | Male |
4 | 40 | Female |
To integrate these datasets, we need to map the common variable, the customer ID, and combine the data. We can use the Pandas library in Python to accomplish this:
import pandas as pd

purchase_data = pd.DataFrame({'Customer ID': [1, 2, 3, 4],
                              'Purchase Amount': [50, 100, 75, 200]})
demographics_data = pd.DataFrame({'Customer ID': [1, 2, 3, 4],
                                  'Age': [25, 35, 30, 40],
                                  'Gender': ['Male', 'Female', 'Male', 'Female']})

# join the two datasets on the shared Customer ID column
merged_data = pd.merge(purchase_data, demographics_data, on='Customer ID')
print(merged_data)
The output of the above code in table form is shown below.
Customer ID | Purchase Amount | Age | Gender |
---|---|---|---|
1 | 50 | 25 | Male |
2 | 100 | 35 | Female |
3 | 75 | 30 | Male |
4 | 200 | 40 | Female |
We’ve used the merge() function from the Pandas library to merge the two datasets on the common Customer ID column. The result is a unified dataset containing both purchase information and customer demographics. This integrated dataset can now be used for more comprehensive analysis, such as analyzing purchasing patterns by age or gender.
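For instance, continuing from the merged_data DataFrame created in the previous snippet, a quick group-by sketch shows the average purchase amount for each gender:
# average purchase amount per gender, using merged_data from the example above
print(merged_data.groupby('Gender')['Purchase Amount'].mean())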
Data Reduction
Data reduction is one of the commonly used techniques in data preprocessing. It’s used when we have a lot of data containing plenty of irrelevant information. This method reduces the volume of data while preserving its most critical information.
There are different methods of data reduction, such as those listed below.
- Data cube aggregation involves summarizing or aggregating the data along multiple dimensions, such as time, location, product, and so on. This can help reduce the complexity and size of the data, as well as reveal higher-level patterns and trends.
- Dimensionality reduction involves reducing the number of attributes or features in the data by selecting a subset of relevant features or transforming the original features into a lower-dimensional space (see the sketch after this list). This can help remove noise and redundancy and improve the efficiency and accuracy of data mining algorithms.
- Data compression involves encoding the data in a more compact form, using techniques such as sampling, clustering, histogram analysis, wavelet analysis, and so on. This can help reduce the data’s storage space and transmission cost and speed up data processing.
- Numerosity reduction replaces the original data with a smaller representation, such as a parametric model (for example, regression or log-linear models) or a non-parametric model (such as histograms or clusters). This can help simplify the data structure and analysis and reduce the amount of data to be mined.
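As a small sketch of dimensionality reduction (using made-up, correlated data purely for illustration), Scikit-learn’s PCA class projects several features onto a handful of principal components:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# made-up dataset: five numeric features that are strongly correlated with each other
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = pd.DataFrame(base + rng.normal(scale=0.1, size=(100, 5)),
                 columns=[f'feature_{i}' for i in range(5)])

# keep two principal components instead of the original five features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance captured by each component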
Data preprocessing is essential, because the quality of the data directly affects the accuracy and reliability of the analysis or model. By properly preprocessing the data, we can improve the performance of the machine learning models and obtain more accurate insights from the data.
Conclusion
Preparing data for machine learning is like getting ready for a big party. Like cleaning and tidying up a room, data preprocessing involves fixing inconsistencies, filling in missing information, and ensuring that all data points are compatible. Using techniques such as data cleaning, data transformation, data integration, and data reduction, we create a well-prepared dataset that allows computers to identify patterns and learn effectively.
It’s recommended that we explore the data in depth, understand its patterns, and find the reasons for missingness before choosing an approach. Validation and test sets are also important for evaluating the performance of different techniques.