Data Cleaning in Python Using pandas
Real-world data sets rarely come clean, so we have to process or clean them before they are suitable for analysis. Data cleaning involves filling data gaps, correcting errors, joining data sets, aggregating data, and creating new fields.
In this blog, we are going to discuss some simple data cleaning methods using pandas.
This is the data set we are going to use in this blog. It is a very simple data set of 18 rows named marks_data, but in the real world, we deal with much larger datasets.
First, we import the packages we are going to use.
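The sketch below sets up the imports and a small stand-in version of marks_data to follow along with. The column names (Name, Subject, Marks) and most of the values are assumptions made for illustration; only the trouble spots at indexes 3, 5, 8, 12, 14, 16, and 17 mirror the issues described in this blog.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the 18-row marks_data set; the real data
# is in the notebook linked at the end of this blog.
marks_data = pd.DataFrame({
    'Name': ['Alex', 'Brian', 'Carol', 'Dan', 'Eva', 'Frank',
             'Grace', 'Henry', np.nan, 'Ivy', 'Jack', 'Karen',
             'Claire', 'Liam', 'Claire', 'Mia', np.nan, 'Noah'],
    'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Mathematics',
                'English', 'Science', 'Math', 'History', 'English', 'Science',
                'Math', 'History', 'Math', 'English', 'Science', 'Math'],
    # Marks are strings because two cells contain stray letters,
    # which forces the whole column to text.
    'Marks': ['65', '80', '90', '7o', '55', '72',
              '88', '61', '77', '69', '94', '58',
              '83', '70', '83', '66', '75', '8o'],
})
```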
Dealing with missing values
The way we deal with missing values depends on how well we understand our data, and the right strategy depends on the context: for example, is the data missing because it was never entered, or because it was unavailable? pandas represents missing values with NaN.
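Before choosing a strategy, it helps to see where the gaps actually are. A quick sketch, using the stand-in data above:

```python
# Count missing values per column, then show the affected rows.
print(marks_data.isna().sum())
print(marks_data[marks_data.isna().any(axis=1)])  # rows 8 and 16
```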
Filling in missing values with a value
We are going to create a new data frame called cleaned_data where we shall store our cleaned data.
We have missing values (NaN) at indexes 8 and 16. We are going to fill them in with “unknown” using the fillna() function.
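A minimal sketch, assuming the NaN values sit in a text column so that the string “unknown” is a sensible placeholder:

```python
# fillna() returns a new data frame, so the raw marks_data stays intact.
cleaned_data = marks_data.fillna('unknown')
print(cleaned_data.loc[[8, 16]])  # the gaps now read "unknown"
```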
Displaying cleaned_data, we see that the NaN values at indexes 8 and 16 have been replaced with “unknown”.
Dropping the rows with missing data using the dropna() function
The other approach to dealing with missing values is to drop the rows that contain them. We are going to drop the rows with NaN at indexes 8 and 16 using the code below.
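A sketch of this alternative, applied to the raw marks_data rather than the filled-in copy:

```python
# dropna() returns a new data frame with every row containing a NaN
# removed; the surviving rows keep their original index labels.
no_missing = marks_data.dropna()
print(no_missing.index)  # 8 and 16 no longer appear
```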
In the resulting data frame, we see that indexes 8 and 16 have been dropped.
Dealing with duplicate values
If you are keen enough, you will have noticed that Claire appears twice, at indexes 12 and 14. Duplicate values are not easy to spot even in a dataset this small, and they are far harder to spot in very large datasets.
The code used to find duplicate values is DataFrame.duplicated().
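A sketch, continuing with the cleaned_data frame from above:

```python
# duplicated() returns a boolean Series: True for every row that
# exactly repeats an earlier row. Filtering with it shows the repeats.
print(cleaned_data.duplicated())
print(cleaned_data[cleaned_data.duplicated()])  # shows index 14
```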
The usual strategy for dealing with duplicate values is to drop them, because the extra rows add nothing new.
We shall use the drop_duplicates() function, which finds the duplicates and drops them.
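```python
# drop_duplicates() keeps the first occurrence (index 12) and
# drops the repeat (index 14).
cleaned_data = cleaned_data.drop_duplicates()
```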
From our output, we see that index 14 has been dropped.
Fixing inconsistent labels
You may also have noticed that our dataset has inconsistent labels. For example, we have both Math and Mathematics. If we ran code to find the different subjects in the dataset, we would find five instead of four, because the computer treats Math and Mathematics as two different subjects. As a human being, though, you can tell that they are the same.
The code below gets us the list of the unique subjects in the dataset and then sorts it.
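```python
# List the distinct subject labels, then sort them for easy reading.
subjects = cleaned_data['Subject'].unique()
print(sorted(subjects))  # five labels: Math and Mathematics both appear
```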
We are going to use the update() method. The code below changes the value in the Subject column at index 5 from Mathematics to Math.
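A sketch of one way to do this with update(), assuming the inconsistent label sits at index 5 as in our data:

```python
# update() overwrites matching cells by index and column; a one-cell
# data frame targets row 5 of the Subject column.
fix = pd.DataFrame({'Subject': ['Math']}, index=[5])
cleaned_data.update(fix)
print(cleaned_data.loc[5, 'Subject'])  # Math
```

If a label appeared many times, cleaned_data['Subject'].replace('Mathematics', 'Math') would fix every occurrence at once instead of targeting a single index.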
So we see that the subject at index 5 was changed from Mathematics to Math.
Fixing letters within numbers
If you are keen, you might have noticed that there are letters within the numbers, for example at indexes 3 and 17. So if you tried to sum the column, you would get an error.
So we shall use the update() method from above to fix that problem.
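A sketch along the same lines; the corrected value at index 17 is a guess made for illustration:

```python
# Overwrite the corrupted cells, then convert the column to real
# numbers so that arithmetic works.
fix = pd.DataFrame({'Marks': ['70', '80']}, index=[3, 17])
cleaned_data.update(fix)
cleaned_data['Marks'] = pd.to_numeric(cleaned_data['Marks'])
print(cleaned_data['Marks'].sum())  # summing now works
```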
We see that the value at index 3 has been updated from 7o to 70.
I hope you found this blog helpful. You can check out my repository on GitHub showing how I did the data cleaning: https://github.com/CynthiaKuzira/personal-work/blob/Main/Marks%20data.ipynb