4. Handling Duplicated and Missing Values#
Getting ready#
Before starting, we are going to create a DataFrame
containing duplicated and missing values.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Andrea', 'Bert', 'Carl', 'David', 'Elena', 'Andrea', 'Louis', 'Lucy', 'Mary', 'Zoe', 'Bert', 'Louis', 'Elena', 'Bert'],
'Surname': ['Smith', 'Williams', 'Addison', 'Brown', 'Miller', 'Smith', 'Taylor', 'Brown', 'Moore', 'Anderson', 'Steward', 'Smith', 'Brown', 'Williams'],
'Maths' : [10., np.nan, 9.0, 8.5, np.nan, 10., 10.0, 7.5, 9.0, 6.5, 7.5, 10.0, 8.5, np.nan],
'Spanish' : [8.0, 9.0, np.nan, 8.5, 9.0, 8.0, 10., 1.0, 7.5, 6.5, 8.5, 10.0, 9.5, 9.0],
'Chemistry' : [10., 9.0, 9.5, 8.5, 9.0, 10., 10.0, 7.5, np.nan, 6.5, 10., 10.0, 7.5, 9.0]
})
df.head()
Name | Surname | Maths | Spanish | Chemistry | |
---|---|---|---|---|---|
0 | Andrea | Smith | 10.0 | 8.0 | 10.0 |
1 | Bert | Williams | NaN | 9.0 | 9.0 |
2 | Carl | Addison | 9.0 | NaN | 9.5 |
3 | David | Brown | 8.5 | 8.5 | 8.5 |
4 | Elena | Miller | NaN | 9.0 | 9.0 |
How to do it#
Duplicates#
Check for duplicated rows by calling the method
duplicated
. This returns a pandas series with a flag which allows us to identify the duplicated values. By default, for each set of duplicated values, the first occurrence is set onFalse
and all others onTrue
.
df.duplicated()
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 True
dtype: bool
Inspect the duplicated rows in our
DataFrame
duplicated = df.duplicated()
df[duplicated]
Name | Surname | Maths | Spanish | Chemistry | |
---|---|---|---|---|---|
5 | Andrea | Smith | 10.0 | 8.0 | 10.0 |
13 | Bert | Williams | NaN | 9.0 | 9.0 |
Find duplicates on specific column(s) by using the input
subset
. For example, we can check for students with the same name as follows
duplicated_name =df.duplicated(subset=['Name'])
df[duplicated_name]
Name | Surname | Maths | Spanish | Chemistry | |
---|---|---|---|---|---|
5 | Andrea | Smith | 10.0 | 8.0 | 10.0 |
10 | Bert | Steward | 7.5 | 8.5 | 10.0 |
11 | Louis | Smith | 10.0 | 10.0 | 10.0 |
12 | Elena | Brown | 8.5 | 9.5 | 7.5 |
13 | Bert | Williams | NaN | 9.0 | 9.0 |
Find duplicated across a specifi set of columns. For example, we can check for students with the same name and surname as follows
duplicated_name =df.duplicated(subset=['Name', 'Surname'])
df[duplicated_name]
Name | Surname | Maths | Spanish | Chemistry | |
---|---|---|---|---|---|
5 | Andrea | Smith | 10.0 | 8.0 | 10.0 |
13 | Bert | Williams | NaN | 9.0 | 9.0 |
Remove the duplicated values from the
DataFrame
by calling the methoddrop_duplicates
df.drop_duplicates()
Name | Surname | Maths | Spanish | Chemistry | |
---|---|---|---|---|---|
0 | Andrea | Smith | 10.0 | 8.0 | 10.0 |
1 | Bert | Williams | NaN | 9.0 | 9.0 |
2 | Carl | Addison | 9.0 | NaN | 9.5 |
3 | David | Brown | 8.5 | 8.5 | 8.5 |
4 | Elena | Miller | NaN | 9.0 | 9.0 |
6 | Louis | Taylor | 10.0 | 10.0 | 10.0 |
7 | Lucy | Brown | 7.5 | 1.0 | 7.5 |
8 | Mary | Moore | 9.0 | 7.5 | NaN |
9 | Zoe | Anderson | 6.5 | 6.5 | 6.5 |
10 | Bert | Steward | 7.5 | 8.5 | 10.0 |
11 | Louis | Smith | 10.0 | 10.0 | 10.0 |
12 | Elena | Brown | 8.5 | 9.5 | 7.5 |
Missing Values#
Check if a particular column has missing values using the method
isna
df['Maths'].isna()
0 False
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 True
Name: Maths, dtype: bool
df[df['Maths'].isna()]
Name | Surname | Maths | Spanish | Chemistry | |
---|---|---|---|---|---|
1 | Bert | Williams | NaN | 9.0 | 9.0 |
4 | Elena | Miller | NaN | 9.0 | 9.0 |
13 | Bert | Williams | NaN | 9.0 | 9.0 |
Check for missing values in all columns of a
DataFrame
with the methodsisnull
andsum
df.isnull().sum()
Name 0
Surname 0
Maths 3
Spanish 1
Chemistry 1
dtype: int64
Drop the rows with missing values
df.dropna()
Name | Surname | Maths | Spanish | Chemistry | |
---|---|---|---|---|---|
0 | Andrea | Smith | 10.0 | 8.0 | 10.0 |
3 | David | Brown | 8.5 | 8.5 | 8.5 |
5 | Andrea | Smith | 10.0 | 8.0 | 10.0 |
6 | Louis | Taylor | 10.0 | 10.0 | 10.0 |
7 | Lucy | Brown | 7.5 | 1.0 | 7.5 |
9 | Zoe | Anderson | 6.5 | 6.5 | 6.5 |
10 | Bert | Steward | 7.5 | 8.5 | 10.0 |
11 | Louis | Smith | 10.0 | 10.0 | 10.0 |
12 | Elena | Brown | 8.5 | 9.5 | 7.5 |