4. Handling Duplicated and Missing Values#

Getting ready#

Before starting, we are going to create a DataFrame containing duplicated and missing values.

import pandas as pd
import numpy as np
df = pd.DataFrame({    
    'Name': ['Andrea', 'Bert', 'Carl', 'David', 'Elena', 'Andrea', 'Louis', 'Lucy', 'Mary', 'Zoe', 'Bert', 'Louis', 'Elena', 'Bert'],         
 'Surname': ['Smith', 'Williams', 'Addison', 'Brown', 'Miller',  'Smith', 'Taylor', 'Brown', 'Moore', 'Anderson', 'Steward', 'Smith', 'Brown', 'Williams'],         

    'Maths' : [10.,  np.nan, 9.0, 8.5, np.nan, 10., 10.0, 7.5, 9.0, 6.5, 7.5, 10.0, 8.5, np.nan], 
    'Spanish' : [8.0, 9.0, np.nan, 8.5, 9.0, 8.0, 10., 1.0, 7.5, 6.5, 8.5, 10.0, 9.5, 9.0],
    'Chemistry' :  [10., 9.0, 9.5, 8.5, 9.0,  10., 10.0, 7.5, np.nan, 6.5, 10., 10.0, 7.5, 9.0]
})
df.head()
Name Surname Maths Spanish Chemistry
0 Andrea Smith 10.0 8.0 10.0
1 Bert Williams NaN 9.0 9.0
2 Carl Addison 9.0 NaN 9.5
3 David Brown 8.5 8.5 8.5
4 Elena Miller NaN 9.0 9.0

How to do it#

Duplicates#

  1. Check for duplicated rows by calling the method duplicated. This returns a pandas series with a flag which allows us to identify the duplicated values. By default, for each set of duplicated values, the first occurrence is set on False and all others on True.

df.duplicated()
0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
dtype: bool
  1. Inspect the duplicated rows in our DataFrame

duplicated = df.duplicated()
df[duplicated]
Name Surname Maths Spanish Chemistry
5 Andrea Smith 10.0 8.0 10.0
13 Bert Williams NaN 9.0 9.0
  1. Find duplicates on specific column(s) by using the input subset. For example, we can check for students with the same name as follows

duplicated_name =df.duplicated(subset=['Name'])
df[duplicated_name]
Name Surname Maths Spanish Chemistry
5 Andrea Smith 10.0 8.0 10.0
10 Bert Steward 7.5 8.5 10.0
11 Louis Smith 10.0 10.0 10.0
12 Elena Brown 8.5 9.5 7.5
13 Bert Williams NaN 9.0 9.0
  1. Find duplicated across a specifi set of columns. For example, we can check for students with the same name and surname as follows

duplicated_name =df.duplicated(subset=['Name', 'Surname'])
df[duplicated_name]
Name Surname Maths Spanish Chemistry
5 Andrea Smith 10.0 8.0 10.0
13 Bert Williams NaN 9.0 9.0
  1. Remove the duplicated values from the DataFrame by calling the method drop_duplicates

df.drop_duplicates()
Name Surname Maths Spanish Chemistry
0 Andrea Smith 10.0 8.0 10.0
1 Bert Williams NaN 9.0 9.0
2 Carl Addison 9.0 NaN 9.5
3 David Brown 8.5 8.5 8.5
4 Elena Miller NaN 9.0 9.0
6 Louis Taylor 10.0 10.0 10.0
7 Lucy Brown 7.5 1.0 7.5
8 Mary Moore 9.0 7.5 NaN
9 Zoe Anderson 6.5 6.5 6.5
10 Bert Steward 7.5 8.5 10.0
11 Louis Smith 10.0 10.0 10.0
12 Elena Brown 8.5 9.5 7.5

Missing Values#

  1. Check if a particular column has missing values using the method isna

df['Maths'].isna()
0     False
1      True
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
Name: Maths, dtype: bool
df[df['Maths'].isna()]
Name Surname Maths Spanish Chemistry
1 Bert Williams NaN 9.0 9.0
4 Elena Miller NaN 9.0 9.0
13 Bert Williams NaN 9.0 9.0
  1. Check for missing values in all columns of a DataFrame with the methods isnull and sum

df.isnull().sum()
Name         0
Surname      0
Maths        3
Spanish      1
Chemistry    1
dtype: int64
  1. Drop the rows with missing values

df.dropna()
Name Surname Maths Spanish Chemistry
0 Andrea Smith 10.0 8.0 10.0
3 David Brown 8.5 8.5 8.5
5 Andrea Smith 10.0 8.0 10.0
6 Louis Taylor 10.0 10.0 10.0
7 Lucy Brown 7.5 1.0 7.5
9 Zoe Anderson 6.5 6.5 6.5
10 Bert Steward 7.5 8.5 10.0
11 Louis Smith 10.0 10.0 10.0
12 Elena Brown 8.5 9.5 7.5