{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Handling Duplicated and Missing Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting ready\n", "Before starting, we are going to create a `DataFrame` containing duplicated and missing values." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSurnameMathsSpanishChemistry
0AndreaSmith10.08.010.0
1BertWilliamsNaN9.09.0
2CarlAddison9.0NaN9.5
3DavidBrown8.58.58.5
4ElenaMillerNaN9.09.0
\n", "
" ], "text/plain": [ " Name Surname Maths Spanish Chemistry\n", "0 Andrea Smith 10.0 8.0 10.0\n", "1 Bert Williams NaN 9.0 9.0\n", "2 Carl Addison 9.0 NaN 9.5\n", "3 David Brown 8.5 8.5 8.5\n", "4 Elena Miller NaN 9.0 9.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({ \n", " 'Name': ['Andrea', 'Bert', 'Carl', 'David', 'Elena', 'Andrea', 'Louis', 'Lucy', 'Mary', 'Zoe', 'Bert', 'Louis', 'Elena', 'Bert'], \n", " 'Surname': ['Smith', 'Williams', 'Addison', 'Brown', 'Miller', 'Smith', 'Taylor', 'Brown', 'Moore', 'Anderson', 'Steward', 'Smith', 'Brown', 'Williams'], \n", "\n", " 'Maths' : [10., np.nan, 9.0, 8.5, np.nan, 10., 10.0, 7.5, 9.0, 6.5, 7.5, 10.0, 8.5, np.nan], \n", " 'Spanish' : [8.0, 9.0, np.nan, 8.5, 9.0, 8.0, 10., 1.0, 7.5, 6.5, 8.5, 10.0, 9.5, 9.0],\n", " 'Chemistry' : [10., 9.0, 9.5, 8.5, 9.0, 10., 10.0, 7.5, np.nan, 6.5, 10., 10.0, 7.5, 9.0]\n", "})\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to do it \n", "\n", "### Duplicates\n", "\n", "1. Check for duplicated rows by calling the method `duplicated`. This returns a pandas series with a flag which allows us to identify the duplicated values. By default, for each set of duplicated values, the first occurrence is set on `False` and all others on `True`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 True\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 True\n", "dtype: bool" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.duplicated()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Inspect the duplicated rows in our `DataFrame`" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSurnameMathsSpanishChemistry
5AndreaSmith10.08.010.0
13BertWilliamsNaN9.09.0
\n", "
" ], "text/plain": [ " Name Surname Maths Spanish Chemistry\n", "5 Andrea Smith 10.0 8.0 10.0\n", "13 Bert Williams NaN 9.0 9.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "duplicated = df.duplicated()\n", "df[duplicated]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Find duplicates on specific column(s) by using the input `subset`. \n", "For example, we can check for students with the same name as follows" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSurnameMathsSpanishChemistry
5AndreaSmith10.08.010.0
10BertSteward7.58.510.0
11LouisSmith10.010.010.0
12ElenaBrown8.59.57.5
13BertWilliamsNaN9.09.0
\n", "
" ], "text/plain": [ " Name Surname Maths Spanish Chemistry\n", "5 Andrea Smith 10.0 8.0 10.0\n", "10 Bert Steward 7.5 8.5 10.0\n", "11 Louis Smith 10.0 10.0 10.0\n", "12 Elena Brown 8.5 9.5 7.5\n", "13 Bert Williams NaN 9.0 9.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "duplicated_name =df.duplicated(subset=['Name'])\n", "df[duplicated_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Find duplicated across a specifi set of columns. For example, we can check for students with the same name and surname as follows" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSurnameMathsSpanishChemistry
5AndreaSmith10.08.010.0
13BertWilliamsNaN9.09.0
\n", "
" ], "text/plain": [ " Name Surname Maths Spanish Chemistry\n", "5 Andrea Smith 10.0 8.0 10.0\n", "13 Bert Williams NaN 9.0 9.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "duplicated_name =df.duplicated(subset=['Name', 'Surname'])\n", "df[duplicated_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Remove the duplicated values from the `DataFrame` by calling the method `drop_duplicates`" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSurnameMathsSpanishChemistry
0AndreaSmith10.08.010.0
1BertWilliamsNaN9.09.0
2CarlAddison9.0NaN9.5
3DavidBrown8.58.58.5
4ElenaMillerNaN9.09.0
6LouisTaylor10.010.010.0
7LucyBrown7.51.07.5
8MaryMoore9.07.5NaN
9ZoeAnderson6.56.56.5
10BertSteward7.58.510.0
11LouisSmith10.010.010.0
12ElenaBrown8.59.57.5
\n", "
" ], "text/plain": [ " Name Surname Maths Spanish Chemistry\n", "0 Andrea Smith 10.0 8.0 10.0\n", "1 Bert Williams NaN 9.0 9.0\n", "2 Carl Addison 9.0 NaN 9.5\n", "3 David Brown 8.5 8.5 8.5\n", "4 Elena Miller NaN 9.0 9.0\n", "6 Louis Taylor 10.0 10.0 10.0\n", "7 Lucy Brown 7.5 1.0 7.5\n", "8 Mary Moore 9.0 7.5 NaN\n", "9 Zoe Anderson 6.5 6.5 6.5\n", "10 Bert Steward 7.5 8.5 10.0\n", "11 Louis Smith 10.0 10.0 10.0\n", "12 Elena Brown 8.5 9.5 7.5" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing Values\n", "\n", "6. Check if a particular column has missing values using the method `isna`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 False\n", "4 True\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 True\n", "Name: Maths, dtype: bool" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Maths'].isna()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSurnameMathsSpanishChemistry
1BertWilliamsNaN9.09.0
4ElenaMillerNaN9.09.0
13BertWilliamsNaN9.09.0
\n", "
" ], "text/plain": [ " Name Surname Maths Spanish Chemistry\n", "1 Bert Williams NaN 9.0 9.0\n", "4 Elena Miller NaN 9.0 9.0\n", "13 Bert Williams NaN 9.0 9.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Maths'].isna()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Check for missing values in all columns of a `DataFrame` with the methods `isnull` and `sum`" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name 0\n", "Surname 0\n", "Maths 3\n", "Spanish 1\n", "Chemistry 1\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. Drop the rows with missing values" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSurnameMathsSpanishChemistry
0AndreaSmith10.08.010.0
3DavidBrown8.58.58.5
5AndreaSmith10.08.010.0
6LouisTaylor10.010.010.0
7LucyBrown7.51.07.5
9ZoeAnderson6.56.56.5
10BertSteward7.58.510.0
11LouisSmith10.010.010.0
12ElenaBrown8.59.57.5
\n", "
" ], "text/plain": [ " Name Surname Maths Spanish Chemistry\n", "0 Andrea Smith 10.0 8.0 10.0\n", "3 David Brown 8.5 8.5 8.5\n", "5 Andrea Smith 10.0 8.0 10.0\n", "6 Louis Taylor 10.0 10.0 10.0\n", "7 Lucy Brown 7.5 1.0 7.5\n", "9 Zoe Anderson 6.5 6.5 6.5\n", "10 Bert Steward 7.5 8.5 10.0\n", "11 Louis Smith 10.0 10.0 10.0\n", "12 Elena Brown 8.5 9.5 7.5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }