{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reshaping Data with Pandas\n", "\n", "When working with data, it is common to encounter situations where the format of your `DataFrame` is not ideal for analysis or visualization. Reshaping data can help you transform the structure of your data to better suit your needs. This recipe illustrates two powerful tools available in Pandas for reshaping data, namely\n", "\n", "\n", "- `pivot`: This method allows you to \"pivot\" a DataFrame, turning unique values from one column into new columns and filling them with corresponding values from another column. It is especially useful for converting long-form data (also known as tidy data) into a wide format.\n", " \n", "- `melt`: This method is the inverse of pivot. It is used to \"melt\" a wide-format DataFrame into a long-format, where multiple columns are unpivoted into a single column, making it easier to work with or visualize when you have a large number of variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting ready\n", "Before starting, we are going to create two `DataFrame` objects to work with.\n", "\n", "- `df_vertical` this data is in a vertical/long format where each row represents a record of a single observation or instance, and each column represents a single variable. \n", " \n", "- `df_horizontal` this data is in horizontal/wide format where multiple columns/rows represent levels of a categorical variable." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df_vertical = pd.DataFrame({\n", " 'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr'],\n", " 'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],\n", " 'Product': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],\n", " 'Sales': [250, 200, 260, 210, 270, 220, 180, 210],\n", " 'Costs': [130, 100, 130, 100, 120, 120, 100, 100]\n", "})" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df_horizontal = pd.DataFrame({\n", " 'Month': ['Jan', 'Feb', 'Mar', 'Apr'],\n", " 'North': [250, 260, 270, 180],\n", " 'South': [200, 210, 220, 210]\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to do it \n", "\n", "\n", "### Use `pivot` to go from vertical to horizontal\n", "\n", "1. Take a look at the `df_vertical` object" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MonthRegionProductSalesCosts
0JanNorthA250130
1JanSouthA200100
2FebNorthA260130
3FebSouthA210100
4MarNorthA270120
5MarSouthA220120
6AprNorthB180100
7AprSouthB210100
\n", "
" ], "text/plain": [ " Month Region Product Sales Costs\n", "0 Jan North A 250 130\n", "1 Jan South A 200 100\n", "2 Feb North A 260 130\n", "3 Feb South A 210 100\n", "4 Mar North A 270 120\n", "5 Mar South A 220 120\n", "6 Apr North B 180 100\n", "7 Apr South B 210 100" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_vertical" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8, 5)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_vertical.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Convert the `df_vertical` into horizontal/wide format by using the Pandas method `pivot`. This method requires three inputs:\n", "- `index`\n", "- `columns`\n", "- `values`\n", "which will determine the structure of the new `DataFrame`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RegionNorthSouth
Month
Apr180210
Feb260210
Jan250200
Mar270220
\n", "
" ], "text/plain": [ "Region North South\n", "Month \n", "Apr 180 210\n", "Feb 260 210\n", "Jan 250 200\n", "Mar 270 220" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_vertical.pivot(index='Month', columns='Region', values='Sales')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this format each row represents a month, each column represents a region, and the sales values are filled in accordingly. Here, the `Month` column becomes the index, the `Region` column values become the column names, and the `Sales` values are filled into the corresponding cells. \n", "\n", "Note that the information in the columns `Costs` and `Product` is lost." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Pass a list into the input `values` to obtain a pivot with several values on it. In this case, we are going to pass `Sales` and `Costs`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SalesCosts
RegionNorthSouthNorthSouth
Month
Apr180210100100
Feb260210130100
Jan250200130100
Mar270220120120
\n", "
" ], "text/plain": [ " Sales Costs \n", "Region North South North South\n", "Month \n", "Apr 180 210 100 100\n", "Feb 260 210 130 100\n", "Jan 250 200 130 100\n", "Mar 270 220 120 120" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_vertical.pivot(index='Month', columns='Region', values=['Sales', 'Costs'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Pass a list into the input `columns` to include the information about the product type" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SalesCosts
ProductABAB
RegionNorthSouthNorthSouthNorthSouthNorthSouth
Month
AprNaNNaN180.0210.0NaNNaN100.0100.0
Feb260.0210.0NaNNaN130.0100.0NaNNaN
Jan250.0200.0NaNNaN130.0100.0NaNNaN
Mar270.0220.0NaNNaN120.0120.0NaNNaN
\n", "
" ], "text/plain": [ " Sales Costs \n", "Product A B A B \n", "Region North South North South North South North South\n", "Month \n", "Apr NaN NaN 180.0 210.0 NaN NaN 100.0 100.0\n", "Feb 260.0 210.0 NaN NaN 130.0 100.0 NaN NaN\n", "Jan 250.0 200.0 NaN NaN 130.0 100.0 NaN NaN\n", "Mar 270.0 220.0 NaN NaN 120.0 120.0 NaN NaN" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_vertical.pivot(index='Month', columns=['Product', 'Region'], values=['Sales', 'Costs'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Similarly, you can pass a list into the input `index` to include the information about the product type" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SalesCosts
RegionNorthSouthNorthSouth
MonthProduct
AprB180210100100
FebA260210130100
JanA250200130100
MarA270220120120
\n", "
" ], "text/plain": [ " Sales Costs \n", "Region North South North South\n", "Month Product \n", "Apr B 180 210 100 100\n", "Feb A 260 210 130 100\n", "Jan A 250 200 130 100\n", "Mar A 270 220 120 120" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_vertical.pivot(index=['Month','Product'], columns=['Region'], values=['Sales', 'Costs'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Use `melt` to go from horizontal to vertical" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Take a look at the `df_horizontal` object" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MonthNorthSouth
0Jan250200
1Feb260210
2Mar270220
3Apr180210
\n", "
" ], "text/plain": [ " Month North South\n", "0 Jan 250 200\n", "1 Feb 260 210\n", "2 Mar 270 220\n", "3 Apr 180 210" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_horizontal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Convert the `df_horizontal` into vertical/long format by using the Pandas function `melt`" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variablevalue
0MonthJan
1MonthFeb
2MonthMar
3MonthApr
4North250
5North260
6North270
7North180
8South200
9South210
10South220
11South210
\n", "
" ], "text/plain": [ " variable value\n", "0 Month Jan\n", "1 Month Feb\n", "2 Month Mar\n", "3 Month Apr\n", "4 North 250\n", "5 North 260\n", "6 North 270\n", "7 North 180\n", "8 South 200\n", "9 South 210\n", "10 South 220\n", "11 South 210" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.melt(df_horizontal)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. To get a better result we need to pass the inputs\n", " \n", "- `id_vars`\n", "- `value_vars`\n", "- `var_name`\n", "- `value_name`\n", "\n", "which will determine the structure of the new `DataFrame`" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MonthRegionSales
0JanNorth250
1FebNorth260
2MarNorth270
3AprNorth180
4JanSouth200
5FebSouth210
6MarSouth220
7AprSouth210
\n", "
" ], "text/plain": [ " Month Region Sales\n", "0 Jan North 250\n", "1 Feb North 260\n", "2 Mar North 270\n", "3 Apr North 180\n", "4 Jan South 200\n", "5 Feb South 210\n", "6 Mar South 220\n", "7 Apr South 210" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "melted_df = pd.melt(df_horizontal, id_vars=['Month'], value_vars=['North', 'South'], var_name='Region', value_name='Sales')\n", "melted_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }