5. Reshaping Data with Pandas#
When working with data, it is common to encounter situations where the format of your DataFrame
is not ideal for analysis or visualization. Reshaping data can help you transform the structure of your data to better suit your needs. This recipe illustrates two powerful tools available in Pandas for reshaping data, namely
pivot
: This method allows you to “pivot” a DataFrame, turning unique values from one column into new columns and filling them with corresponding values from another column. It is especially useful for converting long-form data (also known as tidy data) into a wide format.melt
: This method is the inverse of pivot. It is used to “melt” a wide-format DataFrame into a long-format, where multiple columns are unpivoted into a single column, making it easier to work with or visualize when you have a large number of variables.
Getting ready#
Before starting, we are going to create two DataFrame
objects to work with.
df_vertical
this data is in a vertical/long format where each row represents a record of a single observation or instance, and each column represents a single variable.df_horizontal
this data is in horizontal/wide format where multiple columns/rows represent levels of a categorical variable.
import pandas as pd
import numpy as np
df_vertical = pd.DataFrame({
'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr'],
'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
'Product': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
'Sales': [250, 200, 260, 210, 270, 220, 180, 210],
'Costs': [130, 100, 130, 100, 120, 120, 100, 100]
})
df_horizontal = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
'North': [250, 260, 270, 180],
'South': [200, 210, 220, 210]
})
How to do it#
Use pivot
to go from vertical to horizontal#
Take a look at the
df_vertical
object
df_vertical
Month | Region | Product | Sales | Costs | |
---|---|---|---|---|---|
0 | Jan | North | A | 250 | 130 |
1 | Jan | South | A | 200 | 100 |
2 | Feb | North | A | 260 | 130 |
3 | Feb | South | A | 210 | 100 |
4 | Mar | North | A | 270 | 120 |
5 | Mar | South | A | 220 | 120 |
6 | Apr | North | B | 180 | 100 |
7 | Apr | South | B | 210 | 100 |
df_vertical.shape
(8, 5)
Convert the
df_vertical
into horizontal/wide format by using the Pandas methodpivot
. This method requires three inputs:
index
columns
values
which will determine the structure of the newDataFrame
df_vertical.pivot(index='Month', columns='Region', values='Sales')
Region | North | South |
---|---|---|
Month | ||
Apr | 180 | 210 |
Feb | 260 | 210 |
Jan | 250 | 200 |
Mar | 270 | 220 |
In this format each row represents a month, each column represents a region, and the sales values are filled in accordingly. Here, the Month
column becomes the index, the Region
column values become the column names, and the Sales
values are filled into the corresponding cells.
Note that the information in the columns Costs
and Product
is lost.
Pass a list into the input
values
to obtain a pivot with several values on it. In this case, we are going to passSales
andCosts
.
df_vertical.pivot(index='Month', columns='Region', values=['Sales', 'Costs'])
Sales | Costs | |||
---|---|---|---|---|
Region | North | South | North | South |
Month | ||||
Apr | 180 | 210 | 100 | 100 |
Feb | 260 | 210 | 130 | 100 |
Jan | 250 | 200 | 130 | 100 |
Mar | 270 | 220 | 120 | 120 |
Pass a list into the input
columns
to include the information about the product type
df_vertical.pivot(index='Month', columns=['Product', 'Region'], values=['Sales', 'Costs'])
Sales | Costs | |||||||
---|---|---|---|---|---|---|---|---|
Product | A | B | A | B | ||||
Region | North | South | North | South | North | South | North | South |
Month | ||||||||
Apr | NaN | NaN | 180.0 | 210.0 | NaN | NaN | 100.0 | 100.0 |
Feb | 260.0 | 210.0 | NaN | NaN | 130.0 | 100.0 | NaN | NaN |
Jan | 250.0 | 200.0 | NaN | NaN | 130.0 | 100.0 | NaN | NaN |
Mar | 270.0 | 220.0 | NaN | NaN | 120.0 | 120.0 | NaN | NaN |
Similarly, you can pass a list into the input
index
to include the information about the product type
df_vertical.pivot(index=['Month','Product'], columns=['Region'], values=['Sales', 'Costs'])
Sales | Costs | ||||
---|---|---|---|---|---|
Region | North | South | North | South | |
Month | Product | ||||
Apr | B | 180 | 210 | 100 | 100 |
Feb | A | 260 | 210 | 130 | 100 |
Jan | A | 250 | 200 | 130 | 100 |
Mar | A | 270 | 220 | 120 | 120 |
Use melt
to go from horizontal to vertical#
Take a look at the
df_horizontal
object
df_horizontal
Month | North | South | |
---|---|---|---|
0 | Jan | 250 | 200 |
1 | Feb | 260 | 210 |
2 | Mar | 270 | 220 |
3 | Apr | 180 | 210 |
Convert the
df_horizontal
into vertical/long format by using the Pandas functionmelt
pd.melt(df_horizontal)
variable | value | |
---|---|---|
0 | Month | Jan |
1 | Month | Feb |
2 | Month | Mar |
3 | Month | Apr |
4 | North | 250 |
5 | North | 260 |
6 | North | 270 |
7 | North | 180 |
8 | South | 200 |
9 | South | 210 |
10 | South | 220 |
11 | South | 210 |
To get a better result we need to pass the inputs
id_vars
value_vars
var_name
value_name
which will determine the structure of the new DataFrame
melted_df = pd.melt(df_horizontal, id_vars=['Month'], value_vars=['North', 'South'], var_name='Region', value_name='Sales')
melted_df
Month | Region | Sales | |
---|---|---|---|
0 | Jan | North | 250 |
1 | Feb | North | 260 |
2 | Mar | North | 270 |
3 | Apr | North | 180 |
4 | Jan | South | 200 |
5 | Feb | South | 210 |
6 | Mar | South | 220 |
7 | Apr | South | 210 |