5. Reshaping Data with Pandas#
When working with data, it is common to encounter situations where the format of your DataFrame is not ideal for analysis or visualization. Reshaping data can help you transform the structure of your data to better suit your needs. This recipe illustrates two powerful tools available in Pandas for reshaping data, namely
pivot: This method allows you to “pivot” a DataFrame, turning unique values from one column into new columns and filling them with corresponding values from another column. It is especially useful for converting long-form data (also known as tidy data) into a wide format.melt: This method is the inverse of pivot. It is used to “melt” a wide-format DataFrame into a long-format, where multiple columns are unpivoted into a single column, making it easier to work with or visualize when you have a large number of variables.
Getting ready#
Before starting, we are going to create two DataFrame objects to work with.
df_verticalthis data is in a vertical/long format where each row represents a record of a single observation or instance, and each column represents a single variable.df_horizontalthis data is in horizontal/wide format where multiple columns/rows represent levels of a categorical variable.
import pandas as pd
import numpy as np
df_vertical = pd.DataFrame({
'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr'],
'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
'Product': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
'Sales': [250, 200, 260, 210, 270, 220, 180, 210],
'Costs': [130, 100, 130, 100, 120, 120, 100, 100]
})
df_horizontal = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar', 'Apr'],
'North': [250, 260, 270, 180],
'South': [200, 210, 220, 210]
})
How to do it#
Use pivot to go from vertical to horizontal#
Take a look at the
df_verticalobject
df_vertical
| Month | Region | Product | Sales | Costs | |
|---|---|---|---|---|---|
| 0 | Jan | North | A | 250 | 130 |
| 1 | Jan | South | A | 200 | 100 |
| 2 | Feb | North | A | 260 | 130 |
| 3 | Feb | South | A | 210 | 100 |
| 4 | Mar | North | A | 270 | 120 |
| 5 | Mar | South | A | 220 | 120 |
| 6 | Apr | North | B | 180 | 100 |
| 7 | Apr | South | B | 210 | 100 |
df_vertical.shape
(8, 5)
Convert the
df_verticalinto horizontal/wide format by using the Pandas methodpivot. This method requires three inputs:
indexcolumnsvalueswhich will determine the structure of the newDataFrame
df_vertical.pivot(index='Month', columns='Region', values='Sales')
| Region | North | South |
|---|---|---|
| Month | ||
| Apr | 180 | 210 |
| Feb | 260 | 210 |
| Jan | 250 | 200 |
| Mar | 270 | 220 |
In this format each row represents a month, each column represents a region, and the sales values are filled in accordingly. Here, the Month column becomes the index, the Region column values become the column names, and the Sales values are filled into the corresponding cells.
Note that the information in the columns Costs and Product is lost.
Pass a list into the input
valuesto obtain a pivot with several values on it. In this case, we are going to passSalesandCosts.
df_vertical.pivot(index='Month', columns='Region', values=['Sales', 'Costs'])
| Sales | Costs | |||
|---|---|---|---|---|
| Region | North | South | North | South |
| Month | ||||
| Apr | 180 | 210 | 100 | 100 |
| Feb | 260 | 210 | 130 | 100 |
| Jan | 250 | 200 | 130 | 100 |
| Mar | 270 | 220 | 120 | 120 |
Pass a list into the input
columnsto include the information about the product type
df_vertical.pivot(index='Month', columns=['Product', 'Region'], values=['Sales', 'Costs'])
| Sales | Costs | |||||||
|---|---|---|---|---|---|---|---|---|
| Product | A | B | A | B | ||||
| Region | North | South | North | South | North | South | North | South |
| Month | ||||||||
| Apr | NaN | NaN | 180.0 | 210.0 | NaN | NaN | 100.0 | 100.0 |
| Feb | 260.0 | 210.0 | NaN | NaN | 130.0 | 100.0 | NaN | NaN |
| Jan | 250.0 | 200.0 | NaN | NaN | 130.0 | 100.0 | NaN | NaN |
| Mar | 270.0 | 220.0 | NaN | NaN | 120.0 | 120.0 | NaN | NaN |
Similarly, you can pass a list into the input
indexto include the information about the product type
df_vertical.pivot(index=['Month','Product'], columns=['Region'], values=['Sales', 'Costs'])
| Sales | Costs | ||||
|---|---|---|---|---|---|
| Region | North | South | North | South | |
| Month | Product | ||||
| Apr | B | 180 | 210 | 100 | 100 |
| Feb | A | 260 | 210 | 130 | 100 |
| Jan | A | 250 | 200 | 130 | 100 |
| Mar | A | 270 | 220 | 120 | 120 |
Use melt to go from horizontal to vertical#
Take a look at the
df_horizontalobject
df_horizontal
| Month | North | South | |
|---|---|---|---|
| 0 | Jan | 250 | 200 |
| 1 | Feb | 260 | 210 |
| 2 | Mar | 270 | 220 |
| 3 | Apr | 180 | 210 |
Convert the
df_horizontalinto vertical/long format by using the Pandas functionmelt
pd.melt(df_horizontal)
| variable | value | |
|---|---|---|
| 0 | Month | Jan |
| 1 | Month | Feb |
| 2 | Month | Mar |
| 3 | Month | Apr |
| 4 | North | 250 |
| 5 | North | 260 |
| 6 | North | 270 |
| 7 | North | 180 |
| 8 | South | 200 |
| 9 | South | 210 |
| 10 | South | 220 |
| 11 | South | 210 |
To get a better result we need to pass the inputs
id_varsvalue_varsvar_namevalue_name
which will determine the structure of the new DataFrame
melted_df = pd.melt(df_horizontal, id_vars=['Month'], value_vars=['North', 'South'], var_name='Region', value_name='Sales')
melted_df
| Month | Region | Sales | |
|---|---|---|---|
| 0 | Jan | North | 250 |
| 1 | Feb | North | 260 |
| 2 | Mar | North | 270 |
| 3 | Apr | North | 180 |
| 4 | Jan | South | 200 |
| 5 | Feb | South | 210 |
| 6 | Mar | South | 220 |
| 7 | Apr | South | 210 |