Selecting and Manipulating Data Frames

2. Selecting and Manipulating Data Frames#

Getting ready#

Import both pandas and numpy libraries

import pandas as pd
import numpy as np

Create a range of dates using date_range. We will use this as the index of our DataFrame.

dates = pd.date_range(start='1/1/2020', periods=8)

Create a numpy of size (8, 4) with random numbers. We will use these numbers as the content of our DataFrame

data = np.random.randn(8, 4)

Create a DataFrame with index and contex given by dates and data, respectively. We will specify the names of the columns via the input columns.

df = pd.DataFrame(data, index=dates, columns=['Product_A', 'Product_B', 'Product_C', 'Product_D'])
df

	Product_A	Product_B	Product_C	Product_D
2020-01-01	-1.793309	-0.397905	1.525082	-0.471456
2020-01-02	0.566669	-0.032985	0.690556	0.734600
2020-01-03	2.052083	-0.021206	0.233416	0.869193
2020-01-04	0.817791	-0.171998	0.433515	1.062368
2020-01-05	0.007632	1.228028	0.779766	-0.489796
2020-01-06	0.172650	0.810259	0.172449	0.764088
2020-01-07	0.398666	0.960982	-0.418191	-0.795979
2020-01-08	0.632870	0.807580	0.422173	0.542766

How to do it#

Select a column of the DataFrame by simply specifying its name as follows

df['Product_A']

2020-01-01   -1.793309
2020-01-02    0.566669
2020-01-03    2.052083
2020-01-04    0.817791
2020-01-05    0.007632
2020-01-06    0.172650
2020-01-07    0.398666
2020-01-08    0.632870
Freq: D, Name: Product_A, dtype: float64

df.Product_A

2020-01-01   -1.793309
2020-01-02    0.566669
2020-01-03    2.052083
2020-01-04    0.817791
2020-01-05    0.007632
2020-01-06    0.172650
2020-01-07    0.398666
2020-01-08    0.632870
Freq: D, Name: Product_A, dtype: float64

Select a column by using the method loc

df.loc[:,'Product_A']

2020-01-01   -1.793309
2020-01-02    0.566669
2020-01-03    2.052083
2020-01-04    0.817791
2020-01-05    0.007632
2020-01-06    0.172650
2020-01-07    0.398666
2020-01-08    0.632870
Freq: D, Name: Product_A, dtype: float64

Select several columns using the method loc

df.loc[:,['Product_A', 'Product_B']]

	Product_A	Product_B
2020-01-01	-1.793309	-0.397905
2020-01-02	0.566669	-0.032985
2020-01-03	2.052083	-0.021206
2020-01-04	0.817791	-0.171998
2020-01-05	0.007632	1.228028
2020-01-06	0.172650	0.810259
2020-01-07	0.398666	0.960982
2020-01-08	0.632870	0.807580

Select a set of rows and columns using the method loc

df.loc[dates[5],['Product_A', 'Product_B']]

Product_A    0.172650
Product_B    0.810259
Name: 2020-01-06 00:00:00, dtype: float64

df.loc[dates[5] :,'Product_B': ]

	Product_B	Product_C	Product_D
2020-01-06	0.810259	0.172449	0.764088
2020-01-07	0.960982	-0.418191	-0.795979
2020-01-08	0.807580	0.422173	0.542766

Select a column by using .iloc. In this case, we have to pass the position of the column

df.iloc[:,1]

2020-01-01   -0.397905
2020-01-02   -0.032985
2020-01-03   -0.021206
2020-01-04   -0.171998
2020-01-05    1.228028
2020-01-06    0.810259
2020-01-07    0.960982
2020-01-08    0.807580
Freq: D, Name: Product_B, dtype: float64

Select several columns by using the method iloc

df.iloc[:,1:]

	Product_B	Product_C	Product_D
2020-01-01	-0.397905	1.525082	-0.471456
2020-01-02	-0.032985	0.690556	0.734600
2020-01-03	-0.021206	0.233416	0.869193
2020-01-04	-0.171998	0.433515	1.062368
2020-01-05	1.228028	0.779766	-0.489796
2020-01-06	0.810259	0.172449	0.764088
2020-01-07	0.960982	-0.418191	-0.795979
2020-01-08	0.807580	0.422173	0.542766

Select a row by using the method iloc

df.iloc[0,:]

Product_A   -1.793309
Product_B   -0.397905
Product_C    1.525082
Product_D   -0.471456
Name: 2020-01-01 00:00:00, dtype: float64

Reset the index of the DataFrame

df.reset_index()

	index	Product_A	Product_B	Product_C	Product_D
0	2020-01-01	-1.793309	-0.397905	1.525082	-0.471456
1	2020-01-02	0.566669	-0.032985	0.690556	0.734600
2	2020-01-03	2.052083	-0.021206	0.233416	0.869193
3	2020-01-04	0.817791	-0.171998	0.433515	1.062368
4	2020-01-05	0.007632	1.228028	0.779766	-0.489796
5	2020-01-06	0.172650	0.810259	0.172449	0.764088
6	2020-01-07	0.398666	0.960982	-0.418191	-0.795979
7	2020-01-08	0.632870	0.807580	0.422173	0.542766

Note that this operation does not modify the original DataFrame. Verify this by inspecting it.

df

	Product_A	Product_B	Product_C	Product_D
2020-01-01	-1.793309	-0.397905	1.525082	-0.471456
2020-01-02	0.566669	-0.032985	0.690556	0.734600
2020-01-03	2.052083	-0.021206	0.233416	0.869193
2020-01-04	0.817791	-0.171998	0.433515	1.062368
2020-01-05	0.007632	1.228028	0.779766	-0.489796
2020-01-06	0.172650	0.810259	0.172449	0.764088
2020-01-07	0.398666	0.960982	-0.418191	-0.795979
2020-01-08	0.632870	0.807580	0.422173	0.542766

If you want to modify the original DataFrame, use the input inplace=True.

df.reset_index(inplace=True)

df

	index	Product_A	Product_B	Product_C	Product_D
0	2020-01-01	-1.793309	-0.397905	1.525082	-0.471456
1	2020-01-02	0.566669	-0.032985	0.690556	0.734600
2	2020-01-03	2.052083	-0.021206	0.233416	0.869193
3	2020-01-04	0.817791	-0.171998	0.433515	1.062368
4	2020-01-05	0.007632	1.228028	0.779766	-0.489796
5	2020-01-06	0.172650	0.810259	0.172449	0.764088
6	2020-01-07	0.398666	0.960982	-0.418191	-0.795979
7	2020-01-08	0.632870	0.807580	0.422173	0.542766

Selecting and Manipulating Data Frames

Contents

2. Selecting and Manipulating Data Frames#

Getting ready#

How to do it#