2. Selecting and Manipulating Data Frames#
Getting ready#
Import both
pandas
andnumpy
libraries
import pandas as pd
import numpy as np
Create a range of dates using
date_range
. We will use this as the index of ourDataFrame
.
dates = pd.date_range(start='1/1/2020', periods=8)
Create a
numpy
of size (8, 4) with random numbers. We will use these numbers as the content of ourDataFrame
data = np.random.randn(8, 4)
Create a
DataFrame
with index and contex given bydates
anddata
, respectively. We will specify the names of the columns via the inputcolumns
.
df = pd.DataFrame(data, index=dates, columns=['Product_A', 'Product_B', 'Product_C', 'Product_D'])
df
Product_A | Product_B | Product_C | Product_D | |
---|---|---|---|---|
2020-01-01 | -1.793309 | -0.397905 | 1.525082 | -0.471456 |
2020-01-02 | 0.566669 | -0.032985 | 0.690556 | 0.734600 |
2020-01-03 | 2.052083 | -0.021206 | 0.233416 | 0.869193 |
2020-01-04 | 0.817791 | -0.171998 | 0.433515 | 1.062368 |
2020-01-05 | 0.007632 | 1.228028 | 0.779766 | -0.489796 |
2020-01-06 | 0.172650 | 0.810259 | 0.172449 | 0.764088 |
2020-01-07 | 0.398666 | 0.960982 | -0.418191 | -0.795979 |
2020-01-08 | 0.632870 | 0.807580 | 0.422173 | 0.542766 |
How to do it#
Select a column of the
DataFrame
by simply specifying its name as follows
df['Product_A']
2020-01-01 -1.793309
2020-01-02 0.566669
2020-01-03 2.052083
2020-01-04 0.817791
2020-01-05 0.007632
2020-01-06 0.172650
2020-01-07 0.398666
2020-01-08 0.632870
Freq: D, Name: Product_A, dtype: float64
df.Product_A
2020-01-01 -1.793309
2020-01-02 0.566669
2020-01-03 2.052083
2020-01-04 0.817791
2020-01-05 0.007632
2020-01-06 0.172650
2020-01-07 0.398666
2020-01-08 0.632870
Freq: D, Name: Product_A, dtype: float64
Select a column by using the method
loc
df.loc[:,'Product_A']
2020-01-01 -1.793309
2020-01-02 0.566669
2020-01-03 2.052083
2020-01-04 0.817791
2020-01-05 0.007632
2020-01-06 0.172650
2020-01-07 0.398666
2020-01-08 0.632870
Freq: D, Name: Product_A, dtype: float64
Select several columns using the method
loc
df.loc[:,['Product_A', 'Product_B']]
Product_A | Product_B | |
---|---|---|
2020-01-01 | -1.793309 | -0.397905 |
2020-01-02 | 0.566669 | -0.032985 |
2020-01-03 | 2.052083 | -0.021206 |
2020-01-04 | 0.817791 | -0.171998 |
2020-01-05 | 0.007632 | 1.228028 |
2020-01-06 | 0.172650 | 0.810259 |
2020-01-07 | 0.398666 | 0.960982 |
2020-01-08 | 0.632870 | 0.807580 |
Select a set of rows and columns using the method
loc
df.loc[dates[5],['Product_A', 'Product_B']]
Product_A 0.172650
Product_B 0.810259
Name: 2020-01-06 00:00:00, dtype: float64
df.loc[dates[5] :,'Product_B': ]
Product_B | Product_C | Product_D | |
---|---|---|---|
2020-01-06 | 0.810259 | 0.172449 | 0.764088 |
2020-01-07 | 0.960982 | -0.418191 | -0.795979 |
2020-01-08 | 0.807580 | 0.422173 | 0.542766 |
Select a column by using
.iloc
. In this case, we have to pass the position of the column
df.iloc[:,1]
2020-01-01 -0.397905
2020-01-02 -0.032985
2020-01-03 -0.021206
2020-01-04 -0.171998
2020-01-05 1.228028
2020-01-06 0.810259
2020-01-07 0.960982
2020-01-08 0.807580
Freq: D, Name: Product_B, dtype: float64
Select several columns by using the method
iloc
df.iloc[:,1:]
Product_B | Product_C | Product_D | |
---|---|---|---|
2020-01-01 | -0.397905 | 1.525082 | -0.471456 |
2020-01-02 | -0.032985 | 0.690556 | 0.734600 |
2020-01-03 | -0.021206 | 0.233416 | 0.869193 |
2020-01-04 | -0.171998 | 0.433515 | 1.062368 |
2020-01-05 | 1.228028 | 0.779766 | -0.489796 |
2020-01-06 | 0.810259 | 0.172449 | 0.764088 |
2020-01-07 | 0.960982 | -0.418191 | -0.795979 |
2020-01-08 | 0.807580 | 0.422173 | 0.542766 |
Select a row by using the method
iloc
df.iloc[0,:]
Product_A -1.793309
Product_B -0.397905
Product_C 1.525082
Product_D -0.471456
Name: 2020-01-01 00:00:00, dtype: float64
Reset the index of the
DataFrame
df.reset_index()
index | Product_A | Product_B | Product_C | Product_D | |
---|---|---|---|---|---|
0 | 2020-01-01 | -1.793309 | -0.397905 | 1.525082 | -0.471456 |
1 | 2020-01-02 | 0.566669 | -0.032985 | 0.690556 | 0.734600 |
2 | 2020-01-03 | 2.052083 | -0.021206 | 0.233416 | 0.869193 |
3 | 2020-01-04 | 0.817791 | -0.171998 | 0.433515 | 1.062368 |
4 | 2020-01-05 | 0.007632 | 1.228028 | 0.779766 | -0.489796 |
5 | 2020-01-06 | 0.172650 | 0.810259 | 0.172449 | 0.764088 |
6 | 2020-01-07 | 0.398666 | 0.960982 | -0.418191 | -0.795979 |
7 | 2020-01-08 | 0.632870 | 0.807580 | 0.422173 | 0.542766 |
Note that this operation does not modify the original DataFrame
.
Verify this by inspecting it.
df
Product_A | Product_B | Product_C | Product_D | |
---|---|---|---|---|
2020-01-01 | -1.793309 | -0.397905 | 1.525082 | -0.471456 |
2020-01-02 | 0.566669 | -0.032985 | 0.690556 | 0.734600 |
2020-01-03 | 2.052083 | -0.021206 | 0.233416 | 0.869193 |
2020-01-04 | 0.817791 | -0.171998 | 0.433515 | 1.062368 |
2020-01-05 | 0.007632 | 1.228028 | 0.779766 | -0.489796 |
2020-01-06 | 0.172650 | 0.810259 | 0.172449 | 0.764088 |
2020-01-07 | 0.398666 | 0.960982 | -0.418191 | -0.795979 |
2020-01-08 | 0.632870 | 0.807580 | 0.422173 | 0.542766 |
If you want to modify the original DataFrame
, use the input inplace=True
.
df.reset_index(inplace=True)
df
index | Product_A | Product_B | Product_C | Product_D | |
---|---|---|---|---|---|
0 | 2020-01-01 | -1.793309 | -0.397905 | 1.525082 | -0.471456 |
1 | 2020-01-02 | 0.566669 | -0.032985 | 0.690556 | 0.734600 |
2 | 2020-01-03 | 2.052083 | -0.021206 | 0.233416 | 0.869193 |
3 | 2020-01-04 | 0.817791 | -0.171998 | 0.433515 | 1.062368 |
4 | 2020-01-05 | 0.007632 | 1.228028 | 0.779766 | -0.489796 |
5 | 2020-01-06 | 0.172650 | 0.810259 | 0.172449 | 0.764088 |
6 | 2020-01-07 | 0.398666 | 0.960982 | -0.418191 | -0.795979 |
7 | 2020-01-08 | 0.632870 | 0.807580 | 0.422173 | 0.542766 |