2. Selecting and Manipulating Data Frames#

Getting ready#

  1. Import both pandas and numpy libraries

import pandas as pd
import numpy as np
  1. Create a range of dates using date_range. We will use this as the index of our DataFrame.

dates = pd.date_range(start='1/1/2020', periods=8)
  1. Create a numpy of size (8, 4) with random numbers. We will use these numbers as the content of our DataFrame

data = np.random.randn(8, 4)
  1. Create a DataFrame with index and contex given by dates and data, respectively. We will specify the names of the columns via the input columns.

df = pd.DataFrame(data, index=dates, columns=['Product_A', 'Product_B', 'Product_C', 'Product_D'])
df
Product_A Product_B Product_C Product_D
2020-01-01 -1.793309 -0.397905 1.525082 -0.471456
2020-01-02 0.566669 -0.032985 0.690556 0.734600
2020-01-03 2.052083 -0.021206 0.233416 0.869193
2020-01-04 0.817791 -0.171998 0.433515 1.062368
2020-01-05 0.007632 1.228028 0.779766 -0.489796
2020-01-06 0.172650 0.810259 0.172449 0.764088
2020-01-07 0.398666 0.960982 -0.418191 -0.795979
2020-01-08 0.632870 0.807580 0.422173 0.542766

How to do it#

  1. Select a column of the DataFrame by simply specifying its name as follows

df['Product_A']
2020-01-01   -1.793309
2020-01-02    0.566669
2020-01-03    2.052083
2020-01-04    0.817791
2020-01-05    0.007632
2020-01-06    0.172650
2020-01-07    0.398666
2020-01-08    0.632870
Freq: D, Name: Product_A, dtype: float64
df.Product_A
2020-01-01   -1.793309
2020-01-02    0.566669
2020-01-03    2.052083
2020-01-04    0.817791
2020-01-05    0.007632
2020-01-06    0.172650
2020-01-07    0.398666
2020-01-08    0.632870
Freq: D, Name: Product_A, dtype: float64
  1. Select a column by using the method loc

df.loc[:,'Product_A']
2020-01-01   -1.793309
2020-01-02    0.566669
2020-01-03    2.052083
2020-01-04    0.817791
2020-01-05    0.007632
2020-01-06    0.172650
2020-01-07    0.398666
2020-01-08    0.632870
Freq: D, Name: Product_A, dtype: float64
  1. Select several columns using the method loc

df.loc[:,['Product_A', 'Product_B']]
Product_A Product_B
2020-01-01 -1.793309 -0.397905
2020-01-02 0.566669 -0.032985
2020-01-03 2.052083 -0.021206
2020-01-04 0.817791 -0.171998
2020-01-05 0.007632 1.228028
2020-01-06 0.172650 0.810259
2020-01-07 0.398666 0.960982
2020-01-08 0.632870 0.807580
  1. Select a set of rows and columns using the method loc

df.loc[dates[5],['Product_A', 'Product_B']]
Product_A    0.172650
Product_B    0.810259
Name: 2020-01-06 00:00:00, dtype: float64
df.loc[dates[5] :,'Product_B': ]
Product_B Product_C Product_D
2020-01-06 0.810259 0.172449 0.764088
2020-01-07 0.960982 -0.418191 -0.795979
2020-01-08 0.807580 0.422173 0.542766
  1. Select a column by using .iloc. In this case, we have to pass the position of the column

df.iloc[:,1]
2020-01-01   -0.397905
2020-01-02   -0.032985
2020-01-03   -0.021206
2020-01-04   -0.171998
2020-01-05    1.228028
2020-01-06    0.810259
2020-01-07    0.960982
2020-01-08    0.807580
Freq: D, Name: Product_B, dtype: float64
  1. Select several columns by using the method iloc

df.iloc[:,1:]
Product_B Product_C Product_D
2020-01-01 -0.397905 1.525082 -0.471456
2020-01-02 -0.032985 0.690556 0.734600
2020-01-03 -0.021206 0.233416 0.869193
2020-01-04 -0.171998 0.433515 1.062368
2020-01-05 1.228028 0.779766 -0.489796
2020-01-06 0.810259 0.172449 0.764088
2020-01-07 0.960982 -0.418191 -0.795979
2020-01-08 0.807580 0.422173 0.542766
  1. Select a row by using the method iloc

df.iloc[0,:]
Product_A   -1.793309
Product_B   -0.397905
Product_C    1.525082
Product_D   -0.471456
Name: 2020-01-01 00:00:00, dtype: float64
  1. Reset the index of the DataFrame

df.reset_index()
index Product_A Product_B Product_C Product_D
0 2020-01-01 -1.793309 -0.397905 1.525082 -0.471456
1 2020-01-02 0.566669 -0.032985 0.690556 0.734600
2 2020-01-03 2.052083 -0.021206 0.233416 0.869193
3 2020-01-04 0.817791 -0.171998 0.433515 1.062368
4 2020-01-05 0.007632 1.228028 0.779766 -0.489796
5 2020-01-06 0.172650 0.810259 0.172449 0.764088
6 2020-01-07 0.398666 0.960982 -0.418191 -0.795979
7 2020-01-08 0.632870 0.807580 0.422173 0.542766

Note that this operation does not modify the original DataFrame. Verify this by inspecting it.

df
Product_A Product_B Product_C Product_D
2020-01-01 -1.793309 -0.397905 1.525082 -0.471456
2020-01-02 0.566669 -0.032985 0.690556 0.734600
2020-01-03 2.052083 -0.021206 0.233416 0.869193
2020-01-04 0.817791 -0.171998 0.433515 1.062368
2020-01-05 0.007632 1.228028 0.779766 -0.489796
2020-01-06 0.172650 0.810259 0.172449 0.764088
2020-01-07 0.398666 0.960982 -0.418191 -0.795979
2020-01-08 0.632870 0.807580 0.422173 0.542766

If you want to modify the original DataFrame, use the input inplace=True.

df.reset_index(inplace=True)
df
index Product_A Product_B Product_C Product_D
0 2020-01-01 -1.793309 -0.397905 1.525082 -0.471456
1 2020-01-02 0.566669 -0.032985 0.690556 0.734600
2 2020-01-03 2.052083 -0.021206 0.233416 0.869193
3 2020-01-04 0.817791 -0.171998 0.433515 1.062368
4 2020-01-05 0.007632 1.228028 0.779766 -0.489796
5 2020-01-06 0.172650 0.810259 0.172449 0.764088
6 2020-01-07 0.398666 0.960982 -0.418191 -0.795979
7 2020-01-08 0.632870 0.807580 0.422173 0.542766