3. Creating Columns as Functions of Existing Columns#
Getting ready#
Import pandas
and numpy
and create a DataFrame
to work with
import pandas as pd
import numpy as np
dates = pd.date_range('1/1/2020', periods=8)
data = np.random.randn(8, 3)
df = pd.DataFrame(data,
index=dates, columns=['A', 'B', 'C'])
df.head()
A | B | C | |
---|---|---|---|
2020-01-01 | 1.261216 | -0.344784 | -0.735607 |
2020-01-02 | 1.549176 | 0.340301 | -0.567190 |
2020-01-03 | 0.001756 | -0.091296 | -0.954124 |
2020-01-04 | -0.733245 | 0.337312 | 1.344126 |
2020-01-05 | 0.186424 | -0.191783 | -0.144798 |
How to do it#
Using vectorized operations#
Create a new column as a function of an existing column
df['Squared_A'] = df['A']**2
df['CosA'] = np.cos(df['A'])
Create a new column as a function of two or more existing columns using vectorized operation such as product and sum
df['Sum_AB'] = df['A'] + df['B']
df['Prod_AB'] = df['A'] * df['B']
df
A | B | C | Squared_A | CosA | Sum_AB | Prod_AB | |
---|---|---|---|---|---|---|---|
2020-01-01 | 1.261216 | -0.344784 | -0.735607 | 1.590666 | 0.304659 | 0.916432 | -0.434847 |
2020-01-02 | 1.549176 | 0.340301 | -0.567190 | 2.399947 | 0.021618 | 1.889478 | 0.527187 |
2020-01-03 | 0.001756 | -0.091296 | -0.954124 | 0.000003 | 0.999998 | -0.089539 | -0.000160 |
2020-01-04 | -0.733245 | 0.337312 | 1.344126 | 0.537648 | 0.743006 | -0.395933 | -0.247332 |
2020-01-05 | 0.186424 | -0.191783 | -0.144798 | 0.034754 | 0.982673 | -0.005359 | -0.035753 |
2020-01-06 | -0.087005 | -1.612554 | 0.012796 | 0.007570 | 0.996217 | -1.699559 | 0.140300 |
2020-01-07 | -1.309805 | 0.364629 | -0.506853 | 1.715590 | 0.258038 | -0.945176 | -0.477593 |
2020-01-08 | -1.160022 | 0.063692 | -0.551563 | 1.345651 | 0.399320 | -1.096330 | -0.073884 |
Using apply
#
Create a new column evaluating an ad-hoc function on existing columns. To do this we need to
Define the function that we want to evaluate
Use the method
apply
combined with alambda
expression
from math import sin, cos
def my_function(x, y, z):
if x + y >3:
return sin(x) + sin(y) + sin(z)
else:
return cos(x) + cos(y) + cos(z)
df['function_ABC'] = df.apply(lambda x: my_function(x['A'], x['B'], x['C']), axis=1)
df
A | B | C | Squared_A | CosA | Sum_AB | Prod_AB | function_ABC | |
---|---|---|---|---|---|---|---|---|
2020-01-01 | 1.261216 | -0.344784 | -0.735607 | 1.590666 | 0.304659 | 0.916432 | -0.434847 | 1.987231 |
2020-01-02 | 1.549176 | 0.340301 | -0.567190 | 2.399947 | 0.021618 | 1.889478 | 0.527187 | 1.807687 |
2020-01-03 | 0.001756 | -0.091296 | -0.954124 | 0.000003 | 0.999998 | -0.089539 | -0.000160 | 2.574158 |
2020-01-04 | -0.733245 | 0.337312 | 1.344126 | 0.537648 | 0.743006 | -0.395933 | -0.247332 | 1.911389 |
2020-01-05 | 0.186424 | -0.191783 | -0.144798 | 0.034754 | 0.982673 | -0.005359 | -0.035753 | 2.953874 |
2020-01-06 | -0.087005 | -1.612554 | 0.012796 | 0.007570 | 0.996217 | -1.699559 | 0.140300 | 1.954390 |
2020-01-07 | -1.309805 | 0.364629 | -0.506853 | 1.715590 | 0.258038 | -0.945176 | -0.477593 | 2.066571 |
2020-01-08 | -1.160022 | 0.063692 | -0.551563 | 1.345651 | 0.399320 | -1.096330 | -0.073884 | 2.248999 |
Create a column which maps the values of a column to a set of colors depending on some boundaries. For example, we are going to map column
C
as follows:
‘black’ for values < -3
‘yellow’ for values in [-3, 3]
‘green’ for values >3
def color_function(x):
if x < -3:
return 'black'
elif x > 3:
return 'green'
else:
return 'yellow'
df['color'] = df.apply(lambda x: color_function(x['C']), axis=1)
df
A | B | C | Squared_A | CosA | Sum_AB | Prod_AB | function_ABC | color | |
---|---|---|---|---|---|---|---|---|---|
2020-01-01 | 1.261216 | -0.344784 | -0.735607 | 1.590666 | 0.304659 | 0.916432 | -0.434847 | 1.987231 | yellow |
2020-01-02 | 1.549176 | 0.340301 | -0.567190 | 2.399947 | 0.021618 | 1.889478 | 0.527187 | 1.807687 | yellow |
2020-01-03 | 0.001756 | -0.091296 | -0.954124 | 0.000003 | 0.999998 | -0.089539 | -0.000160 | 2.574158 | yellow |
2020-01-04 | -0.733245 | 0.337312 | 1.344126 | 0.537648 | 0.743006 | -0.395933 | -0.247332 | 1.911389 | yellow |
2020-01-05 | 0.186424 | -0.191783 | -0.144798 | 0.034754 | 0.982673 | -0.005359 | -0.035753 | 2.953874 | yellow |
2020-01-06 | -0.087005 | -1.612554 | 0.012796 | 0.007570 | 0.996217 | -1.699559 | 0.140300 | 1.954390 | yellow |
2020-01-07 | -1.309805 | 0.364629 | -0.506853 | 1.715590 | 0.258038 | -0.945176 | -0.477593 | 2.066571 | yellow |
2020-01-08 | -1.160022 | 0.063692 | -0.551563 | 1.345651 | 0.399320 | -1.096330 | -0.073884 | 2.248999 | yellow |
Using conditions#
Create a column using a condition. For example, checking of the value of a column is positive or negative.
df['Flag'] = df['A'] >= 0
df
A | B | C | Squared_A | CosA | Sum_AB | Prod_AB | function_ABC | color | Flag | |
---|---|---|---|---|---|---|---|---|---|---|
2020-01-01 | 1.261216 | -0.344784 | -0.735607 | 1.590666 | 0.304659 | 0.916432 | -0.434847 | 1.987231 | yellow | True |
2020-01-02 | 1.549176 | 0.340301 | -0.567190 | 2.399947 | 0.021618 | 1.889478 | 0.527187 | 1.807687 | yellow | True |
2020-01-03 | 0.001756 | -0.091296 | -0.954124 | 0.000003 | 0.999998 | -0.089539 | -0.000160 | 2.574158 | yellow | True |
2020-01-04 | -0.733245 | 0.337312 | 1.344126 | 0.537648 | 0.743006 | -0.395933 | -0.247332 | 1.911389 | yellow | False |
2020-01-05 | 0.186424 | -0.191783 | -0.144798 | 0.034754 | 0.982673 | -0.005359 | -0.035753 | 2.953874 | yellow | True |
2020-01-06 | -0.087005 | -1.612554 | 0.012796 | 0.007570 | 0.996217 | -1.699559 | 0.140300 | 1.954390 | yellow | False |
2020-01-07 | -1.309805 | 0.364629 | -0.506853 | 1.715590 | 0.258038 | -0.945176 | -0.477593 | 2.066571 | yellow | False |
2020-01-08 | -1.160022 | 0.063692 | -0.551563 | 1.345651 | 0.399320 | -1.096330 | -0.073884 | 2.248999 | yellow | False |
Using map
#
Map a column with boolean values to color names in string format. For example, map
True
to color ‘green’ andFalse
to color ‘red’.
df['Flag_to_Colors'] = df['Flag'].map({True:'green', False:'red'})
df
A | B | C | Squared_A | CosA | Sum_AB | Prod_AB | function_ABC | color | Flag | Flag_to_Colors | |
---|---|---|---|---|---|---|---|---|---|---|---|
2020-01-01 | 1.261216 | -0.344784 | -0.735607 | 1.590666 | 0.304659 | 0.916432 | -0.434847 | 1.987231 | yellow | True | green |
2020-01-02 | 1.549176 | 0.340301 | -0.567190 | 2.399947 | 0.021618 | 1.889478 | 0.527187 | 1.807687 | yellow | True | green |
2020-01-03 | 0.001756 | -0.091296 | -0.954124 | 0.000003 | 0.999998 | -0.089539 | -0.000160 | 2.574158 | yellow | True | green |
2020-01-04 | -0.733245 | 0.337312 | 1.344126 | 0.537648 | 0.743006 | -0.395933 | -0.247332 | 1.911389 | yellow | False | red |
2020-01-05 | 0.186424 | -0.191783 | -0.144798 | 0.034754 | 0.982673 | -0.005359 | -0.035753 | 2.953874 | yellow | True | green |
2020-01-06 | -0.087005 | -1.612554 | 0.012796 | 0.007570 | 0.996217 | -1.699559 | 0.140300 | 1.954390 | yellow | False | red |
2020-01-07 | -1.309805 | 0.364629 | -0.506853 | 1.715590 | 0.258038 | -0.945176 | -0.477593 | 2.066571 | yellow | False | red |
2020-01-08 | -1.160022 | 0.063692 | -0.551563 | 1.345651 | 0.399320 | -1.096330 | -0.073884 | 2.248999 | yellow | False | red |