1. Creating Series and Data Frames
pandas provides two main classes for handling data:
Series: a one-dimensional labeled array that can hold data of any type, such as integers, strings, or Python objects.
DataFrame: a two-dimensional labeled data structure that holds data like a two-dimensional array or a table with rows and columns.
In this recipe, you will learn how to create objects of both types.
How to do it
Import both the pandas and numpy libraries.
import pandas as pd
import numpy as np
Create an array of a given shape and populate it with random numbers drawn from a uniform distribution over the interval \([0, 1)\). To do this, use the rand function from the numpy.random module.
random_numbers = np.random.rand(5)
random_numbers
array([0.47946782, 0.94900802, 0.8003903 , 0.66087312, 0.5021559 ])
type(random_numbers)
numpy.ndarray
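Because rand draws new values on every call, the numbers printed above will differ from run to run. The following is a minimal sketch of a reproducible alternative, assuming NumPy's Generator interface is acceptable for your use case. The seed value 42 and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

# A seeded generator: the same seed always yields the same sequence.
rng = np.random.default_rng(seed=42)
reproducible_numbers = rng.random(5)  # uniform samples over [0, 1)

# A second generator with the same seed reproduces the values exactly.
rng_again = np.random.default_rng(seed=42)
same_numbers = rng_again.random(5)

print(reproducible_numbers)
print(np.array_equal(reproducible_numbers, same_numbers))
```

Seeding is mainly useful when you want a notebook or test to give the same result each time it runs.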
Series
Create a pandas Series object from a numpy.ndarray of random numbers. If no index is given, pandas assigns the default integer labels 0, 1, 2, and so on.
series = pd.Series(random_numbers)
series
0 0.479468
1 0.949008
2 0.800390
3 0.660873
4 0.502156
dtype: float64
type(series)
pandas.core.series.Series
Create a pandas Series object from a numpy.ndarray, specifying the index. The index must have the same length as the data.
series = pd.Series(random_numbers, index=["a", "b", "c", "d", "e"])
series
a 0.479468
b 0.949008
c 0.800390
d 0.660873
e 0.502156
dtype: float64
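Once a Series has labels, its values can be read either by label or by integer position. The sketch below illustrates both, using fixed values instead of random ones so the results are predictable. The name labeled and its contents are hypothetical.

```python
import pandas as pd

# Fixed values (rather than random ones) so the results are predictable.
labeled = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

by_label = labeled.loc["b"]        # look up by index label
by_position = labeled.iloc[1]      # look up by integer position
subset = labeled.loc[["a", "c"]]   # a list of labels returns a Series

print(by_label, by_position)
print(subset)
```

Using .loc and .iloc explicitly avoids ambiguity about whether a key is meant as a label or as a position.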
Create a pandas Series from a dictionary. In this case, the keys of the dictionary become the index labels, and a None value is stored as NaN (a missing value).
d = {"b": 1, "a": 0, "c": 2, "d": None}
series = pd.Series(d)
series
b 1.0
a 0.0
c 2.0
d NaN
dtype: float64
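The output above shows float64 values even though the dictionary holds integers, because the None entry is stored as NaN, which is a floating-point value. The following sketch inspects the result and shows one possible way to get integers back, assuming it is acceptable to drop the missing entry.

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2, "d": None}
series = pd.Series(d)

# The index follows the insertion order of the dictionary keys.
print(list(series.index))

# isna() flags the entry that came from None.
print(series.isna())

# Dropping the missing entry lets the remaining values be cast to integers.
integers_only = series.dropna().astype("int64")
print(integers_only)
```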
DataFrames
From Dictionaries
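Create a DataFrame from a dictionary containing pandas Series objects. The resulting index is the union of the indexes of the individual Series, and entries with no matching label are filled with NaN.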
d = {
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
df
| | one | two |
---|---|---|
a | 1.0 | 1.0 |
b | 2.0 | 2.0 |
c | 3.0 | 3.0 |
d | NaN | 4.0 |
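The NaN in row d appears because the Series stored under "one" has no label d. The sketch below checks this alignment and also passes an explicit index to the constructor, which selects and reorders the rows. The name df_subset is hypothetical.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

print(list(df.index))           # union of both indexes
print(df["one"].isna().sum())   # number of missing values in column "one"

# Passing index= keeps only the requested labels, in the requested order.
df_subset = pd.DataFrame(d, index=["d", "b", "a"])
print(df_subset)
```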
Create a DataFrame from a dictionary containing iterable elements such as lists, tuples, or numpy.ndarrays. All of the iterables must have the same length.
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
df = pd.DataFrame(d)
df
| | one | two |
---|---|---|
0 | 1.0 | 4.0 |
1 | 2.0 | 3.0 |
2 | 3.0 | 2.0 |
3 | 4.0 | 1.0 |
d = {"one": (1.0, 2.0, 3.0, 4.0), "two": (4.0, 3.0, 2.0, 1.0)}
df = pd.DataFrame(d)
df
| | one | two |
---|---|---|
0 | 1.0 | 4.0 |
1 | 2.0 | 3.0 |
2 | 3.0 | 2.0 |
3 | 4.0 | 1.0 |
random_numbers1 = np.random.rand(5)
random_numbers2 = np.random.rand(5)
d = {"one": random_numbers1, "two": random_numbers2}
df = pd.DataFrame(d)
df
| | one | two |
---|---|---|
0 | 0.902797 | 0.056748 |
1 | 0.203754 | 0.555891 |
2 | 0.884239 | 0.304098 |
3 | 0.991691 | 0.891247 |
4 | 0.223379 | 0.588790 |
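The examples above all use columns of equal length. The following sketch shows what happens when the lengths differ, together with one possible workaround (wrapping each column in a Series so that pandas aligns on the index). The dictionary mismatched is hypothetical.

```python
import pandas as pd

# Hypothetical input whose columns have different lengths.
mismatched = {"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]}

try:
    pd.DataFrame(mismatched)
    error_raised = False
except ValueError as exc:
    error_raised = True
    print(f"ValueError: {exc}")

# Workaround: Series objects are aligned on their index, so the shorter
# column is padded with NaN instead of raising an error.
aligned = pd.DataFrame(
    {name: pd.Series(values) for name, values in mismatched.items()}
)
print(aligned)
```

Whether padding with NaN is appropriate depends on the data; in many cases a length mismatch signals an upstream error that should be fixed instead.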
From numpy arrays
Create a DataFrame from a two-dimensional numpy.ndarray. Because no labels are given, both the rows and the columns receive default integer labels.
data = np.random.randn(3, 4)
df = pd.DataFrame(data)
df
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | -0.739118 | 0.807602 | -2.480828 | 0.392597 |
1 | 0.261429 | -0.643314 | -0.381101 | 0.898246 |
2 | 0.868136 | -0.146715 | 0.550467 | -1.919859 |
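NumPy also offers structured arrays, in which every element is a record with named fields. When such an array is passed to the DataFrame constructor, the field names become the column names. The sketch below illustrates this with hypothetical records; the field names A, B, and C and their dtypes are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical records: each one holds an integer, a float, and a short string.
records = np.array(
    [(1, 2.0, "Hello"), (2, 3.0, "World")],
    dtype=[("A", "i4"), ("B", "f4"), ("C", "U10")],
)

# The field names of the structured array are used as column names.
df_records = pd.DataFrame(records)
print(df_records)
print(df_records.dtypes)
```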
There is more
The following additional steps show how to label the columns and rows of a DataFrame.
Specify the column names by passing the columns argument.
data = np.random.randn(10, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
df
| | A | B | C | D |
---|---|---|---|---|
0 | 0.476239 | 0.371352 | 0.568178 | 0.007894 |
1 | -0.277712 | -0.190171 | -1.526762 | -0.973880 |
2 | -0.630408 | 0.736540 | -1.075738 | 0.126177 |
3 | -1.749325 | 0.616581 | 1.408302 | 1.288096 |
4 | 0.782224 | -1.192311 | 0.527716 | -1.211001 |
5 | -0.647764 | -0.724451 | -1.019370 | -0.287565 |
6 | 0.639737 | 1.042487 | 0.730149 | -0.479395 |
7 | -0.731619 | -0.717644 | -0.697512 | 1.361714 |
8 | 2.158295 | -2.013561 | -0.891208 | -1.019687 |
9 | -1.035943 | -0.607005 | -0.487272 | 0.722944 |
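Named columns can then be selected by label. The following sketch uses a small array of fixed values (an assumption made so the results are predictable) to show that a single label returns a Series, while a list of labels returns a DataFrame.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so the output is predictable.
data = np.arange(12.0).reshape(3, 4)
df = pd.DataFrame(data, columns=["A", "B", "C", "D"])

col_a = df["A"]            # single label -> Series
cols_ac = df[["A", "C"]]   # list of labels -> DataFrame

print(type(col_a).__name__, type(cols_ac).__name__)
print(cols_ac)
```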
Specify the index of the DataFrame by passing the index argument.
data = np.random.randn(10, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'], index=[
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'i'])
df
| | A | B | C | D |
---|---|---|---|---|
a | 1.650168 | 0.893275 | -0.623417 | -1.449416 |
b | 0.567526 | -1.315214 | -1.026773 | -1.069463 |
c | -0.669923 | 1.250267 | 0.208683 | -0.108237 |
d | 0.416075 | -0.105983 | 0.628503 | -0.898255 |
e | -0.019420 | -0.616960 | -0.071081 | 1.074706 |
f | 0.136327 | 0.041658 | 1.506550 | 0.809719 |
g | 1.179197 | 0.150831 | 0.601135 | 1.176382 |
h | 1.944653 | -0.580342 | 0.336385 | -1.179848 |
j | 0.060012 | -0.097694 | 0.851866 | 0.126996 |
i | -0.897467 | -1.518786 | 2.172422 | -1.962632 |
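With row labels in place, rows and individual cells can be read by label with .loc or by position with .iloc. The sketch below uses a small hypothetical DataFrame with fixed values to show both.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so the output is predictable.
data = np.arange(12.0).reshape(3, 4)
df = pd.DataFrame(data, columns=["A", "B", "C", "D"], index=["a", "b", "c"])

row_b = df.loc["b"]          # one row by label -> Series indexed by column name
same_row = df.iloc[1]        # the same row by integer position
cell = df.loc["c", "D"]      # a single value by row and column label
rows_ab = df.loc["a":"b"]    # label slices include both endpoints

print(row_b)
print(cell)
print(rows_ab)
```

Note that, unlike ordinary Python slices, a label slice with .loc includes the end label.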
Create a range of dates to use as the index of the DataFrame by calling the date_range function. When no frequency is given, the dates are one calendar day apart.
dates = pd.date_range(start='1/1/2024', periods=10)
data = np.random.randn(10, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'], index=dates)
df
| | A | B | C | D |
---|---|---|---|---|
2024-01-01 | 0.081631 | 0.605092 | -0.577445 | 1.360987 |
2024-01-02 | 0.643335 | -0.418880 | 2.053832 | 0.185818 |
2024-01-03 | -0.868394 | 0.243802 | -1.390107 | -0.009417 |
2024-01-04 | -1.539327 | -0.288292 | -1.631790 | 1.616059 |
2024-01-05 | -0.064223 | -1.641774 | -0.567148 | 0.066072 |
2024-01-06 | 0.434569 | 0.020560 | -0.606185 | -2.128939 |
2024-01-07 | 1.228884 | 2.001144 | 0.066804 | 1.220431 |
2024-01-08 | 0.573809 | 0.445986 | -0.918571 | 0.033251 |
2024-01-09 | 1.113194 | 0.246670 | 0.084873 | 0.661690 |
2024-01-10 | -0.564881 | 0.206114 | 1.104971 | 0.011947 |
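A date index makes it possible to select rows by date strings. The following sketch, which assumes ISO-formatted dates and fixed values for predictability, selects a three-day window and also shows the freq argument of date_range for a different spacing. The names window and every_other_day are hypothetical.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2024-01-01", periods=10)  # daily by default
data = np.arange(40.0).reshape(10, 4)  # fixed values instead of random ones
df = pd.DataFrame(data, columns=["A", "B", "C", "D"], index=dates)

# Label slices on a date index include both the start and the end date.
window = df.loc["2024-01-03":"2024-01-05"]
print(window)

# freq controls the spacing between dates; "2D" means every second day.
every_other_day = pd.date_range(start="2024-01-01", periods=3, freq="2D")
print(every_other_day)
```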