1. Creating Series and Data Frames#

Pandas provides two classes for handling data:

  • Series: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.

  • DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

In this recipe, you will learn how to create objects of these types.

How to do it#

  1. Import both pandas and numpy libraries.

import pandas as pd
import numpy as np
  2. Create an array of a given shape and populate it with random numbers from a uniform distribution over the interval [0, 1). To do this, use the rand function from the numpy.random module.

random_numbers = np.random.rand(5)
random_numbers
array([0.47946782, 0.94900802, 0.8003903 , 0.66087312, 0.5021559 ])
type(random_numbers)
numpy.ndarray
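
Because np.random.rand draws new values on every call, the numbers shown above will differ from run to run. As a minimal sketch of one way to get reproducible values, assuming NumPy 1.17 or later, a seeded Generator can be used instead. The seed value 42 and the name rng are illustrative choices, not part of the recipe.

```python
import numpy as np

# A seeded generator yields the same sequence on every run.
# The seed (42) is an arbitrary, illustrative choice.
rng = np.random.default_rng(seed=42)

# Generator.random draws from the uniform distribution over [0, 1),
# the same distribution that np.random.rand samples from.
random_numbers = rng.random(5)

print(random_numbers.shape)
print(type(random_numbers))
```

The result is still a numpy.ndarray, so it can be passed to pd.Series in the same way as in the steps that follow.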

Series#

  1. Create a pandas Series object from a numpy.ndarray of random numbers.

series = pd.Series(random_numbers)
series
0    0.479468
1    0.949008
2    0.800390
3    0.660873
4    0.502156
dtype: float64
type(series)
pandas.core.series.Series
  2. Create a pandas Series object from a numpy.ndarray, specifying the index.

series = pd.Series(random_numbers, index=["a", "b", "c", "d", "e"])
series
a    0.479468
b    0.949008
c    0.800390
d    0.660873
e    0.502156
dtype: float64
  3. Create a pandas Series from a dictionary. In this case, the dictionary keys become the index labels, and a None value is stored as NaN.

d = {"b": 1, "a": 0, "c": 2, "d": None}
series = pd.Series(d)
series
b    1.0
a    0.0
c    2.0
d    NaN
dtype: float64
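
A dictionary can also be combined with an explicit index. The sketch below, which uses a hypothetical label "z" that is not a key of the dictionary, suggests how pandas looks up each requested label and fills the missing one with NaN.

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# Values are taken from the dictionary in the order given by `index`.
# "z" is not a key of `d` (illustrative label), so its value is NaN,
# which in turn makes the dtype float64 rather than int64.
series = pd.Series(d, index=["a", "b", "c", "z"])
print(series)
```

Passing index this way is a convenient way to reorder the entries or to select only some of the keys.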

DataFrames#

From Dictionaries#

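  1. Create a DataFrame from a dictionary containing pandas Series objects. The resulting index is the union of the indices of the Series, and entries missing from a Series are filled with NaN.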

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
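
The index and columns arguments can also be passed together with a dictionary of Series. As a rough sketch, with the label selection and column order chosen purely for illustration, this selects and reorders the rows and columns of the result.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# Keep only rows "d", "b", "a" (in that order) and put "two" before "one".
# Row "d" has no entry in the "one" Series, so that cell is NaN.
df = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "one"])
print(df)
```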
  2. Create a DataFrame from a dictionary containing iterable elements such as lists, tuples, or numpy.ndarray objects. All of them must have the same length.

d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
df = pd.DataFrame(d)
df
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
d = {"one": (1.0, 2.0, 3.0, 4.0), "two": (4.0, 3.0, 2.0, 1.0)}
df = pd.DataFrame(d)
df
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
random_numbers1 = np.random.rand(5)
random_numbers2 = np.random.rand(5)
d = {"one": random_numbers1, "two": random_numbers2}
df = pd.DataFrame(d)
df
        one       two
0  0.902797  0.056748
1  0.203754  0.555891
2  0.884239  0.304098
3  0.991691  0.891247
4  0.223379  0.588790
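
The examples above all use iterables of equal length. The following sketch illustrates what can be expected when the lengths differ, and one possible workaround: wrapping each list in a Series so that pandas aligns on the index and pads with NaN. The name padded is illustrative.

```python
import pandas as pd

d = {"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]}

# Lists of different lengths cannot be combined directly.
try:
    df = pd.DataFrame(d)
except ValueError as error:
    df = None
    print(f"Could not build the DataFrame: {error}")

# Wrapping each list in a Series lets pandas align the values on the
# default integer index and fill the missing position with NaN.
padded = pd.DataFrame({key: pd.Series(values) for key, values in d.items()})
print(padded)
```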

From numpy arrays#

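  1. Create a DataFrame from a two-dimensional numpy.ndarray. Here the array is filled with samples from the standard normal distribution using randn. Since no labels are given, the rows and columns are numbered from 0.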

data = np.random.randn(3, 4)
df = pd.DataFrame(data)
df
          0         1         2         3
0 -0.739118  0.807602 -2.480828  0.392597
1  0.261429 -0.643314 -0.381101  0.898246
2  0.868136 -0.146715  0.550467 -1.919859
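
A DataFrame can also be built from a structured numpy.ndarray, in which every field has its own name and dtype. The sketch below uses made-up field names (A, B, C) and values to show that the field names are used as the column names.

```python
import numpy as np
import pandas as pd

# Structured array: each record has an integer, a float, and a string field.
# Field names and values are illustrative.
data = np.array(
    [(1, 2.0, "Hello"), (2, 3.0, "World")],
    dtype=[("A", "i4"), ("B", "f4"), ("C", "U10")],
)

# The field names become the column names of the DataFrame.
df = pd.DataFrame(data)
print(df)
print(df.dtypes)
```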

There is more#

The DataFrame constructor also lets you set the column names and the index explicitly.

  1. Specify the names of the columns by passing the columns argument.

data = np.random.randn(10, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
df
          A         B         C         D
0  0.476239  0.371352  0.568178  0.007894
1 -0.277712 -0.190171 -1.526762 -0.973880
2 -0.630408  0.736540 -1.075738  0.126177
3 -1.749325  0.616581  1.408302  1.288096
4  0.782224 -1.192311  0.527716 -1.211001
5 -0.647764 -0.724451 -1.019370 -0.287565
6  0.639737  1.042487  0.730149 -0.479395
7 -0.731619 -0.717644 -0.697512  1.361714
8  2.158295 -2.013561 -0.891208 -1.019687
9 -1.035943 -0.607005 -0.487272  0.722944
  2. Specify the index of the DataFrame by passing the index argument.

data = np.random.randn(10, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'], index=[
                  'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'i'])
df
          A         B         C         D
a  1.650168  0.893275 -0.623417 -1.449416
b  0.567526 -1.315214 -1.026773 -1.069463
c -0.669923  1.250267  0.208683 -0.108237
d  0.416075 -0.105983  0.628503 -0.898255
e -0.019420 -0.616960 -0.071081  1.074706
f  0.136327  0.041658  1.506550  0.809719
g  1.179197  0.150831  0.601135  1.176382
h  1.944653 -0.580342  0.336385 -1.179848
j  0.060012 -0.097694  0.851866  0.126996
i -0.897467 -1.518786  2.172422 -1.962632
  3. Create a range of dates to use as the index of the DataFrame by calling the pandas function date_range. By default, consecutive dates are one calendar day apart.

dates = pd.date_range(start='1/1/2024', periods=10)
data = np.random.randn(10, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'], index=dates)
df
                   A         B         C         D
2024-01-01  0.081631  0.605092 -0.577445  1.360987
2024-01-02  0.643335 -0.418880  2.053832  0.185818
2024-01-03 -0.868394  0.243802 -1.390107 -0.009417
2024-01-04 -1.539327 -0.288292 -1.631790  1.616059
2024-01-05 -0.064223 -1.641774 -0.567148  0.066072
2024-01-06  0.434569  0.020560 -0.606185 -2.128939
2024-01-07  1.228884  2.001144  0.066804  1.220431
2024-01-08  0.573809  0.445986 -0.918571  0.033251
2024-01-09  1.113194  0.246670  0.084873  0.661690
2024-01-10 -0.564881  0.206114  1.104971  0.011947
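
date_range also accepts a freq argument that controls the spacing between dates. As a small sketch, the example below builds an index of business days; the ISO 8601 start string, the "B" frequency, the seed, and the smaller shape are illustrative choices rather than part of the recipe.

```python
import numpy as np
import pandas as pd

# An ISO 8601 date string avoids any day/month ambiguity.
# freq="B" generates business days (Monday to Friday) only.
dates = pd.date_range(start="2024-01-01", periods=5, freq="B")

# Seeded generator so that the example is reproducible.
rng = np.random.default_rng(seed=0)
data = rng.standard_normal((5, 2))

df = pd.DataFrame(data, columns=["A", "B"], index=dates)
print(df.index)
```

Since 2024-01-01 is a Monday, the five business days run through Friday, 2024-01-05.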