{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Selecting and Manipulating Data Frames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting ready\n", "\n", "1. Import both `pandas` and `numpy` libraries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Create a range of dates using `date_range`. We will use this as the index of our `DataFrame`.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "dates = pd.date_range(start='1/1/2020', periods=8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Create a `numpy` of size (8, 4) with random numbers. We will use these numbers as the content of our `DataFrame`" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data = np.random.randn(8, 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Create a `DataFrame` with index and contex given by `dates` and `data`, respectively. We will specify the names of the columns via the input `columns`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product_AProduct_BProduct_CProduct_D
2020-01-010.8816630.9876200.515382-0.244335
2020-01-02-0.367481-1.0716150.2167731.318869
2020-01-031.1742021.252871-0.5450090.077535
2020-01-04-1.007827-1.0939991.472493-0.493387
2020-01-05-0.9725150.119187-0.5638920.389504
2020-01-062.9050710.168398-0.679490-0.571970
2020-01-07-0.0702950.2694920.1077100.998263
2020-01-080.6575871.262432-0.2846170.465781
\n", "
" ], "text/plain": [ " Product_A Product_B Product_C Product_D\n", "2020-01-01 0.881663 0.987620 0.515382 -0.244335\n", "2020-01-02 -0.367481 -1.071615 0.216773 1.318869\n", "2020-01-03 1.174202 1.252871 -0.545009 0.077535\n", "2020-01-04 -1.007827 -1.093999 1.472493 -0.493387\n", "2020-01-05 -0.972515 0.119187 -0.563892 0.389504\n", "2020-01-06 2.905071 0.168398 -0.679490 -0.571970\n", "2020-01-07 -0.070295 0.269492 0.107710 0.998263\n", "2020-01-08 0.657587 1.262432 -0.284617 0.465781" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(data, index=dates, columns=['Product_A', 'Product_B', 'Product_C', 'Product_D'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to do it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Select a column of the `DataFrame` by simply specifying its name as follows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2020-01-01 0.881663\n", "2020-01-02 -0.367481\n", "2020-01-03 1.174202\n", "2020-01-04 -1.007827\n", "2020-01-05 -0.972515\n", "2020-01-06 2.905071\n", "2020-01-07 -0.070295\n", "2020-01-08 0.657587\n", "Freq: D, Name: Product_A, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Product_A']" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2020-01-01 0.881663\n", "2020-01-02 -0.367481\n", "2020-01-03 1.174202\n", "2020-01-04 -1.007827\n", "2020-01-05 -0.972515\n", "2020-01-06 2.905071\n", "2020-01-07 -0.070295\n", "2020-01-08 0.657587\n", "Freq: D, Name: Product_A, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Product_A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Select a column by using the method `loc`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2020-01-01 0.881663\n", "2020-01-02 -0.367481\n", "2020-01-03 1.174202\n", "2020-01-04 -1.007827\n", "2020-01-05 -0.972515\n", "2020-01-06 2.905071\n", "2020-01-07 -0.070295\n", "2020-01-08 0.657587\n", "Freq: D, Name: Product_A, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:,'Product_A']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Select several columns using the method `loc`" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product_AProduct_B
2020-01-010.8816630.987620
2020-01-02-0.367481-1.071615
2020-01-031.1742021.252871
2020-01-04-1.007827-1.093999
2020-01-05-0.9725150.119187
2020-01-062.9050710.168398
2020-01-07-0.0702950.269492
2020-01-080.6575871.262432
\n", "
" ], "text/plain": [ " Product_A Product_B\n", "2020-01-01 0.881663 0.987620\n", "2020-01-02 -0.367481 -1.071615\n", "2020-01-03 1.174202 1.252871\n", "2020-01-04 -1.007827 -1.093999\n", "2020-01-05 -0.972515 0.119187\n", "2020-01-06 2.905071 0.168398\n", "2020-01-07 -0.070295 0.269492\n", "2020-01-08 0.657587 1.262432" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:,['Product_A', 'Product_B']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Select a set of rows and columns using the method `loc`" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Product_A 2.905071\n", "Product_B 0.168398\n", "Name: 2020-01-06 00:00:00, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[dates[5],['Product_A', 'Product_B']]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product_BProduct_CProduct_D
2020-01-060.168398-0.679490-0.571970
2020-01-070.2694920.1077100.998263
2020-01-081.262432-0.2846170.465781
\n", "
" ], "text/plain": [ " Product_B Product_C Product_D\n", "2020-01-06 0.168398 -0.679490 -0.571970\n", "2020-01-07 0.269492 0.107710 0.998263\n", "2020-01-08 1.262432 -0.284617 0.465781" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[dates[5] :,'Product_B': ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Select a column by using `.iloc`. In this case, we have to pass the position of the column" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2020-01-01 0.987620\n", "2020-01-02 -1.071615\n", "2020-01-03 1.252871\n", "2020-01-04 -1.093999\n", "2020-01-05 0.119187\n", "2020-01-06 0.168398\n", "2020-01-07 0.269492\n", "2020-01-08 1.262432\n", "Freq: D, Name: Product_B, dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[:,1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Select several columns by using the method `iloc`" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product_BProduct_CProduct_D
2020-01-010.9876200.515382-0.244335
2020-01-02-1.0716150.2167731.318869
2020-01-031.252871-0.5450090.077535
2020-01-04-1.0939991.472493-0.493387
2020-01-050.119187-0.5638920.389504
2020-01-060.168398-0.679490-0.571970
2020-01-070.2694920.1077100.998263
2020-01-081.262432-0.2846170.465781
\n", "
" ], "text/plain": [ " Product_B Product_C Product_D\n", "2020-01-01 0.987620 0.515382 -0.244335\n", "2020-01-02 -1.071615 0.216773 1.318869\n", "2020-01-03 1.252871 -0.545009 0.077535\n", "2020-01-04 -1.093999 1.472493 -0.493387\n", "2020-01-05 0.119187 -0.563892 0.389504\n", "2020-01-06 0.168398 -0.679490 -0.571970\n", "2020-01-07 0.269492 0.107710 0.998263\n", "2020-01-08 1.262432 -0.284617 0.465781" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[:,1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Select a row by using the method `iloc`" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Product_A 0.881663\n", "Product_B 0.987620\n", "Product_C 0.515382\n", "Product_D -0.244335\n", "Name: 2020-01-01 00:00:00, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[0,:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "8. Reset the index of the `DataFrame`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexProduct_AProduct_BProduct_CProduct_D
02020-01-010.8816630.9876200.515382-0.244335
12020-01-02-0.367481-1.0716150.2167731.318869
22020-01-031.1742021.252871-0.5450090.077535
32020-01-04-1.007827-1.0939991.472493-0.493387
42020-01-05-0.9725150.119187-0.5638920.389504
52020-01-062.9050710.168398-0.679490-0.571970
62020-01-07-0.0702950.2694920.1077100.998263
72020-01-080.6575871.262432-0.2846170.465781
\n", "
" ], "text/plain": [ " index Product_A Product_B Product_C Product_D\n", "0 2020-01-01 0.881663 0.987620 0.515382 -0.244335\n", "1 2020-01-02 -0.367481 -1.071615 0.216773 1.318869\n", "2 2020-01-03 1.174202 1.252871 -0.545009 0.077535\n", "3 2020-01-04 -1.007827 -1.093999 1.472493 -0.493387\n", "4 2020-01-05 -0.972515 0.119187 -0.563892 0.389504\n", "5 2020-01-06 2.905071 0.168398 -0.679490 -0.571970\n", "6 2020-01-07 -0.070295 0.269492 0.107710 0.998263\n", "7 2020-01-08 0.657587 1.262432 -0.284617 0.465781" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this operation does not modify the original `DataFrame`. \n", "Verify this by inspecting it." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Product_AProduct_BProduct_CProduct_D
2020-01-010.8816630.9876200.515382-0.244335
2020-01-02-0.367481-1.0716150.2167731.318869
2020-01-031.1742021.252871-0.5450090.077535
2020-01-04-1.007827-1.0939991.472493-0.493387
2020-01-05-0.9725150.119187-0.5638920.389504
2020-01-062.9050710.168398-0.679490-0.571970
2020-01-07-0.0702950.2694920.1077100.998263
2020-01-080.6575871.262432-0.2846170.465781
\n", "
" ], "text/plain": [ " Product_A Product_B Product_C Product_D\n", "2020-01-01 0.881663 0.987620 0.515382 -0.244335\n", "2020-01-02 -0.367481 -1.071615 0.216773 1.318869\n", "2020-01-03 1.174202 1.252871 -0.545009 0.077535\n", "2020-01-04 -1.007827 -1.093999 1.472493 -0.493387\n", "2020-01-05 -0.972515 0.119187 -0.563892 0.389504\n", "2020-01-06 2.905071 0.168398 -0.679490 -0.571970\n", "2020-01-07 -0.070295 0.269492 0.107710 0.998263\n", "2020-01-08 0.657587 1.262432 -0.284617 0.465781" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to modify the original `DataFrame`, use the input `inplace=True`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexProduct_AProduct_BProduct_CProduct_D
02020-01-010.8816630.9876200.515382-0.244335
12020-01-02-0.367481-1.0716150.2167731.318869
22020-01-031.1742021.252871-0.5450090.077535
32020-01-04-1.007827-1.0939991.472493-0.493387
42020-01-05-0.9725150.119187-0.5638920.389504
52020-01-062.9050710.168398-0.679490-0.571970
62020-01-07-0.0702950.2694920.1077100.998263
72020-01-080.6575871.262432-0.2846170.465781
\n", "
" ], "text/plain": [ " index Product_A Product_B Product_C Product_D\n", "0 2020-01-01 0.881663 0.987620 0.515382 -0.244335\n", "1 2020-01-02 -0.367481 -1.071615 0.216773 1.318869\n", "2 2020-01-03 1.174202 1.252871 -0.545009 0.077535\n", "3 2020-01-04 -1.007827 -1.093999 1.472493 -0.493387\n", "4 2020-01-05 -0.972515 0.119187 -0.563892 0.389504\n", "5 2020-01-06 2.905071 0.168398 -0.679490 -0.571970\n", "6 2020-01-07 -0.070295 0.269492 0.107710 0.998263\n", "7 2020-01-08 0.657587 1.262432 -0.284617 0.465781" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }