7. Handling Large Datasets Efficiently#

Handling large datasets efficiently in Python can be achieved in different ways. In this recipe we focus on strategies to optimize performance and memory usage when managing data with pandas.

How to do it#

  1. Import the library pandas as pd

import pandas as pd

Processing Data by Chunks#

  1. Read the data in chunks to avoid loading the entire dataset into memory at once.

chunk_size = 100  # Adjust based on available memory
chunks = pd.read_csv('data/customers-10000.csv', chunksize=chunk_size)
  2. Iterate over the chunks and process each one, so that only a single chunk is held in memory at any time.

for chunk in chunks:
    print("I am processing a chunk")
    # Some process for each chunk
I am processing a chunk
I am processing a chunk
I am processing a chunk
...
(the message is printed once for each of the 100 chunks)
  3. Combine the results.
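
For example, each chunk can be filtered or aggregated and the partial results concatenated at the end. The sketch below re-reads the file in chunks and filters on the Country column; the country used in the filter is only an illustration.

results = []
for chunk in pd.read_csv('data/customers-10000.csv', chunksize=chunk_size):
    # Example processing step: keep only the rows for one country
    results.append(chunk[chunk['Country'] == 'Argentina'])

# Concatenate the per-chunk results into a single, much smaller DataFrame
combined = pd.concat(results, ignore_index=True)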

Optimise the Data Types#

Use appropriate data types to reduce memory usage.

  1. Check the data types of your DataFrame with the dtypes attribute.

df = chunk  # the last chunk from the loop above is used as an example DataFrame
df.dtypes
Index                 int64
Customer Id          object
First Name           object
Last Name            object
Company              object
City                 object
Country              object
Phone 1              object
Phone 2              object
Email                object
Subscription Date    object
Website              object
dtype: object
  2. Downcast numeric columns, for example to float32 instead of float64.
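
In this sample file the only numeric column is Index, so a minimal downcasting sketch can use pd.to_numeric with the downcast argument; float columns are handled the same way with downcast='float'.

# Downcast the integer column to the smallest integer subtype that fits
df['Index'] = pd.to_numeric(df['Index'], downcast='integer')

# Hypothetical float column, shown only to illustrate the float case
# df['some_float_column'] = pd.to_numeric(df['some_float_column'], downcast='float')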

  3. Convert object columns with a limited number of distinct values, such as Country, to the category dtype.

df['Country'].astype('category')
9900                     Puerto Rico
9901              Russian Federation
9902                          Tuvalu
9903                         Comoros
9904                       Argentina
                    ...             
9995                   Cote d'Ivoire
9996                         Namibia
9997    United States Virgin Islands
9998                           Niger
9999                         Tunisia
Name: Country, Length: 100, dtype: category
Categories (88, object): ['Afghanistan', 'Albania', 'Algeria', 'Angola', ..., 'Venezuela', 'Western Sahara', 'Zambia', 'Zimbabwe']
# cols = ["list of columns that you want to convert"]
# df[cols] = df[cols].astype('category')
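
A quick way to check whether the conversion pays off is to compare the memory footprint of the column before and after with memory_usage(deep=True):

before = df['Country'].memory_usage(deep=True)
after = df['Country'].astype('category').memory_usage(deep=True)
print(f"object: {before} bytes, category: {after} bytes")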

There is more#

Use Efficient File Formats

  • Parquet: A columnar storage file format that is highly efficient for both read and write operations.

  • Feather: Optimized for speed, especially for data exchange between pandas and R.

  • HDF5: Suitable for storing large numerical data.
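
As a sketch, assuming the optional pyarrow dependency is installed, pandas can write and read these formats directly; the file paths below are illustrative.

# Parquet: columnar and compressed, efficient for both reads and writes
df.to_parquet('data/customers.parquet')
df_parquet = pd.read_parquet('data/customers.parquet')

# Feather: very fast, handy for exchanging data between pandas and R
# (Feather requires a default RangeIndex, hence the reset_index)
df.reset_index(drop=True).to_feather('data/customers.feather')
df_feather = pd.read_feather('data/customers.feather')

# HDF5 needs the optional 'tables' dependency
# df.to_hdf('data/customers.h5', key='customers')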

Use Libraries for Distributed Computing

  • Spark: Use PySpark for distributed data processing.

  • Dask: Parallelize computations on a single machine or across a cluster.
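
As a small illustration, Dask mirrors a large part of the pandas API, so the same CSV can be processed lazily and in parallel; this is a sketch assuming Dask is installed.

import dask.dataframe as dd

# The CSV is split into partitions and read lazily
ddf = dd.read_csv('data/customers-10000.csv')

# Operations build a task graph; compute() triggers the parallel execution
country_counts = ddf['Country'].value_counts().compute()
print(country_counts.head())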