7. Handling Large Datasets Efficiently#
Handling large datasets efficiently in Python can be achieved in different ways.
In this recipe we will focus on strategies to optimize performance and memory usage when managing data with pandas.
How to do it#
Import the pandas library as pd
import pandas as pd
Processing Data in Chunks#
Read the data in chunks to avoid loading the entire dataset into memory at once.
chunk_size = 100 # Adjust based on available memory
chunks = pd.read_csv('data/customers-10000.csv', chunksize=chunk_size)
Iterate over the returned chunk iterator and process each chunk separately, so the whole file is never held in memory at once.
for chunk in chunks:
    print("I am processing a chunk")
    # Process each chunk here
I am processing a chunk
I am processing a chunk
I am processing a chunk
...
(repeated for all 100 chunks)
Combine the results from each chunk if you need a single aggregated output, as shown in the sketch below.
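As a minimal sketch of that step, assuming the goal is a per-country row count (the aggregation itself is just an illustrative example), collect each chunk's partial result and combine them at the end:
# Hypothetical example: count customers per country without loading the full file
chunks = pd.read_csv('data/customers-10000.csv', chunksize=chunk_size)
partial_counts = [chunk['Country'].value_counts() for chunk in chunks]
country_counts = pd.concat(partial_counts).groupby(level=0).sum()
print(country_counts.sort_values(ascending=False).head())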
Optimise the Data Types#
Use appropriate data types to reduce memory usage.
Check the data types of your DataFrame with dtypes
df = chunk  # use the last chunk from the loop above as an example DataFrame
df.dtypes
Index int64
Customer Id object
First Name object
Last Name object
Company object
City object
Country object
Phone 1 object
Phone 2 object
Email object
Subscription Date object
Website object
dtype: object
Downcast numeric columns. For example, use float32 instead of float64.
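As a minimal sketch, pd.to_numeric with the downcast argument picks the smallest dtype that can hold the values. Index is the only numeric column in this dataset; the float column below is hypothetical:
# Downcast the int64 Index column to the smallest integer type that fits
df['Index'] = pd.to_numeric(df['Index'], downcast='integer')
# Hypothetical float column: downcast float64 to float32 where the precision loss is acceptable
# df['some_float_col'] = pd.to_numeric(df['some_float_col'], downcast='float')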
Convert object columns to categories.
df['Country'].astype('category')
9900 Puerto Rico
9901 Russian Federation
9902 Tuvalu
9903 Comoros
9904 Argentina
...
9995 Cote d'Ivoire
9996 Namibia
9997 United States Virgin Islands
9998 Niger
9999 Tunisia
Name: Country, Length: 100, dtype: category
Categories (88, object): ['Afghanistan', 'Albania', 'Algeria', 'Angola', ..., 'Venezuela', 'Western Sahara', 'Zambia', 'Zimbabwe']
# cols = ["list of columns that you want to convert"]
# df[cols] = df[cols].astype('category')
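To check that the conversion actually helps, compare the memory footprint of the column before and after; memory_usage(deep=True) reports the bytes used, and the exact numbers depend on your data:
# Memory used by the object column vs. its categorical version (bytes)
print(df['Country'].memory_usage(deep=True))
print(df['Country'].astype('category').memory_usage(deep=True))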
There is more#
Use Efficient File Formats (see the Parquet sketch after this list)
Parquet: A columnar storage file format that is highly efficient for both read and write operations.
Feather: Optimized for speed, especially for data exchange between pandas and R.
HDF5: Suitable for storing large numerical data.
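As an illustrative sketch, writing the customers data to Parquet and reading it back could look like this; it assumes a Parquet engine such as pyarrow or fastparquet is installed, and the output path is just an example name:
# Write the DataFrame to Parquet and read it back (requires pyarrow or fastparquet)
df.to_parquet('data/customers-10000.parquet', index=False)
df_parquet = pd.read_parquet('data/customers-10000.parquet')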
Use Libraries for Distributed Computing (see the Dask sketch after this list)
Spark: Use PySpark for distributed data processing.
Dask: Parallelize computations on a single machine or across a cluster.
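For illustration only, assuming Dask is installed, the same CSV can be processed lazily with dask.dataframe: the file is split into partitions and nothing is computed until .compute() is called.
import dask.dataframe as dd
# Lazily read the CSV into partitions; no data is loaded yet
ddf = dd.read_csv('data/customers-10000.csv')
# Example aggregation across all partitions, triggered by .compute()
counts = ddf.groupby('Country').size().compute()
print(counts.head())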