How to Handle Large CSV Files with Pandas?


In this post, we will go through the options for handling large CSV files with pandas. CSV files are a common container for data. If you have a large CSV file that you want to process with pandas efficiently, you have a few options.

Pandas is an in-memory tool

You need to be able to fit your data in memory to use pandas with it. If you can process portions of it at a time, you can read it in chunks and process each chunk. Alternatively, if you know that you should have enough memory to load the file, there are a few tricks to help pare down its size.

Note that in general, you should have three to ten times the amount of memory as the size of the DataFrame that you want to manipulate. Extra memory should give you enough extra space to perform many of the common operations.
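That rule of thumb can be turned into a tiny pre-flight check. This is just a sketch; the 5x multiplier is an arbitrary middle value picked from the three-to-ten range above.

```python
def fits_in_memory(file_size_bytes, free_memory_bytes, multiplier=5):
    """Rule of thumb: you want roughly 3-10x the file size in free memory.
    We use 5x here as a middle-of-the-road default."""
    return file_size_bytes * multiplier <= free_memory_bytes

# A 2 GB CSV with 16 GB of free memory passes the check...
print(fits_in_memory(2 * 1024**3, 16 * 1024**3))
# ...but with only 8 GB free it does not.
print(fits_in_memory(2 * 1024**3, 8 * 1024**3))
```

In practice you would feed it os.path.getsize() of the file and the free-memory figure we obtain from psutil below.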

1. Check your system’s memory with Python

Let’s begin by checking our system’s memory. psutil works on Windows, macOS, and Linux, and can be installed from PyPI with pip.

If you hit a compilation error while installing psutil, try the steps below (on RPM-based Linux distributions):

sudo yum install python3-devel
sudo pip install psutil

Now check the installed version with:

pip freeze | grep psutil

Input

import psutil
memory = psutil.virtual_memory()
print(f" {'*' * 3} Memory used percentage - {memory.percent} \n {'*' * 4} Free Memory available - { round(memory.free / (1024.0 ** 3))} GB")

Output

*** Memory used percentage - 64.4

**** Free Memory available - 6 GB

2. Determining the memory usage of our CSV file

We will now estimate how much memory the whole file takes up. I have used the tmdb_5000_movies dataset from kaggle.com.

Input

import pandas as pd

data = pd.read_csv("tmdb_5000_movies.csv")

# Let's check the memory usage of the DataFrame
print(f" ** Memory usage of the file - {sum(data.memory_usage()) * 0.000001} MB for {len(data.index)} Rows")

** Memory usage of the file - 8.453408 MB for 52833 Rows

The data.memory_usage() method shows the memory usage of each column of our DataFrame, while len(data.index) returns its total number of rows.

We can see that 52833 rows use a little over 8 MB of memory. At that rate, a billion rows would need roughly 160 GB. Trying to fit all of that into memory would grind the machine to a halt, so don’t do it.
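One caveat worth knowing: by default, memory_usage() only counts the 8-byte pointers for object (string) columns, not the strings themselves, so it can understate the real footprint. Passing deep=True counts the actual string payloads. A minimal sketch with a synthetic string column (not the movies file):

```python
import pandas as pd

# A small synthetic frame with an object (string) column
df = pd.DataFrame({"title": ["Avatar", "Spectre", "John Carter"] * 1000})

shallow = df.memory_usage().sum()        # counts only the 8-byte pointers
deep = df.memory_usage(deep=True).sum()  # counts the actual string payloads

print(f"shallow={shallow} bytes, deep={deep} bytes")
```

The deep figure is the one to use when deciding whether a string-heavy file will fit in memory.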

Now, let’s have a look at the limits of the NumPy numeric data types, which we will use later to shrink the file.

Example

import numpy as np

# Limits of Integer Data Type
print(f" ** Output limits of Numpy Integer Data Types ")
print(f" ** limits of Numpy Integer - {np.iinfo(np.int8)}")
print(f" ** limits of Numpy Integer - {np.iinfo(np.int16)}")
print(f" ** limits of Numpy Integer - {np.iinfo(np.int64)}")

# Limits of Float Data Type
print(f" ** Output limits of Numpy Float Data Types ")
print(f" ** limits of Numpy Float - {np.finfo(np.float16)}")
print(f" ** limits of Numpy Float - {np.finfo(np.float64)}")

Output

** Output limits of Numpy Integer Data Types
** limits of Numpy Integer - Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
** limits of Numpy Integer - Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
** limits of Numpy Integer - Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
** Output limits of Numpy Float Data Types
** limits of Numpy Float - Machine parameters for float16
---------------------------------------------------------------
precision = 3 resolution = 1.00040e-03
machep = -10 eps = 9.76562e-04
negep = -11 epsneg = 4.88281e-04
minexp = -14 tiny = 6.10352e-05
maxexp = 16 max = 6.55040e+04
nexp = 5 min = -max
---------------------------------------------------------------
** limits of Numpy Float - Machine parameters for float64
---------------------------------------------------------------
precision = 15 resolution = 1.0000000000000001e-15
machep = -52 eps = 2.2204460492503131e-16
negep = -53 epsneg = 1.1102230246251565e-16
minexp = -1022 tiny = 2.2250738585072014e-308
maxexp = 1024 max = 1.7976931348623157e+308
nexp = 11 min = -max
---------------------------------------------------------------

3. Converting Numeric Data Types

Let’s run the .info() method to validate our data types in depth.

Input

# Let's print the DataFrame information; .info() writes to stdout itself
data.info()

Now, let’s summarize the data types and the count of columns of each type, and see how pandas has categorized our data.

Input

# lets summarize the data types and count of columns
print(f" ** Summarize the data types and count of columns \n{data.dtypes.value_counts()}")

In this section, we will focus on the int64 and float64 columns, study their value ranges and precision, and convert them to smaller types. We will use the dtype parameter to tell pandas to use smaller numeric types instead of the default 64-bit ones; this is why the previous step of understanding the data types first is important.

Input

# Define a dictionary for converting the numeric data types.
# Pick the smallest type whose range still fits the column's values:
# in this dataset the id values exceed the int16 range, and revenue
# (in dollars) exceeds even int32, so int16 would overflow there.
data_types_conversion_numeric = {
   "popularity": np.float16,
   "runtime": np.float16,
   "vote_average": np.float16,
   "id": np.int32,
   "revenue": np.int64,
   "vote_count": np.int16
}
data_02 = pd.read_csv("tmdb_5000_movies.csv", dtype=data_types_conversion_numeric)
print(f" ** Memory usage of the file - {sum(data_02.memory_usage()) * 0.000001} MB for {len(data_02.index)} Rows")

As you can see, switching to smaller numeric types saved us about 23%. You can also use int8 for columns that hold only small values, which may yield further savings.
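If you would rather not pick the types by hand, pd.to_numeric with the downcast parameter can choose the smallest safe type for you, which avoids accidental overflow. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series([0, 127, 32000, 2_000_000_000])

# downcast="integer" picks the smallest integer type that still fits the data
small = pd.to_numeric(s, downcast="integer")
print(small.dtype)  # int32 here, because 2_000_000_000 exceeds the int16 range
```

You can apply this column by column after loading, then feed the resulting dtypes back into read_csv for the next run.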

4. Converting Object Data Types

Object columns store their values as strings, and string values in pandas take up a lot of memory because each value is stored as a separate Python string. If a column turns out to be non-numeric, pandas reads it in as an object column.

Converting object columns to categoricals uses much less memory, because pandas stores each distinct string only once rather than creating a new string for every row.
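A minimal sketch of why this works, using a synthetic low-cardinality column rather than the movies file:

```python
import pandas as pd

# Low-cardinality column: a few distinct languages repeated many times
langs = pd.Series(["en", "fr", "es"] * 10_000)

as_object = langs.memory_usage(deep=True)
as_category = langs.astype("category").memory_usage(deep=True)

print(f"object: {as_object} bytes, category: {as_category} bytes")
```

The category version stores the three strings once plus one small integer code per row, which is why the savings grow with the amount of repetition.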

First, inspect the .value_counts() output of the object columns. If they are low-cardinality, you can convert them to categorical columns to save even more memory.

Input

print(data_02.original_language.value_counts())

The cardinality is low, so I will start converting the object columns to category.

Input

data_types_conversion_numeric = {
   "popularity": np.float16,
   "runtime": np.float16,
   "vote_average": np.float16,
   "id": np.int32,       # id values in this dataset exceed the int16 range
   "revenue": np.int64,  # revenue values exceed even the int32 range
   "vote_count": np.int16,
   "genres": "category",
   "homepage": "category",
   "keywords": "category",
   "original_language": "category",
   "original_title": "category",
   "overview": "category",
   "production_companies": "category",
   "production_countries": "category",
   "release_date": "category",
   "spoken_languages": "category",
   "status": "category",
   "tagline": "category",
   "title": "category"
}
data_02 = pd.read_csv("tmdb_5000_movies.csv", dtype=data_types_conversion_numeric)
print(f" ** Memory usage of the file - {sum(data_02.memory_usage()) * 0.000001} MB for {len(data_02.index)} Rows")

We are now at 46% of the original size. That’s about 54% of memory saved.

5. Identifying and Dropping Duplicates in Pandas

There is a good chance the source file you are handling contains duplicates; dropping them, if they are not required, will save you some more memory. In my case, I repeated records to make the file grow big.

Let’s validate the count of duplicate entries in our source file before dropping them.

Input

print(f" ** File has {len(data_02) - len(data_02.drop_duplicates())} duplicate rows off the total {len(data_02)} ")

len(data_02) gives the total number of rows in the DataFrame, while len(data_02.drop_duplicates()) gives the number of unique rows. As mentioned above, my file has quite a number of duplicates, so dropping them saves a good amount of memory.

Input

data_03 = data_02.drop_duplicates()
print(f" ** Memory usage of the file after dropping duplicates - {sum(data_03.memory_usage()) * 0.000001} MB for {len(data_03.index)} Rows")

Well, that’s a decent saving after deleting the duplicates. If your data contains duplicates you don’t need, drop them at this step.
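Sometimes rows only count as duplicates on certain key columns; for that, drop_duplicates also accepts subset and keep parameters. A small sketch with hypothetical id/title/note columns:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "title": ["Avatar", "Avatar", "Spectre", "Up"],
    "note": ["a", "b", "c", "d"],  # differs, so full-row dedup keeps all 4
})

# Full-row comparison finds no duplicates because "note" differs...
print(len(df.drop_duplicates()))  # 4
# ...but deduplicating on the key columns drops the repeated id/title pair.
print(len(df.drop_duplicates(subset=["id", "title"], keep="first")))  # 3
```

keep="first" retains the first occurrence of each key; keep=False would drop every row that has a duplicate key.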

6. How to remove unwanted columns in Pandas

If there are columns that you know can be ignored, specify the usecols parameter with only the columns you want to load. Here, we will ignore the columns “homepage”, “keywords”, “original_title” and “tagline”.

Input

# prepare a list of columns that you want to load
unwanted_columns = ["homepage", "keywords", "original_title", "tagline"]
# read only the header row (nrows=0) to discover the column names
data_columns = [col for col in pd.read_csv("tmdb_5000_movies.csv", nrows=0).columns if col not in unwanted_columns]

# Define a dictionary converting the numeric data types
data_types_conversion = {
   "popularity": np.float16,
   "runtime": np.float16,
   "vote_average": np.float16,
   "id": np.int32,       # id values in this dataset exceed the int16 range
   "revenue": np.int64,  # revenue values exceed even the int32 range
   "vote_count": np.int16,
   "genres": "category",
   "original_language": "category",
   "overview": "category",
   "production_companies": "category",
   "production_countries": "category",
   "release_date": "category",
   "spoken_languages": "category",
   "status": "category",
   "title": "category"
}
data_02 = pd.read_csv("tmdb_5000_movies.csv", dtype=data_types_conversion, usecols=data_columns)
print(f" ** Memory usage of the file after dropping cols - {sum(data_02.memory_usage()) * 0.000001} MB for {len(data_02.index)} Rows")

We are now at 32% of the original size. That’s about 68% of memory saved.

7. How to process chunks of the data with Pandas

If you can process the data a chunk at a time and do not need all of it in memory at once, you can use the chunksize parameter. Personally, I recommend this as your last option.

# prepare a list of columns that you want to load
unwanted_columns = ["homepage", "keywords", "original_title", "tagline"]
# read only the header row (nrows=0) to discover the column names
data_columns = [col for col in pd.read_csv("tmdb_5000_movies.csv", nrows=0).columns if col not in unwanted_columns]

# Define a dictionary converting the numeric data types
data_types_conversion = {
   "popularity": np.float16,
   "runtime": np.float16,
   "vote_average": np.float16,
   "id": np.int32,       # id values in this dataset exceed the int16 range
   "revenue": np.int64,  # revenue values exceed even the int32 range
   "vote_count": np.int16,
   "genres": "category",
   "original_language": "category",
   "overview": "category",
   "production_companies": "category",
   "production_countries": "category",
   "release_date": "category",
   "spoken_languages": "category",
   "status": "category",
   "title": "category"
}

# chunksize makes read_csv return an iterator of DataFrames
data_02 = pd.read_csv("tmdb_5000_movies.csv", dtype=data_types_conversion, usecols=data_columns, chunksize=10000)

# Process the data frame in chunks
for chunk in data_02:
   print(f" ** Memory usage of the chunk - {sum(chunk.memory_usage()) * 0.000001} MB for {len(chunk.index)} Rows")
   print(f"Do some more processing here... ")

Each chunk now sits at about 14% of the original size, so at any one moment we hold about 86% less in memory.
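A common pattern with chunking is to compute a partial result per chunk and combine the results afterwards. A minimal sketch, using an in-memory CSV to stand in for a large file on disk:

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a large file on disk
csv_data = io.StringIO("vote_count\n10\n20\n30\n40\n50\n")

total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["vote_count"].sum()  # aggregate per chunk, combine after

print(total)  # 150
```

Sums, counts, and min/max combine trivially across chunks; operations like medians or joins need more care, which is why chunking is best kept as a last resort.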

Note - use the .describe() method to keep comparing the results after each step.

raja
Published on 23-Oct-2020 13:50:18