Methods to reduce runtime memory while using Pandas, PyArrow and scikit-learn in machine learning



Configure an arbitrary limit to the number of records used

While using PyArrow to convert parquet files to data frames, we may be deceived by the size of the parquet file on disk: 3 million rows of data may take less than 400 MB as a file, yet occupy up to 5 GB of processing memory once converted to a data frame. That memory is released only when the process completes.

In a multi-processing environment this adds to the pressure on processing memory, which keeps growing as we process more data and eventually leads to high memory usage on the app server. Hence it is always better to configure a limit on the number of records fetched.
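As a rough sketch of the idea (the file name, batch size and row cap below are illustrative, not taken from any real pipeline), PyArrow's ParquetFile.iter_batches lets us stop reading once a configured number of records has been reached, instead of materialising the whole file at once:

```python
import pyarrow as pa
import pyarrow.parquet as pq

ROW_LIMIT = 1_000_000            # arbitrary cap on records pulled into memory

parquet_file = pq.ParquetFile("events.parquet")   # illustrative file name

batches = []
rows_read = 0
for batch in parquet_file.iter_batches(batch_size=100_000):
    batches.append(batch)
    rows_read += batch.num_rows
    if rows_read >= ROW_LIMIT:
        break                    # stop before the whole file is in memory

# Materialise only the capped set of record batches as a pandas DataFrame.
df = pa.Table.from_batches(batches).to_pandas()
print(len(df))
```

Reading in batches also means the peak memory is bounded by the batch size plus what has already been accumulated, rather than by the full decompressed file.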

Use category type instead of Object type wherever possible

Features with the "object" datatype consume more memory and occupy scattered fragments of RAM, compared to the "category" datatype. Category columns internally encode the data as integers and maintain a dictionary-like structure to reference the actual values. These integer codes are also stored in contiguous memory locations, thereby reducing processing and lookup time.

Credits: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/

However, care should be taken before converting every object column to category: an object column with more than 50% unique values tends to occupy less memory left as an object type than it would after being converted to category.
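A minimal sketch of that rule of thumb, with made-up column names and a 50% uniqueness threshold, might look like this:

```python
import pandas as pd

# Illustrative data: one low-cardinality column and one almost-unique column.
df = pd.DataFrame({
    "city": ["Chennai", "Mumbai", "Delhi", "Pune"] * 250_000,
    "user_id": [f"u{i}" for i in range(1_000_000)],
})

print(df.memory_usage(deep=True))

for col in df.select_dtypes(include="object").columns:
    unique_ratio = df[col].nunique() / len(df[col])
    if unique_ratio < 0.5:                 # low cardinality -> worth encoding
        df[col] = df[col].astype("category")

print(df.memory_usage(deep=True))          # "city" shrinks, "user_id" stays object
```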

Downcast numeric features:

Memory occupied by features of each type:

int64 > int32 > int16 > int8
float64 > float32 > float16
signed int > unsigned int (unsigned types hold only positive values, so a smaller width is often enough)

Hence, wherever possible, the smallest appropriate datatype should be chosen for each field.
Caution should be taken while downcasting, as values may get trimmed if they exceed the range of the target datatype. We can guard against this by checking the min and max values of a feature before deciding whether to downcast. Domain knowledge about a numeric field's range of values also helps in choosing the datatype.

Scikit-learn's LabelEncoder is time consuming:

Scikit-learn's LabelEncoder appears to use a sorted search while fitting the data. It can be replaced by converting the object column to a categorical variable and then applying pandas' cat.codes, which uses a hashing technique and is quite fast at encoding. As mentioned earlier, converting a variable to a non-object datatype is more performant in some cases.

Credits:
https://www.dataquest.io/blog/pandas-big-data/
https://stackoverflow.com/questions/39475187/how-to-speed-labelencoder-up-recoding-a-categorical-variable-into-integers?answertab=votes#tab-top
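A rough, illustrative timing comparison of the two approaches (synthetic data; exact numbers will vary with library versions and cardinality) could look like this:

```python
import time

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic data: 2 million rows drawn from 1,000 distinct labels.
values = pd.Series(
    np.random.choice([f"label_{i}" for i in range(1_000)], size=2_000_000)
)

start = time.perf_counter()
encoded_sklearn = LabelEncoder().fit_transform(values)
print("LabelEncoder   :", round(time.perf_counter() - start, 3), "s")

start = time.perf_counter()
encoded_pandas = values.astype("category").cat.codes
print("category codes :", round(time.perf_counter() - start, 3), "s")
```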
