Methods to reduce runtime memory while using Pandas, PyArrow and scikit-learn in machine learning



Configure an arbitrary limit to the number of records used

While using PyArrow to convert parquet files to data frames, we may be deceived by the size of the parquet file on disk: 3 million rows of data may take less than 400 MB as a file, yet occupy up to 5 GB of processing memory once converted to a data frame. That memory is released only when the process completes.

In a multi-processing environment this adds to the pressure on processing memory, which keeps growing as we process more data and eventually leads to high memory usage on the app server. Hence it is always better to configure a limit on the number of records fetched.
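As a rough sketch of the idea (the file name, batch size and row cap below are illustrative, not taken from any real pipeline), PyArrow's ParquetFile.iter_batches lets us stop reading once a configured number of records has been reached, instead of materialising the whole file at once:

```python
import pyarrow as pa
import pyarrow.parquet as pq

ROW_LIMIT = 1_000_000            # arbitrary cap on records pulled into memory

parquet_file = pq.ParquetFile("events.parquet")   # illustrative file name

batches = []
rows_read = 0
for batch in parquet_file.iter_batches(batch_size=100_000):
    batches.append(batch)
    rows_read += batch.num_rows
    if rows_read >= ROW_LIMIT:
        break                    # stop before the whole file is in memory

# Materialise only the capped set of record batches as a pandas DataFrame.
df = pa.Table.from_batches(batches).to_pandas()
print(len(df))
```

Reading in batches also means the peak memory is bounded by the batch size plus what has already been accumulated, rather than by the full decompressed file.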

Use category type instead of Object type wherever possible

Features with the "object" datatype consume more memory and occupy scattered fragments of RAM, compared to the "category" datatype. Category columns internally encode the data as integers and maintain a dictionary-like structure to reference the actual values. These integer codes are also stored in contiguous memory locations, thereby reducing processing and lookup time.

Credits: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/

However, care should be taken before converting every object column to category: an object column with more than 50% unique values tends to occupy less memory left as an object type than it would after being converted to category.
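A minimal sketch of that rule of thumb, with made-up column names and a 50% uniqueness threshold, might look like this:

```python
import pandas as pd

# Illustrative data: one low-cardinality column and one almost-unique column.
df = pd.DataFrame({
    "city": ["Chennai", "Mumbai", "Delhi", "Pune"] * 250_000,
    "user_id": [f"u{i}" for i in range(1_000_000)],
})

print(df.memory_usage(deep=True))

for col in df.select_dtypes(include="object").columns:
    unique_ratio = df[col].nunique() / len(df[col])
    if unique_ratio < 0.5:                 # low cardinality -> worth encoding
        df[col] = df[col].astype("category")

print(df.memory_usage(deep=True))          # "city" shrinks, "user_id" stays object
```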

Downcast numeric features:

Memory occupied by features of each type:

int64 > int32 > int16 > int8
float64 > float32 > float16
signed int > unsigned int (unsigned types hold only positive values, so a smaller width is often enough)

Hence, wherever possible, the smallest appropriate datatype should be chosen for each field.
Caution should be taken while downcasting, as values may get trimmed if they exceed the range of the target datatype. We can guard against this by checking the min and max values of a feature before deciding whether to downcast. Domain knowledge about a numeric field's range of values also helps in choosing the datatype.

Scikit-learn's LabelEncoder is time consuming:

Scikit-learn's LabelEncoder appears to use a sorted search while fitting the data. It can be replaced by converting the object column to a categorical variable and then applying pandas' cat.codes, which uses a hashing technique and is quite fast at encoding. As mentioned earlier, converting a variable to a non-object datatype is more performant in some cases.

Credits:
https://www.dataquest.io/blog/pandas-big-data/
https://stackoverflow.com/questions/39475187/how-to-speed-labelencoder-up-recoding-a-categorical-variable-into-integers?answertab=votes#tab-top
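A rough, illustrative timing comparison of the two approaches (synthetic data; exact numbers will vary with library versions and cardinality) could look like this:

```python
import time

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic data: 2 million rows drawn from 1,000 distinct labels.
values = pd.Series(
    np.random.choice([f"label_{i}" for i in range(1_000)], size=2_000_000)
)

start = time.perf_counter()
encoded_sklearn = LabelEncoder().fit_transform(values)
print("LabelEncoder   :", round(time.perf_counter() - start, 3), "s")

start = time.perf_counter()
encoded_pandas = values.astype("category").cat.codes
print("category codes :", round(time.perf_counter() - start, 3), "s")
```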
