Math & data science enthusiast. I write on topics that I have researched and explored for my own learning.

# Naive Bayes for mixed-type data in scikit-learn

If we look at the Naive Bayes (NB) implementations in scikit-learn, we find quite a variety of them. To name a few:

• Gaussian Naive Bayes
• Multinomial Naive Bayes
• Categorical Naive Bayes

Each implementation is designed for a particular type of data or distribution. Gaussian NB assumes that the features are conditionally independent given the class and normally distributed; Multinomial NB and Categorical NB make the same independence assumption for count features and categorical features respectively. The distribution of our data is therefore what decides which implementation we choose.

But in real-world ML problems our datasets may contain a mixture of feature types: continuous, categorical, multinomial and Bernoulli.
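One possible way to handle such a mix, shown here only as a minimal sketch and not necessarily the approach this post goes on to describe, is to fit one NB model per feature type and combine their log-probabilities. Because both models already include the class prior, it is subtracted once before renormalising. The helper names `fit_mixed_nb` and `predict_proba_mixed` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import CategoricalNB, GaussianNB


def fit_mixed_nb(X_cont, X_cat, y):
    """Fit one NB model per feature type (hypothetical helper)."""
    gnb = GaussianNB().fit(X_cont, y)      # continuous columns
    cnb = CategoricalNB().fit(X_cat, y)    # integer-encoded categorical columns
    return gnb, cnb


def predict_proba_mixed(gnb, cnb, X_cont, X_cat):
    """Combine both models under the naive independence assumption."""
    # Each predict_log_proba is log P(c | x_part) = log P(c) + log P(x_part | c) + const,
    # so the prior appears twice in the sum and is removed once.
    joint = (
        gnb.predict_log_proba(X_cont)
        + cnb.predict_log_proba(X_cat)
        - cnb.class_log_prior_
    )
    joint -= logsumexp(joint, axis=1, keepdims=True)  # renormalise per row
    return np.exp(joint)


# Synthetic demo data: two continuous features and one noisy categorical feature.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X_cont = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(100, 2))
X_cat = np.where(rng.random((100, 1)) < 0.8, y[:, None], 1 - y[:, None])

gnb, cnb = fit_mixed_nb(X_cont, X_cat, y)
proba = predict_proba_mixed(gnb, cnb, X_cont, X_cat)
pred = gnb.classes_[proba.argmax(axis=1)]
```

The categorical columns must be non-negative integer codes for `CategoricalNB`, so string categories would need an encoding step first.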

# Killing an ant with a sword #1

I am going to play and practice with Keras and neural networks in this series of posts. We will make a model emulate some commonly used real-world operations.

Here I try to make the network learn a simple addition operation. With very limited data, the model was able to produce a decent result.

Addition is a linear operation, which is a cakewalk for the model to learn.
We will use positive as well as negative numbers for training.

```python
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import…
```
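To see why addition is so easy to learn, it helps to strip the network down to a single linear unit. The sketch below is not the Keras model from the post; it is a plain NumPy stand-in (one weight per input plus a bias, trained by gradient descent on mean squared error), and the data range, learning rate and epoch count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Small training set with positive and negative numbers.
X = rng.uniform(-10, 10, size=(200, 2))
y = X.sum(axis=1)

# A single linear unit: prediction = X @ w + b.
w = np.zeros(2)
b = 0.0
learning_rate = 0.01

for _ in range(500):
    error = X @ w + b - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * X.T @ error / len(X)
    grad_b = 2 * error.mean()
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

prediction = np.array([3.0, -5.0]) @ w + b
```

After training, `w` ends up close to `[1, 1]` and `b` close to `0`, which is exactly the addition function, so the unit also generalises to inputs it never saw.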

# Notes on Sensitivity, Specificity, Precision, Recall and F1 score

It is common to read blood reports that display results such as positive or negative for tests of certain health conditions.

Say the report states that the patient has tested negative for dengue.

What does “tested negative” even mean? Does it mean you have dengue or you don’t?

It’s quite normal to be confused if we are not sure what the test is for. If the medical test checks for the presence of dengue and you tested negative, it means you don’t have it.

Most of these tests are classifications, especially binary classifications. Where the…
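For reference, the metrics named in the title can all be computed from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `classification_metrics` and the counts in the example are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the usual binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # also called recall: share of sick people the test catches
    specificity = tn / (tn + fp)   # share of healthy people the test clears
    precision = tp / (tp + fp)     # share of positive results that are truly positive
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "recall": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical dengue test on 100 people: 50 infected, 50 healthy.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

With these counts the test catches 80% of infected people (sensitivity) and clears 90% of healthy people (specificity).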

# Methods to reduce runtime memory while using Pandas, PyArrow and scikit-learn in machine learning

## Configure an arbitrary limit to the number of records used

While using PyArrow to convert Parquet files to data frames, we may be deceived by the size of the Parquet file itself. 3 million rows of data may take less than 400 MB on disk, yet the same data occupies up to 5 GB of memory once converted to a data frame. The memory is released only when the process completes.

In a multi-processing environment this adds to the pressure on memory, which keeps growing as we keep processing data and finally leads to high memory usage on the app server. …

# The problem:

In the world of clustering, k-means is a popular algorithm in which records are grouped (clustered) according to their Euclidean distance from the ‘k’ centroids. But our data points do not always contain only numeric values. There will be categorical values, drawn from lists that are sometimes exhaustive and sometimes not. The general approach is to encode those categorical values and then run k-means or another distance-based or density-based clustering algorithm.

Intuitively, we can’t map a categorical value to a number and cluster on that number. …
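A tiny example makes the intuition concrete. In the sketch below, three colours are label-encoded as 0, 1 and 2, and the resulting numeric distance is compared with a simple matching dissimilarity (0 if equal, 1 otherwise). The colour values and function names are made up for illustration.

```python
colors = ["red", "green", "blue"]

# Label encoding: an arbitrary number per category.
label_code = {color: float(index) for index, color in enumerate(colors)}


def label_distance(a, b):
    """Distance between two categories after label encoding (1-D Euclidean)."""
    return abs(label_code[a] - label_code[b])


def matching_distance(a, b):
    """Simple matching dissimilarity: categories are either the same or different."""
    return 0.0 if a == b else 1.0


pairs = [("red", "green"), ("green", "blue"), ("red", "blue")]
label_d = {pair: label_distance(*pair) for pair in pairs}
match_d = {pair: matching_distance(*pair) for pair in pairs}
```

Label encoding makes red look twice as far from blue as from green, an ordering that exists only because of the arbitrary codes, whereas the matching dissimilarity treats every pair of different colours as equally far apart.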

# Unpacking Machine learning for beginners

This is part of a series of articles intended for people who want to know what exactly happens behind the scenes in the world of machine learning and artificial intelligence.

Making a machine learn is more or less like teaching a 3-year-old kid.