
Math & data science enthusiast. I write on topics that I have researched and explored for my own knowledge.

If we look at the Naive Bayes (NB) implementations in scikit-learn, we see quite a variety of them. To name a few …

  • Gaussian Naive Bayes
  • Multinomial Naive Bayes
  • Categorical Naive Bayes

Each implementation is designed for a particular type of data or distribution. Gaussian NB assumes your features are conditionally independent given the class and normally distributed. Multinomial NB and Categorical NB likewise each assume their own distribution, and that distribution is the basis on which we choose an implementation.

But in real-world ML problems, our datasets may contain a mix of feature types: continuous, categorical, multinomial and Bernoulli.


In that case, how can we use NB? Is there an implementation for it in scikit-learn?
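As far as I know, scikit-learn has no single estimator for mixed feature types. One common workaround, sketched below, is to fit one NB model per feature type and combine their outputs. The helper names `fit_mixed_nb` and `predict_mixed_nb` are hypothetical, and the sketch assumes the categorical columns are already encoded as non-negative integers and that both models use the default empirical class priors.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB


def fit_mixed_nb(X_num, X_cat, y):
    """Fit one NB model on the continuous columns and one on the categorical columns."""
    gnb = GaussianNB().fit(X_num, y)
    cnb = CategoricalNB().fit(X_cat, y)
    return gnb, cnb


def predict_mixed_nb(gnb, cnb, X_num, X_cat):
    """Combine both models under the NB independence assumption.

    P(y | x_num, x_cat) is proportional to P(y | x_num) * P(y | x_cat) / P(y),
    so in log space we add the two posteriors and subtract the log prior once.
    """
    log_prior = np.log(gnb.class_prior_)
    joint = gnb.predict_log_proba(X_num) + cnb.predict_log_proba(X_cat) - log_prior
    return gnb.classes_[np.argmax(joint, axis=1)]
```

The same idea extends to a third model (for example `BernoulliNB` on binary columns) by adding its log posterior and subtracting the log prior once more.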

I am going to practice with Keras and neural networks in this series of posts. We will make a model emulate some commonly used real-world operations.

Here I try to make the network learn a simple addition operation. With very limited data, the model was able to produce a decent result.

Addition is a linear operation, which is a cakewalk for the model to learn.
We will use positive as well as negative numbers for training.

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import…
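The Keras code above is cut off, so the following is only a stand-in and not the model from the post. It is a minimal plain-Python sketch of a single linear neuron trained with stochastic gradient descent, to illustrate why addition is easy for a linear model. The function names, sample count, learning rate and value range are all assumptions chosen for illustration.

```python
import random


def train_adder(n_samples=200, epochs=200, lr=0.001, seed=0):
    """Fit y = w1*a + w2*b + bias on (a, b) -> a + b examples using plain SGD."""
    rng = random.Random(seed)
    # Training inputs include both positive and negative numbers.
    data = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(n_samples)]
    w1, w2, bias = rng.uniform(-1, 1), rng.uniform(-1, 1), 0.0
    for _ in range(epochs):
        for a, b in data:
            err = (w1 * a + w2 * b + bias) - (a + b)  # prediction minus target
            w1 -= lr * err * a
            w2 -= lr * err * b
            bias -= lr * err
    return w1, w2, bias


def predict(params, a, b):
    """Apply the learned linear neuron to a new pair of numbers."""
    w1, w2, bias = params
    return w1 * a + w2 * b + bias
```

Because the target is exactly linear, the weights should settle close to 1 and the bias close to 0, so `predict(train_adder(), 3, -5)` should land near -2.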

It is common to read blood reports that display results such as positive or negative for tests of certain health conditions.

Say the report states that the patient has tested negative for dengue.

What does tested negative even mean? Does it mean you have dengue or you don’t?

Such confusion is quite normal if we are not sure what the test is for. If the medical test checks for the presence of dengue and you tested negative, it means you don’t have it.

Most of these tests are classifications, especially binary classifications, where the…


Configure an arbitrary limit to the number of records used

While using PyArrow to convert parquet files to data frames, we may be deceived by the size of the parquet file itself. 3 million rows of data may take less than 400 MB on disk, yet the same data can occupy up to 5 GB of memory once converted to a data frame. The memory is released only when the process completes.

In a multi-processing environment this adds pressure on memory, which keeps growing as we process more data and finally leads to high memory usage on the app server. …

The problem:

In the world of clustering, k-means is a popular algorithm in which records are grouped (clustered) according to their Euclidean distance to the ‘k’ centroids. But our data points do not always contain only numeric values. There will be categorical values, drawn from lists that are sometimes exhaustive and sometimes not. The general approach is to encode those categorical values and run k-means or any other distance-based / density-based clustering algorithm.

Intuitively, we can’t map a categorical value to a number and cluster records based on that number. …
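A small sketch can make the problem concrete. The colour codes below are a hypothetical label encoding, and the mismatch count is just one simple alternative dissimilarity for categorical data, not necessarily the one the full post goes on to use.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mismatch_count(a, b):
    """Number of positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical label encoding: the numbers carry no real meaning.
codes = {"red": 0, "green": 1, "blue": 2}

red_blue = euclidean([codes["red"]], [codes["blue"]])    # 2.0
red_green = euclidean([codes["red"]], [codes["green"]])  # 1.0
```

With the encoded values, red looks twice as far from blue as from green, purely because of the order in which codes were assigned. `mismatch_count(["red"], ["blue"])` and `mismatch_count(["red"], ["green"])` both give 1, which treats every pair of different categories as equally far apart.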

This is part of a series of articles intended for people who want to know what exactly happens behind the scenes in the world of Machine Learning and Artificial Intelligence.

Making a machine learn is more or less like teaching a 3-year-old kid.


In order to teach a kid to remember different fruits, their shapes, and their names, we would show her a variety of fruits and tell her their names. Soon after, to check how much the kid has learned, we show her a few more fruits and ask her to identify them. (In fact…

