Naive Bayes for mixed-type data in scikit-learn
If we look at the Naive Bayes (NB) implementations in scikit-learn, we find quite a variety of them. To name a few …
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Categorical Naive Bayes
Each implementation is designed to fit a particular type of data or distribution. Gaussian NB assumes the features are independent and normally distributed; Multinomial and Categorical NB make analogous assumptions for count and categorical data. In other words, the distribution of your features decides which implementation to choose.
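To make the distinction concrete, here is a minimal sketch (the toy arrays are made up for illustration) of the kind of input each of these estimators expects:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, CategoricalNB

y = np.array([0, 0, 1, 1])

# GaussianNB: real-valued features, assumed normally distributed per class
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [5.8, 2.7]])
GaussianNB().fit(X_cont, y)

# MultinomialNB: non-negative counts (e.g. word counts from a bag-of-words model)
X_counts = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 2], [0, 2, 3]])
MultinomialNB().fit(X_counts, y)

# CategoricalNB: integer-encoded categories (each column is one categorical feature)
X_cat = np.array([[0, 1], [0, 0], [1, 2], [1, 1]])
CategoricalNB().fit(X_cat, y)
```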
But in real-world ML problems our datasets may contain a mixture of distributions: continuous, categorical, multinomial and Bernoulli.
In that case, how can we use NB? Is there an implementation for it in scikit-learn?
Ans: No!
But can we do it ourselves?
Ans: Obviously, yes!
Should I implement it from scratch?
Ans: No! We are going to make use of the existing scikit-learn implementations.
OK, let's quickly recap the fundamentals of NB.
NB assumes the features in the dataset are independent of each other. Using this independence property, we multiply the conditional probability of each feature given the label with the prior probability of the label to get the prediction probability of that label.
Let's take Multinomial NB as an example. Here:
- P(W) — prior probability of a word (W = w¹, w², …)
- P(L) — prior probability of a label (L = l¹, l², …)
- P(w^k | l^n) — conditional probability of finding word w^k given the label l^n (where k = 1 … total number of words, n = 1 … total number of labels)
Let's say we have a word sequence
“w1 w2 w3 w4”
The probability of it belonging to label L1 is calculated as
Π (conditional probability of each word in the sequence given label L1) * prior of label L1
Note: Π — product over the words
P(L1) * P(w1 | L1) * P(w2 | L1) * P(w3 | L1) * P(w4 | L1)
We multiply the probabilities because we assume the occurrence of one word is independent of the others; they are treated as independent events occurring together.
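In log space this product becomes a sum, which is both numerically safer and exactly what scikit-learn computes internally: MultinomialNB exposes the fitted log priors (class_log_prior_) and per-word log likelihoods (feature_log_prob_). A minimal sketch with a made-up count matrix to check the decomposition:

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import MultinomialNB

# Toy bag-of-words counts: 4 documents over a 4-word vocabulary (w1..w4)
X = np.array([[2, 1, 0, 1],
              [1, 2, 1, 0],
              [0, 1, 2, 2],
              [1, 0, 2, 1]])
y = np.array(["L1", "L1", "L2", "L2"])

clf = MultinomialNB().fit(X, y)

# New sequence "w1 w2 w3 w4": each word appears once
x_new = np.array([[1, 1, 1, 1]])

# log P(L) + sum over words of count * log P(w | L)
manual = clf.class_log_prior_ + x_new @ clf.feature_log_prob_.T

# Normalising across labels reproduces scikit-learn's predict_log_proba
print(manual - logsumexp(manual, axis=1, keepdims=True))
print(clf.predict_log_proba(x_new))
```

Working in log space also avoids numerical underflow when many small probabilities are multiplied together.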
Similarly, for our mixed-data challenge, where the data contains continuous and categorical values together, we follow these steps (a code sketch follows the list):
1. Assume the features are independent of each other.
2. Fit Gaussian NB on the continuous features and get the probability / log probability of this continuous part.
3. Fit Categorical NB on the categorical features and get the probability / log probability of this categorical part.
4. Work with the log probabilities from steps (2) and (3); call them Log_Proba_Cont and Log_Proba_Cat.
5. Both Log_Proba_Cont and Log_Proba_Cat already include the prior probability of the label, so we subtract the log prior once while adding the two.
6. i.e. the final log probability for a sample is
Log_Proba_Cont + Log_Proba_Cat − log P(L)
7. This is equivalent to proba(cont) * proba(cat) / prior(Label); taking logs in step (6) turns the multiplication and division into addition and subtraction.
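Here is a minimal sketch of these steps, assuming the continuous and categorical columns are already separated and the categorical ones are integer-encoded. The helper mixed_nb_log_proba and the toy data are our own for illustration, not a scikit-learn API:

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import GaussianNB, CategoricalNB

def mixed_nb_log_proba(gnb, cnb, X_cont, X_cat):
    """Combine a fitted GaussianNB and CategoricalNB (hypothetical helper).

    Adds the two log posteriors, subtracts the log prior that was counted
    twice, and re-normalises so each row sums to 1 in probability space.
    """
    log_prior = np.log(gnb.class_prior_)             # log P(L), shared by both models
    log_proba_cont = gnb.predict_log_proba(X_cont)   # Log_Proba_Cont
    log_proba_cat = cnb.predict_log_proba(X_cat)     # Log_Proba_Cat

    # Step 6: Log_Proba_Cont + Log_Proba_Cat - log P(L)
    joint = log_proba_cont + log_proba_cat - log_prior

    # Step 7 in log space: normalise so the probabilities sum to 1 per row
    return joint - logsumexp(joint, axis=1, keepdims=True)

# Toy data: two continuous and two integer-encoded categorical features
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
X_cont = rng.randn(200, 2) + y[:, None]     # continuous part, shifted per class
X_cat = rng.randint(0, 3, size=(200, 2))    # categorical part

gnb = GaussianNB().fit(X_cont, y)
cnb = CategoricalNB().fit(X_cat, y)

log_proba = mixed_nb_log_proba(gnb, cnb, X_cont, X_cat)
y_pred = gnb.classes_[np.argmax(log_proba, axis=1)]
```

Note that both estimators must be fitted on the same labels so their classes line up, and that predict_log_proba already includes the prior for each model, which is why the log prior is subtracted exactly once.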
Let's try to implement this on the Titanic dataset from Kaggle.
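To give a flavour of how this might look, here is a hypothetical sketch; the column split (Age and Fare as continuous; Pclass, Sex and Embarked as categorical), the train.csv path and the reuse of the mixed_nb_log_proba helper above are all assumptions, not the final notebook:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column split for the Kaggle Titanic training file
df = pd.read_csv("train.csv")
cont_cols = ["Age", "Fare"]
cat_cols = ["Pclass", "Sex", "Embarked"]

df = df.dropna(subset=cont_cols + cat_cols + ["Survived"])
y = df["Survived"].to_numpy()

X_cont = df[cont_cols].to_numpy()
X_cat = OrdinalEncoder().fit_transform(df[cat_cols]).astype(int)  # integer-encode categories

gnb = GaussianNB().fit(X_cont, y)
cnb = CategoricalNB().fit(X_cat, y)

log_proba = mixed_nb_log_proba(gnb, cnb, X_cont, X_cat)  # helper sketched above
y_pred = gnb.classes_[np.argmax(log_proba, axis=1)]
print("train accuracy:", (y_pred == y).mean())
```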
Note: There is an ongoing PR in the scikit-learn repo for the aforementioned flow.
Reference:
https://stackoverflow.com/a/14255284