Unpacking Machine Learning for Beginners

Guruprasad
5 min read · Apr 25, 2019

This is part of a series of articles intended for people who want to know what exactly happens behind the scenes in the world of Machine Learning and Artificial Intelligence.

Making a machine learn is more or less like teaching a 3-year-old kid.

Pic courtesy: https://www.webmd.com/

In order to teach a kid to recognize different fruits, their shapes and their names, we would show her a variety of fruits and tell her their names. Soon after, to check how much the kid has learned, we show her a few more fruits and ask her to identify them. (In fact, this check is more of an evaluation of our teaching than of the kid's learning ability ;-)). If the learning is not up to the expected level, we adopt different strategies to help her learn. And this repeats.

Supervised machine learning follows a similar philosophy of learning, i.e., show me what is what and I will learn it as it is.

We are going to look at the different stages of machine learning using a deal-prediction example.

Problem statement:

Clarity about what we are going to make the machine learn is an obvious prerequisite.
e.g., We are going to build a system that predicts whether a salesperson will win a particular deal or not.

Identifying contributing “Features”:

In this step, we should come up with a list of features that we, as humans, think would help predict the output of the problem. Domain knowledge of the problem is key here.
e.g., features like:
* Deal amount
* Person who is handling the deal
* Past purchases of the consumer
* Time taken for the past purchases
* The current Stage of the deal negotiation
* Lead source
* Product/service we are trying to sell
… etc

Here, not all features contribute equally to converting a deal, yet we should still include them in the list. A feature that contributes little on its own may still help when combined with the others.

Defining the target:

Here we should identify what we are going to predict as our output. This is called the target.
e.g., We declare deal won/lost as the target of our machine learning system.

Collecting data:

Whatever the machine predicts depends directly on the quality of the data collected. We should ensure that we collect data for all the contributing features and the target. This can be done in many ways, such as running a survey, using historical data from a database, or any other feasible way.
e.g., We can take the past year's records of all the deals won/lost and their corresponding features as our data.
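As a rough sketch, assuming the historical deals are exported to a CSV file (the file name and column names below are illustrative assumptions, not from the original example), the data could be loaded with pandas:

```python
import pandas as pd

# Hypothetical example: load one year of historical deal records exported
# from the CRM into a CSV file. File name and column names are assumptions.
deals = pd.read_csv("deals_last_year.csv")

# Each row is one deal; columns are the candidate features plus the target.
print(deals.columns.tolist())
# e.g. ['deal_amount', 'sales_person', 'past_purchases', 'purchase_time',
#       'deal_stage', 'lead_source', 'product', 'deal_won']
print(deals.shape)  # (number of deals, number of columns)
```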

Cleaning the data if necessary:

Most real-world data is not outright suitable for machine learning. There will be a lot of junk, duplicate and missing values in the data. We should take a stance on how we treat this messiness: we can drop such records, fill them with some arbitrary value, or fill them using statistical techniques like the mean, mode, etc. The choice is left to the individual, based on the problem at hand.
e.g., We will check for duplicate deals and drop them, and fill any missing values manually.
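A minimal sketch of this cleaning step, continuing the hypothetical pandas setup above (column names are assumptions, and the fill strategies shown are one possible choice, not the only correct one):

```python
import pandas as pd

deals = pd.read_csv("deals_last_year.csv")  # hypothetical file from the previous step

# Drop exact duplicate deal records.
deals = deals.drop_duplicates()

# Fill missing values: a constant for a categorical column,
# the median for the numeric deal amount.
deals["lead_source"] = deals["lead_source"].fillna("unknown")
deals["deal_amount"] = deals["deal_amount"].fillna(deals["deal_amount"].median())

# Inspect what is still missing.
print(deals.isna().sum())
```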

Identifying the important features:

We can shortlist the features using our judgment, but we can't assert how much each of them contributes to solving the problem. So we apply statistical techniques to each feature's data individually as well as collectively.
We also check each feature's relevance to the target variable. We finally arrive at a list of features that actually contribute to the output.

Some statistical techniques employed are univariate analysis, multivariate analysis, correlation, the chi-square test of independence, etc. A small sketch of two of these checks follows the list below.
e.g.,
We apply a few of the above techniques and narrow our feature list down to the five below.

* Deal amount
* Person who is handling the deal
* The current Stage of the deal negotiation
* Lead source
* Product name
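As a rough sketch, assuming the hypothetical column names used earlier and a 0/1-encoded target, a chi-square test of independence and a simple correlation check might look like this:

```python
import pandas as pd
from scipy.stats import chi2_contingency

deals = pd.read_csv("deals_last_year.csv")  # hypothetical cleaned data

# Chi-square test of independence: is 'lead_source' related to 'deal_won'?
contingency = pd.crosstab(deals["lead_source"], deals["deal_won"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"lead_source vs deal_won: chi2={chi2:.2f}, p-value={p_value:.4f}")

# Correlation of a numeric feature with the target
# (assuming deal_won is already encoded as 0/1).
print(deals["deal_amount"].corr(deals["deal_won"].astype(int)))
```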

Identifying a suitable algorithm or devising the logic:

The majority of machine learning problems can be classified as regression or classification. In simple terms, if your target variable takes its values from a fixed, exhaustive list, it is a classification problem; otherwise it is a regression problem.

There are a lot of algorithms under these two categories, and each can be tuned with parameters to suit the problem at hand; this is called hyperparameter tuning. Explaining them is beyond the scope of this article.
e.g.,
Here our target variable is boolean, i.e., the deal is either won or lost, so it is a classification problem. We choose a classification algorithm called random forest.
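For instance, with scikit-learn this choice boils down to instantiating a classifier; the hyperparameter values shown here are illustrative, not tuned:

```python
from sklearn.ensemble import RandomForestClassifier

# The target is binary (deal won / lost), so we pick a classifier.
# n_estimators (number of trees) and max_depth are two of the
# hyperparameters mentioned above; these values are just examples.
model = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
```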

Training → cross-validation → testing

(The place where the machine is expected to learn):
Data for all the important features is fed into the algorithm in a suitable format, and the algorithm is tuned with hyperparameters. This results in the generation of the model (think of it as a function capable of predicting the output). We split the data into three sets: train, cross-validation and test. We cross-validate the trained model for accuracy and retrain it until the accuracy improves. Later we test it with the test data to measure the actual accuracy of the model.

e.g., We feed the five important columns to the random forest, specifying its parameters such as the number of trees, max depth, etc. This results in the deal prediction model, which, after cross-validation and testing, is capable of predicting whether a deal will be won or lost based on the values of those five columns.
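A minimal end-to-end sketch of this stage, again assuming the hypothetical column names from earlier (a 60/20/20 split is one common choice, not a rule):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

deals = pd.read_csv("deals_last_year.csv")  # hypothetical cleaned data
features = ["deal_amount", "sales_person", "deal_stage", "lead_source", "product"]

# Random forest needs numeric inputs, so one-hot encode the categorical columns.
X = pd.get_dummies(deals[features])
y = deals["deal_won"]

# Split into train (60%), cross-validation (20%) and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

model = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
model.fit(X_train, y_train)

print("validation accuracy:", model.score(X_val, y_val))  # used while tuning hyperparameters
print("test accuracy:", model.score(X_test, y_test))      # final check on unseen data
```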

Be aware that whatever goes into the training data is the model's only source of reference when making a decision. The prediction is only as good as the data being used.

The entire learning here is human-assisted, i.e., the individual's domain knowledge assists the machine in identifying features, cleaning the data and defining the target. Hence the name supervised machine learning.

Machines can also be made to learn a few other things without human assistance; this is called unsupervised machine learning. We will discuss it in the next article!

Feedback about the article is much appreciated. Thank you.
