Notes on K-prototype for clustering mixed-type data

Guruprasad
Jun 13, 2019

The problem:

In the world of clustering, k-means is a popular algorithm: records are grouped (clustered) according to their Euclidean distance to the ‘k’ centroids. But our data points do not always contain only numeric values. There will be categorical attributes, drawn from exhaustive or non-exhaustive lists of values. The general approach is to encode those categorical values and then run k-means or any other distance-based or density-based clustering algorithm.

Intuitively, we can’t meaningfully quantify a categorical value as a number and cluster on that number; the output of the clustering then depends directly on the arbitrary encoded value of each categorical variable.

Hence the need for an alternate approach to handle categorical data in a more intuitive way.

Solution:

Overview of the algorithm:

Just like k-means, where we allocate each record to the closest centroid (a reference point for the cluster), here we allocate each record to the cluster with the most similar reference point, a.k.a. the prototype (or centroid) of the cluster.

Rather than similarity, the algorithm measures the dissimilarity between data points and prototypes, and groups points with low dissimilarity into a cluster.

The dissimilarity measure for numeric attributes is the squared Euclidean distance, whereas the dissimilarity measure for categorical attributes is the number of mismatched attributes between objects and cluster prototypes.

D(x,p) = E(x,p) + λC(x,p)

Where,
x = any data point,
p = prototype of a cluster,
D(x,p) = dissimilarity measure between x and p,
E(x,p) = squared Euclidean distance between the continuous attributes of x and p,
C(x,p) = number of mismatched categorical attributes between x and p,
λ = weight for the categorical part.
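The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a library function: the split into numeric and categorical columns, the toy values, and the default λ of 0.5 are all assumptions made for the example.

```python
import numpy as np

# A minimal sketch of the mixed dissimilarity D(x, p) = E(x, p) + lambda * C(x, p).
# The column split and lam=0.5 are illustrative assumptions, not fixed choices.
def dissimilarity(x_num, x_cat, p_num, p_cat, lam=0.5):
    # E(x, p): squared Euclidean distance over the numeric attributes
    e = np.sum((np.asarray(x_num) - np.asarray(p_num)) ** 2)
    # C(x, p): count of mismatched categorical attributes
    c = sum(a != b for a, b in zip(x_cat, p_cat))
    return e + lam * c

d = dissimilarity([1.0, 2.0], ["red", "S"], [1.5, 2.0], ["red", "M"])
# numeric part = 0.25, plus one categorical mismatch weighted by 0.5
```

Note how λ trades off the two parts: a larger λ makes categorical mismatches matter more relative to numeric distance.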

K-prototype:

The algorithm is built on three steps:
- Initial prototype selection
- Initial allocation
- Re-allocation

Initial prototypes selection:

Let’s assume that we know the value of k. Many variants of the k-prototype algorithm differ in the initial prototype selection step. The popular approaches are those of Huang and Cao.
Huang suggests two initial mode selection methods:

  1. Select the first k distinct objects from the data set as the initial k-modes.
  2. Assign the most frequent categories equally to the initial k-modes.

while the Cao approach selects prototypes based on two factors:
the density of each data point and its dissimilarity value.

These two initialization methods are a separate topic in themselves; we will cover them in another post. For now, let us assume that we initialize the prototypes using one of the above-mentioned methods.
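Huang’s first method is simple enough to sketch here: scan the data and take the first k distinct records as prototypes. The toy data and the value of k are illustrative assumptions.

```python
# A minimal sketch of Huang's first initialization method: take the
# first k distinct records in the data set as the initial prototypes.
# Records are tuples here so they can be compared for distinctness.
def first_k_distinct(records, k):
    seen, prototypes = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            prototypes.append(rec)
        if len(prototypes) == k:
            break
    return prototypes

data = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]
first_k_distinct(data, 2)  # → [("a", 1), ("b", 2)]
```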

Initial allocation

Once the initial prototypes (centroids) are chosen, we compare each data point against every centroid and assign it to the cluster whose prototype has the least dissimilarity.

After allocation, each prototype is updated by taking the mean of each numeric attribute and the mode of each categorical attribute over the data points in that cluster.

This is done for all the data points.
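The allocation and prototype-update steps can be sketched as below. Again this is a rough illustration, not a production implementation: the numeric/categorical column split and λ are assumptions carried over from the dissimilarity example.

```python
import numpy as np
from collections import Counter

# Assign each record to the cluster whose prototype is least dissimilar.
# Numeric and categorical attributes are passed as parallel lists; lam
# is the categorical weight (illustrative default).
def assign(records_num, records_cat, protos_num, protos_cat, lam=0.5):
    labels = []
    for xn, xc in zip(records_num, records_cat):
        costs = [np.sum((np.asarray(xn) - np.asarray(pn)) ** 2)
                 + lam * sum(a != b for a, b in zip(xc, pc))
                 for pn, pc in zip(protos_num, protos_cat)]
        labels.append(int(np.argmin(costs)))
    return labels

# Recompute each prototype: mean of numeric attributes, mode of
# categorical attributes, over the records in that cluster.
def update_prototypes(records_num, records_cat, labels, k):
    protos_num, protos_cat = [], []
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        protos_num.append(np.mean([records_num[i] for i in idx], axis=0))
        cats = [records_cat[i] for i in idx]
        protos_cat.append(tuple(Counter(col).most_common(1)[0][0]
                                for col in zip(*cats)))
    return protos_num, protos_cat
```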

Re-allocation:

After the initial allocation of records, one can see that the current centroid of a cluster will usually differ from the initial centroid (it may not, but in the typical case it will).

So with the current set of centroids, we run step 2 again and reallocate each data point to a different cluster if the new prototype is less dissimilar.

Since cluster prototypes are recalculated after each reallocation, this repeats until no data point changes cluster.
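The overall loop can be sketched compactly. Here `assign_clusters` and `update_prototypes` are hypothetical helpers standing in for the two steps described above; only the loop structure is the point.

```python
# A compact sketch of the k-prototype outer loop: reallocate records and
# recompute prototypes until no record changes cluster. The helper
# functions are hypothetical stand-ins for the steps described above.
def k_prototypes(data, prototypes, assign_clusters, update_prototypes):
    labels = None
    while True:
        new_labels = assign_clusters(data, prototypes)
        if new_labels == labels:   # no reallocation: converged
            return labels, prototypes
        labels = new_labels
        prototypes = update_prototypes(data, labels)
```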

So this is the overview of the k-prototype algorithm. A Python implementation is available in the kmodes package linked in the credits below.

Conclusion:

It can be understood that

K-prototype = k-means for numeric + K-modes for categorical variables

P.S: The purpose of writing this article was to help myself understand the algorithm better. If it helped you, or if there is any mistake in the above text, kindly share your comments and feedback. Thanks for reading.

Credits:
https://github.com/nicodv/kmodes
