Materials and Assignments

Resources for today's materials can be downloaded from this link. The assignment notebooks are already included in the zip file. You can also access the Assignment 5 Google Colab notebook directly from this link.

Dataset for Assignment (5)

The Christmas trees dataset is used for this assignment. You can download the dataset separately from this link. ~50 images in the dataset are left unlabelled, and you will need to label them manually.

Assignment Submission

The Week 3 assignment deadline is 30th December 2020, 11:59 A.M. Submit your assignment via this link.

Unsupervised Learning: k-Means Clustering

One of the most widely used clustering methods is k-means, which is also one of the simplest.
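
Before we use the library implementation, here is a rough picture of what k-means actually does: it alternates between assigning each point to its nearest cluster center and moving each center to the mean of the points assigned to it, until the centers stop moving. Below is a minimal NumPy sketch of this loop (illustrative only, not scikit-learn's actual implementation).

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. initialize centers by picking k random samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # 3. move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return centers, labels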

Import libraries

Let's import the libraries necessary for our experiments.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
from sklearn.model_selection import train_test_split
import cv2
import pandas as pd
import random

Straight Line

First, let's run k-means on a straight line to see what happens. Here, we generate 30 data points along the x-axis.

X1 = 5*np.random.rand(30,1)
X2 = np.zeros((30,1))
# X1.reshape(-1, 1)
# X2.reshape(-1, 1)
plt.scatter(X1,X2)
X = np.hstack((X1,X2))

Let's make sure the samples lie along the first dimension, i.e. the array has shape (n_samples, n_features):

np.shape(X)
(30, 2)

Now, we will tell k-means to divide these points into three clusters:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=150, alpha=0.5);

We will now try out k-means on sklearn-generated data. Let's make five random blobs for this test.

from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=500, centers=5,
                       cluster_std=0.8, random_state=3)
plt.scatter(X[:, 0], X[:, 1], s=20);

Since there are five blobs, we assign k=5:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

Let's visualize the data as usual.

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=150, alpha=0.5);
iterations = kmeans.n_iter_
print(iterations)
2

In this case, all the means converged to the center of their respective blobs in just 2 iterations.
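
Here we knew k=5 in advance only because we generated five blobs ourselves. When the number of clusters is not known, a common heuristic is the elbow method: run k-means for a range of k values and look at where the inertia (sum of squared distances to the nearest center) stops dropping sharply. A short sketch on the same blob data:

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=3).fit(X)
    inertias.append(km.inertia_)
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia');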

Custom dataset

Next, let's try out k-means on our custom dataset. The Iris dataset is used yet again for this experiment.

df = pd.read_csv('iris_data.csv')
df.head()
sepal_length sepal_width petal_length petal_width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa

Extract X values for clustering

As k-means does not require labels, we assign all the features to X.

X1 = df.iloc[:,0:4]
X1 = X1.values
plt.scatter(X1[:, 2], X1[:, 3], s=50)

# X1 = df.iloc[:,2:4]
# X1 = X1.values
# plt.scatter(X1[:, 0], X1[:, 1], s=50)

As we have three classes in our dataset, we set k=3:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init= 10)
kmeans.fit(X1)
y1_kmeans = kmeans.predict(X1)

Let's plot our output on the petal length and petal width axes:

plt.scatter(X1[:, 2], X1[:, 3], c=y1_kmeans, s=30, cmap='viridis', alpha = 0.6)
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 2], centers[:, 3], c='black', s=150, alpha=0.5);

Evaluation

Since k-means is a clustering model, it assigns each cluster an arbitrary integer ID, and these IDs may not come out in the same order as our class labels. To evaluate the model, we therefore convert the true labels to matching integers.

y1_kmeans
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1])
y1 = df.loc[:,'variety']
y1 = y1.replace(['Setosa','Versicolor','Virginica'],[0,1,2])
y1 = y1.values
y1
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)
from sklearn.metrics import accuracy_score
accuracy_score(y1_kmeans, y1)
0.8933333333333333
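
Note that comparing y1_kmeans and y1 directly only works here because the cluster IDs happen to line up with our label encoding; with a different random initialization the clusters could come out in a different order and the accuracy would look artificially low. A more robust approach is to first map each cluster to the majority true label among its members. A minimal sketch (the align_cluster_labels helper below is an illustrative addition, not part of the lecture code):

def align_cluster_labels(y_pred, y_true):
    # map each cluster ID to the most common true label inside that cluster
    mapped = np.zeros_like(y_pred)
    for cluster in np.unique(y_pred):
        mask = (y_pred == cluster)
        mapped[mask] = np.bincount(y_true[mask]).argmax()
    return mapped

accuracy_score(y1, align_cluster_labels(y1_kmeans, y1))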

Let's see what our confusion matrix looks like:

from sklearn.metrics import confusion_matrix
import seaborn as sns

mat = confusion_matrix(y1, y1_kmeans)
sns.heatmap(mat.T, square=True, annot=True,fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');

When not to use k-means

k-means draws linear (straight-line) boundaries between clusters, so it won't be optimal for every dataset. Let's see how k-means performs on a dataset with non-linear structure.

from sklearn.datasets import make_moons
X, y = make_moons(300, noise=.05, random_state=0)
labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');

For data like this, another method called SpectralClustering can be used instead. We won't dig deeper into this method today; if you want to find out how it works, this article is recommended.

from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis');

Example : k-means for color compression

Color compression within images is another interesting application of k-means clustering. Imagine, for example, an image with millions of possible colors. In most images, a large number of those colors go unused, and many pixels have colors that are similar or even identical. In this experiment, we compress an image down to just 16 representative colors.

plt.figure(figsize=(10,10))
img = cv2.imread("images/TDG.jpg")
img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(img);
img.shape
(810, 1080, 3)

Here, we flatten the pixels into a single list of RGB values so that they can be fed into k-means.

data = img / 255.0 # use 0...1 scale
data = data.reshape(img.shape[0] * img.shape[1], 3)
data.shape
(874800, 3)

Here, we have 874,800 data points to cluster, which could be a little too much for standard k-means. Thus, we use MiniBatchKMeans to fit our data points.

import warnings; warnings.simplefilter('ignore')  # Fix NumPy issues.

from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]

Let's recolor our image with these compressed colors and see the difference:

img_recolored = new_colors.reshape(img.shape)

fig, ax = plt.subplots(1, 2, figsize=(16, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(img)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(img_recolored)
ax[1].set_title('16-color Image', size=16);

We can see that not much detail is lost during the process, and the colors are compressed efficiently.
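
If you want to quantify the compression, one quick check is to count how many distinct colors appear before and after recoloring. A small illustrative snippet (not part of the original notebook):

# count distinct RGB rows in the flattened pixel arrays
original_colors = np.unique(data, axis=0).shape[0]
compressed_colors = np.unique(new_colors, axis=0).shape[0]
print(original_colors, 'colors reduced to', compressed_colors)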

References

This lecture notebook is referenced from PythonDataScienceHandbook. Follow the link if you want more intuition about this model.