Week 3, Day 1 (Dataset Preparation and Arrangement)
Welcome to the first day of Week 3 of the McE-51069 course.
- Datasets
You can download the resources for today from this link. We have also posted a guide video on downloading and accessing the materials on our YouTube channel.
Datasets come in different forms from various sources. So the question here is: what exactly is a dataset, and how do we handle datasets for machine learning? Before we can experiment, we must first know how to manipulate a dataset.
Pandas is a Python library for data manipulation and analysis. In this section, we will give a brief introduction to pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cv2
import math
%matplotlib inline
Pandas stores data in DataFrame objects. We can assign a NumPy array or a list to each column to create a DataFrame.
names = ['Jack','Jean','Jennifer','Jimmy']
ages = np.array([23,22,24,21])
# print(type(names))
# print(type(ages))
df = pd.DataFrame({'name': names,
                   'age': ages,
                   'city': ['London', 'Berlin', 'New York', 'Sydney']}, index=None)
df.head()
# df.style.hide_index()
Now, let's see some handy dataframe tricks.
df[['name','city']]
df.info()
# print(df.columns)
# print(df.age)
Now that we know how to create a dataframe, we can save the one we created to a CSV file and read it back.
df.to_csv('Ages_and_cities.csv',index=False,header=True)
df = pd.read_csv('Ages_and_cities.csv')
df.head()
In this section, we use the Iris flower dataset, which contains petal and sepal measurements of three species of Iris flowers.
This dataset was introduced by the biologist Ronald Fisher in his 1936 paper. The following figure explains how the length and width are measured for the petal and sepal of each flower.
When we observe the dataset, we discover that it has four features and three unique labels, one for each flower species.
df = pd.read_csv('iris_data.csv')
df.head()
# df.head(3)
df.tail()
Now that we understand our dataset, let's separate the data by label for per-class visualization.
df.loc[80:85,("sepal_length","variety")]
# df.iloc[80:85,2:5]
df.iloc[80:85,[0,4]]
# Split the dataframe into one frame per class
Se = df.loc[df.variety == 'Setosa', :]
Vc = df.loc[df.variety == 'Versicolor', :]
Vi = df.loc[df.variety == 'Virginica', :]
Vi.head()
df = pd.read_csv('iris_data.csv')
# df.dtypes
First, we will visualize each measurement with histograms to observe the output distribution for each class.
plt.figure(figsize=(15,15))
plt.subplot(2, 2, 1)
plt.hist(Se.sepal_length,bins=15,color="steelblue",edgecolor='black',alpha =0.4, label="Setosa")
plt.hist(Vc.sepal_length,bins=15,color='red',edgecolor='black', alpha =0.3, label="Versicolor")
plt.hist(Vi.sepal_length,bins=15,color='blue',edgecolor='black', alpha =0.3, label="Virginica")
plt.title("sepal length distribution"), plt.xlabel('cm')
plt.legend()
plt.subplot(2, 2, 2)
plt.hist(Se.sepal_width,bins=15,color="steelblue",edgecolor='black',alpha =0.4, label="Setosa")
plt.hist(Vc.sepal_width,bins=15,color='red',edgecolor='black', alpha =0.3, label="Versicolor")
plt.hist(Vi.sepal_width,bins=15,color='blue',edgecolor='black', alpha =0.3, label="Virginica")
plt.title("sepal width distribution"), plt.xlabel('cm')
plt.legend()
plt.subplot(2, 2, 3)
plt.hist(Se.petal_length,bins=10,color="steelblue",edgecolor='black',alpha =0.4, label="Setosa")
plt.hist(Vc.petal_length,bins=10,color='red',edgecolor='black', alpha =0.3, label="Versicolor")
plt.hist(Vi.petal_length,bins=10,color='blue',edgecolor='black', alpha =0.3, label="Virginica")
plt.title("petal length distribution"), plt.xlabel('cm')
plt.legend()
plt.subplot(2, 2, 4)
plt.hist(Se.petal_width,bins=10,color="steelblue",edgecolor='black',alpha =0.4, label="Setosa")
plt.hist(Vc.petal_width,bins=10,color='red',edgecolor='black', alpha =0.3, label="Versicolor")
plt.hist(Vi.petal_width,bins=10,color='blue',edgecolor='black', alpha =0.3, label="Virginica")
plt.title("petal width distribution"), plt.xlabel('cm')
plt.legend()
Now, we will visualize multiple features with scatter plots to gain some more insights.
plt.figure(figsize=(15,15))
area = np.pi*20
plt.subplot(2, 2, 1)
plt.scatter(Se.sepal_length,Se.sepal_width, s=area, c="steelblue", alpha=0.6, label="Setosa")
plt.scatter(Vc.sepal_length,Vc.sepal_width, s=area, c="red", alpha=0.6, label="Versicolor")
plt.scatter(Vi.sepal_length,Vi.sepal_width, s=area, c="blue", alpha=0.5, label="Virginica")
plt.title("sepal length Vs sepal width"), plt.xlabel('cm'), plt.ylabel('cm')
plt.legend()
plt.subplot(2, 2, 2)
plt.scatter(Se.petal_length,Se.petal_width, s=area, c="steelblue", alpha=0.6, label="Setosa")
plt.scatter(Vc.petal_length,Vc.petal_width, s=area, c="red", alpha=0.6, label="Versicolor")
plt.scatter(Vi.petal_length,Vi.petal_width, s=area, c="blue", alpha=0.5, label="Virginica")
plt.title("petal length Vs petal width"), plt.xlabel('cm'), plt.ylabel('cm')
plt.legend()
plt.subplot(2, 2, 3)
plt.scatter(Se.sepal_length,Se.petal_length, s=area, c="steelblue", alpha=0.6, label="Setosa")
plt.scatter(Vc.sepal_length,Vc.petal_length, s=area, c="red", alpha=0.6, label="Versicolor")
plt.scatter(Vi.sepal_length,Vi.petal_length, s=area, c="blue", alpha=0.5, label="Virginica")
plt.title("sepal length Vs petal length"), plt.xlabel('cm'), plt.ylabel('cm')
plt.legend()
plt.subplot(2, 2, 4)
plt.scatter(Se.sepal_width,Se.petal_width, s=area, c="steelblue", alpha=0.6, label="Setosa")
plt.scatter(Vc.sepal_width,Vc.petal_width, s=area, c="red", alpha=0.6, label="Versicolor")
plt.scatter(Vi.sepal_width,Vi.petal_width, s=area, c="blue", alpha=0.5, label="Virginica")
plt.title("sepal width Vs petal width"), plt.xlabel('cm'), plt.ylabel('cm')
plt.legend()
We can definitely see some blobs forming in these visualizations. The "Setosa" class usually stands out from the other two, but the sepal length vs sepal width plot shows that the "Versicolor" and "Virginica" classes will be more challenging to classify than "Setosa".
Scikit-learn is a free machine learning library for Python that features various classification, regression, and clustering algorithms.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
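As a quick taste of seaborn's high-level interface, a single pairplot call reproduces the pairwise scatter views we built by hand above. This is a minimal sketch; it assumes df still holds the iris data loaded earlier.
import seaborn as sns

# One call draws every pairwise scatter plot, coloured by class
sns.pairplot(df, hue='variety')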
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sns
df = pd.read_csv('iris_data.csv')
# df.dtypes
df.tail()
train_X, test_X, train_y, test_y = train_test_split(df[df.columns[0:4]].values,
df.variety.values, test_size=0.25)
modelDT = DecisionTreeClassifier().fit(train_X, train_y)
DT_predicted = modelDT.predict(test_X)
modelRF = RandomForestClassifier().fit(train_X, train_y)
RF_predicted = modelRF.predict(test_X)
print(metrics.classification_report(test_y, DT_predicted))
mat = metrics.confusion_matrix(test_y, DT_predicted)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
print(metrics.classification_report(test_y, RF_predicted))
from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(test_y, RF_predicted)
sns.heatmap(mat.T, square=True, annot=True,fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
When generating new features, the product of two features is usually not recommended unless it magnifies a meaningful distinction in the data. Here, we engineer two new features: the petal hypotenuse and the petal product.
df = pd.read_csv('iris_data.csv')
df['petal_hypotenuse'] = np.sqrt(df["petal_length"]**2+df["petal_width"]**2)
df['petal_product']=df["petal_length"]*df["petal_width"]
df.tail()
Se = df.loc[df.variety == 'Setosa', :]
Vc = df.loc[df.variety == 'Versicolor', :]
Vi = df.loc[df.variety == 'Virginica', :]
plt.figure(figsize=(16,8))
plt.subplot(1, 2, 1)
plt.hist(Se.petal_hypotenuse,bins=10,color="steelblue",edgecolor='black',alpha =0.4 , label="Setosa")
plt.hist(Vc.petal_hypotenuse,bins=10,color='red',edgecolor='black', alpha =0.3, label="Versicolor")
plt.hist(Vi.petal_hypotenuse,bins=10,color='blue',edgecolor='black', alpha =0.3, label="Virginica")
plt.legend()
plt.title("petal hypotenuse distribution"), plt.xlabel('cm')
plt.subplot(1, 2, 2)
plt.hist(Se.petal_product,bins=10,color="steelblue",edgecolor='black',alpha =0.4, label="Setosa")
plt.hist(Vc.petal_product,bins=10,color='red',edgecolor='black', alpha =0.3, label="Versicolor")
plt.hist(Vi.petal_product,bins=10,color='blue',edgecolor='black', alpha =0.3, label="Virginica")
plt.legend()
plt.title("petal product distribution"), plt.xlabel('cm')
plt.figure(figsize=(10,10))
area = np.pi*20
plt.scatter(Se.petal_hypotenuse,Se.petal_product, s=area, c="steelblue", alpha=0.6, label="Setosa")
plt.scatter(Vc.petal_hypotenuse,Vc.petal_product, s=area, c="red", alpha=0.6, label="Versicolor")
plt.scatter(Vi.petal_hypotenuse,Vi.petal_product, s=area, c="blue", alpha=0.5, label="Virginica")
plt.title("petal hypotenuse Vs petal product"), plt.xlabel('cm'), plt.ylabel('cm^2')
plt.legend()
Now, let's replace two petal features with two new features we generated.
df.head()
df2 = df.loc[:,["sepal_length","sepal_width","petal_hypotenuse","petal_product","variety"]]
df2.dtypes
train_X, test_X, train_y, test_y = train_test_split(df2[df2.columns[0:4]].values,
df2.variety.values, test_size=0.25)
from sklearn.tree import DecisionTreeClassifier
modelDT = DecisionTreeClassifier().fit(train_X, train_y)
DT_predicted = modelDT.predict(test_X)
from sklearn.ensemble import RandomForestClassifier
modelRF = RandomForestClassifier().fit(train_X, train_y)
RF_predicted = modelRF.predict(test_X)
print(metrics.classification_report(test_y, DT_predicted))
# print(metrics.classification_report(test_y, RF_predicted))
from sklearn.metrics import confusion_matrix
import seaborn as sns
mat = confusion_matrix(test_y, DT_predicted)
# mat = confusion_matrix(test_y, RF_predicted)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');
Reference - Python Data Science Handbook
For classification models, we have a single label for every image in the same class, so annotations can be made very easily.
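For example, a common convention is to keep one folder per class and derive the label from the folder name. Here is a minimal sketch of that idea; the 'dataset' directory layout is hypothetical.
import os

# Hypothetical layout: dataset/<class_name>/<image_file>
dataset_dir = 'dataset'

image_paths, labels = [], []
for class_name in sorted(os.listdir(dataset_dir)):
    class_dir = os.path.join(dataset_dir, class_name)
    if not os.path.isdir(class_dir):
        continue
    for file_name in os.listdir(class_dir):
        image_paths.append(os.path.join(class_dir, file_name))
        labels.append(class_name)  # the folder name is the label

print(len(image_paths), 'images,', len(set(labels)), 'classes')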
We usually use rectangular bounding boxes for object detection. Detection models like YOLO and Faster R-CNN use this type of annotation. Bounding boxes are usually represented either by the coordinates (x1, y1) of one corner and (x2, y2) of the opposite corner, or by one corner followed by the width and height of the box.
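To make the two conventions concrete, here is a minimal sketch (with made-up example numbers) that converts a two-corner box (x1, y1, x2, y2) into the normalized, center-based format that YOLO expects:
def corners_to_yolo(x1, y1, x2, y2, img_w, img_h):
    # Convert a two-corner box into normalized (x_center, y_center, w, h)
    box_w = x2 - x1
    box_h = y2 - y1
    x_center = x1 + box_w / 2
    y_center = y1 + box_h / 2
    # YOLO stores every value as a fraction of the image size
    return x_center / img_w, y_center / img_h, box_w / img_w, box_h / img_h

# Example: a box from (50, 120) to (150, 280) in a 640x480 image
print(corners_to_yolo(50, 120, 150, 280, 640, 480))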
Bounding boxes are simple but not ideal for all types of objects, as we have to frame every object in a rectangular box. To solve this problem, polygonal segmentation was introduced. With this method, we can annotate the exact outline of the objects with polygons. The image below is from one of my projects for segmentation of temples in ASEAN.
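A polygon annotation is simply an ordered list of (x, y) vertices. As a minimal sketch with made-up vertices, we can rasterize such a polygon into a binary mask with OpenCV:
import numpy as np
import cv2
import matplotlib.pyplot as plt

# A hypothetical polygon annotation: an ordered list of (x, y) vertices
polygon = np.array([[60, 20], [110, 60], [90, 120], [30, 110], [20, 50]],
                   dtype=np.int32)

# Rasterize the polygon into a binary mask for a 128x128 image
mask = np.zeros((128, 128), dtype=np.uint8)
cv2.fillPoly(mask, [polygon], 255)

plt.imshow(mask, cmap='gray')
plt.title('polygon mask')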
Semantic segmentation takes annotation to the pixel level: a particular class is assigned to every pixel in the image. It is used mainly in situations where there is very significant environmental context, for instance in self-driving cars and robotics, so that the models understand the environment in which they operate.
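In other words, a semantic segmentation label is just an image-sized array of class IDs. A tiny illustrative sketch (the class IDs and regions are made up):
import numpy as np

# Hypothetical class IDs: 0 = background, 1 = road, 2 = car
label_mask = np.zeros((4, 6), dtype=np.uint8)  # one class ID per pixel
label_mask[2:, :] = 1      # bottom rows labelled as road
label_mask[1:3, 2:4] = 2   # a small region labelled as car
print(label_mask)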
References
TensorFlow, Google Images, Sabina Pokhrel's article, Cityscapes dataset