

In this assignment, we will implement k-NN and logistic regression classifiers to detect "fake" banknotes and analyze the comparative importance of features in predicting accuracy.

For the dataset, we use the "banknote authentication" dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/banknote+authentication

Dataset Description: From the website: "This dataset contains 1,372 examples of both fake and real banknotes. Data were extracted from images that were taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400x400 pixels. Due to the object lens and distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were gained. A Wavelet Transform tool was used to extract features from the images."

There are 4 continuous attributes (features) and a class:

1. f1 - variance of wavelet transformed image

2. f2 - skewness of wavelet transformed image

3. f3 - curtosis of wavelet transformed image

4. f4 - entropy of image

5. class (integer)

(BU MET CS-677: Data Science With Python, v.2.0. CS-677 Assignment: kNN & Logistic Regression (banknotes))

In other words, assume that you have a machine that examines a banknote and computes 4 attributes (step 1). Then each banknote is examined by a much more expensive machine and/or by human expert(s) and classified as fake or real (step 2). The second step is very time-consuming and expensive. You want to build a classifier that would give you results after step 1 only.

We assume that class 0 represents good banknotes. We will use the color "green" or "+" for legitimate banknotes. Class 1 banknotes are assumed to be fake, and we will use the color "red" or "−" for counterfeit banknotes. These are the "true" labels.

Question 1:

1. load the data into a dataframe and add a column "color". For each class-0 row this should contain "green", and for each class-1 row it should contain "red"

2. for each class and for each feature f1, f2, f3, f4, compute its mean µ() and standard deviation σ(). Round the results to 2 decimal places and summarize them in a table as shown below:

3. examine your table. Are there any obvious patterns in the distribution of banknotes in each class?


class  µ(f1)  σ(f1)  µ(f2)  σ(f2)  µ(f3)  σ(f3)  µ(f4)  σ(f4)
0
1
all
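The loading and summary steps for Question 1 can be sketched as follows. A tiny inline sample stands in for the real UCI file (which is a headerless CSV); the column names `f1`–`f4` follow the feature list above, and the sample values are made up for illustration only:

```python
import pandas as pd

# Tiny inline sample standing in for the UCI file (the real file is a
# headerless CSV; you would load it with pd.read_csv(path, header=None,
# names=["f1", "f2", "f3", "f4", "class"])).
data = [
    [3.62, 8.67, -2.81, -0.45, 0],
    [4.55, 8.17, -2.46, -1.46, 0],
    [-1.40, -1.32, 2.82, 0.07, 1],
    [-2.34, -3.30, 3.29, 0.50, 1],
]
df = pd.DataFrame(data, columns=["f1", "f2", "f3", "f4", "class"])

# add the "color" column: green for class 0, red for class 1
df["color"] = df["class"].map({0: "green", 1: "red"})

# per-class mean/std for each feature, rounded to 2 decimal places;
# the "all" row of the table is the same .agg() over the whole frame
stats = df.groupby("class")[["f1", "f2", "f3", "f4"]].agg(["mean", "std"]).round(2)
print(stats)
```

`stats` holds the class-0 and class-1 rows of the table; applying the same `.agg(["mean", "std"])` to the full dataframe gives the "all" row.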

Question 2:

1. split your dataset X into training Xtrain and testing Xtest parts (50/50 split). Using "pairplot" from the seaborn package, plot pairwise relationships in Xtrain separately for class 0 and class 1. Save your results into 2 pdf files, "good bills.pdf" and "fake bills.pdf"

2. visually examine your results. Come up with three simple comparisons that you think may be sufficient to detect a fake bill. For example, your classifier may look like this:

# assume you are examining a bill
# with features f_1, f_2, f_3 and f_4;
# your rule may look like this:
if (f_1 > 4) and (f_2 > 8) and (f_4 < 25):
    x = "good"
else:
    x = "fake"


3. apply your simple classifier to Xtest and compute predicted class labels

4. comparing your predicted class labels with true labels, compute the following:

(a) TP - true positives (your predicted label is + and the true label is +)

(b) FP - false positives (your predicted label is + but the true label is −)

(c) TN - true negatives (your predicted label is − and the true label is −)

(d) FN - false negatives (your predicted label is − but the true label is +)

(e) TPR = TP/(TP + FN) - true positive rate. This is the fraction of positive labels that you predicted correctly. This is also called sensitivity, recall or hit rate.

(f) TNR = TN/(TN + FP) - true negative rate. This is the fraction of negative labels that you predicted correctly. This is also called specificity or selectivity.

5. summarize your findings in the table as shown below:

6. does your simple classifier give you higher accuracy on identifying "fake" bills or "real" bills? Is your accuracy better than 50% ("coin" flipping)?


TP FP TN FN accuracy TPR TNR
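Steps 3 and 4 can be sketched as below. The four-row stand-in for Xtest and the rule's threshold values are illustrative assumptions, not values derived from the real data (your own thresholds come from inspecting the pairplots, each of which is a single `seaborn.pairplot(...)` call saved via its `.savefig(...)` method):

```python
import pandas as pd

# hypothetical slice of Xtest with true labels; 0 = good (+), 1 = fake (−)
X_test = pd.DataFrame({
    "f1": [4.5, -2.0, 5.1, -1.2],
    "f2": [9.0, -3.0, 8.5, -2.5],
    "f4": [1.0, 0.3, -0.2, 0.6],
    "class": [0, 0, 1, 1],
})

# simple hand-made rule: predict "good" (class 0) when all conditions hold
pred = ((X_test["f1"] > 4) & (X_test["f2"] > 8)
        & (X_test["f4"] < 25)).map({True: 0, False: 1})

true = X_test["class"]
TP = int(((pred == 0) & (true == 0)).sum())   # predicted +, truly +
FP = int(((pred == 0) & (true == 1)).sum())   # predicted +, truly −
TN = int(((pred == 1) & (true == 1)).sum())   # predicted −, truly −
FN = int(((pred == 1) & (true == 0)).sum())   # predicted −, truly +

accuracy = (TP + TN) / len(true)
TPR = TP / (TP + FN)   # sensitivity / recall
TNR = TN / (TN + FP)   # specificity
```

On this made-up sample the rule gets one of each outcome (TP = FP = TN = FN = 1), i.e. 50% accuracy; on the real data your three comparisons should do better than that.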

Question 3 (use the k-NN classifier from the sklearn library):

1. take k = 3, 5, 7, 9, 11. Use the same Xtrain and Xtest as before. For each k, train your k-NN classifier on Xtrain and compute its accuracy on Xtest

2. plot a graph showing the accuracy: on the x-axis you plot k and on the y-axis you plot accuracy. What is the optimal value k* of k?

3. use the optimal value k* to compute performance measures and summarize them in the table

TP FP TN FN accuracy TPR TNR

4. is your k-NN classifier better than your simple classifier for any of the measures from the previous table?

5. consider a bill x that contains the last 4 digits of your BUID as feature values. What is the class label predicted for this bill by your simple classifier? What is the label for this bill predicted by k-NN using the best k*?
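The k-loop in step 1 can be sketched as follows. Two synthetic Gaussian blobs stand in for the banknote features so the snippet is self-contained (an assumption; with the real data you would reuse the Xtrain/Xtest split from Question 2). The accuracy plot of step 2 is then one `matplotlib` line plot over the resulting dictionary:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the 4 banknote features: two separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (100, 4)),    # class 0, "good"
               rng.normal(-2.0, 1.0, (100, 4))])  # class 1, "fake"
y = np.array([0] * 100 + [1] * 100)

# 50/50 split, as in Question 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# train one k-NN model per k and record its test accuracy
accuracies = {}
for k in [3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = knn.score(X_test, y_test)

# k* is the k with the highest test accuracy
best_k = max(accuracies, key=accuracies.get)
print(best_k, accuracies[best_k])
```

For the BUID bill of step 5, the prediction is simply `knn.predict([[d1, d2, d3, d4]])` with your four digits as the (hypothetical) feature values.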

Question 4: One of the fundamental questions in machine learning is "feature selection". We try to come up with the smallest number of features that still retains good accuracy. The natural question is which features are important and which can be dropped.

1. take your best value k*. For each of the four features f1, . . . , f4, drop that feature from both Xtrain and Xtest. Train your classifier on the "truncated" Xtrain and predict labels on Xtest using just the 3 remaining features. You will repeat this for 4 cases: (1) just f1 is missing, (2) just f2 is missing, (3) just f3 is missing and (4) just f4 is missing. Compute the accuracy for each of these scenarios.

2. did accuracy increase in any of the 4 cases compared with the accuracy when all 4 features are used?

3. which feature, when removed, contributed the most to the loss of accuracy?

4. which feature, when removed, contributed the least to the loss of accuracy?

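The drop-one-feature loop can be sketched as below, again on synthetic stand-in data so it runs on its own; `best_k = 5` is a placeholder for the k* you actually found in Question 3:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data (see the k-NN sketch above)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (100, 4)),
               rng.normal(-2.0, 1.0, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

best_k = 5  # placeholder for the k* found in Question 3

# baseline: accuracy with all 4 features
full_acc = KNeighborsClassifier(n_neighbors=best_k).fit(
    X_train, y_train).score(X_test, y_test)

# drop f1, f2, f3, f4 in turn and retrain on the 3 remaining columns
drop_acc = {}
for j in range(4):
    keep = [c for c in range(4) if c != j]
    knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train[:, keep], y_train)
    drop_acc[f"f{j+1}"] = knn.score(X_test[:, keep], y_test)

# biggest accuracy loss -> most important feature, and vice versa
most_important = min(drop_acc, key=drop_acc.get)
least_important = max(drop_acc, key=drop_acc.get)
```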


Question 5 (use the logistic regression classifier from the sklearn library):

1. Use the same Xtrain and Xtest as before. Train your logistic regression classifier on Xtrain and compute its accuracy on Xtest

2. summarize your performance measures in the table

TP FP TN FN accuracy TPR TNR

3. is your logistic regression better than your simple classifier for any of the measures from the previous table?

4. is your logistic regression better than your k-NN classifier (using the best k*) for any of the measures from the previous table?

5. consider a bill x that contains the last 4 digits of your BUID as feature values. What is the class label predicted for this bill x by logistic regression? Is it the same label as predicted by k-NN?
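A minimal sketch of steps 1–2, on the same kind of synthetic stand-in data as before (an assumption; with the real data you would reuse your existing split). The counts follow the Question 2 convention that "positive" means class 0, a good bill:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the banknote features
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(2.0, 1.0, (100, 4)),    # class 0, "good"
               rng.normal(-2.0, 1.0, (100, 4))])  # class 1, "fake"
y = np.array([0] * 100 + [1] * 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

# "positive" = class 0 (good bill), matching the assignment's convention
TP = int(((pred == 0) & (y_test == 0)).sum())
FP = int(((pred == 0) & (y_test == 1)).sum())
TN = int(((pred == 1) & (y_test == 1)).sum())
FN = int(((pred == 1) & (y_test == 0)).sum())

accuracy = (TP + TN) / len(y_test)
TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
```

As with k-NN, the BUID bill of step 5 is classified with `clf.predict([[d1, d2, d3, d4]])`, where the digits are the assumed feature values.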


Question 6: We will investigate the change in accuracy when removing one feature. This is similar to Question 4, but now we use logistic regression.

1. For each of the four features f1, . . . , f4, drop that feature from both Xtrain and Xtest. Train your logistic regression classifier on the "truncated" Xtrain and predict labels on Xtest using just the 3 remaining features. You will repeat this for 4 cases: (1) just f1 is missing, (2) just f2 is missing, (3) just f3 is missing and (4) just f4 is missing. Compute the accuracy for each of these scenarios.

2. did accuracy increase in any of the 4 cases compared with the accuracy when all 4 features are used?

3. which feature, when removed, contributed the most to the loss of accuracy?

4. which feature, when removed, contributed the least to the loss of accuracy?


5. is the relative significance of the features the same as you obtained using k-NN?
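This mirrors the Question 4 loop with the estimator swapped for logistic regression; the synthetic stand-in data is again an assumption so the sketch runs on its own:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data, as in the earlier sketches
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(2.0, 1.0, (100, 4)),
               rng.normal(-2.0, 1.0, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# drop each feature in turn and retrain on the 3 remaining columns
drop_acc = {}
for j in range(4):
    keep = [c for c in range(4) if c != j]
    clf = LogisticRegression().fit(X_train[:, keep], y_train)
    drop_acc[f"f{j+1}"] = clf.score(X_test[:, keep], y_test)
```

Comparing the ranking of `drop_acc` here with the k-NN ranking from Question 4 answers step 5.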
