Assignment 7
Finish logistic regression for banknotes (Questions 5 and 6) and for the stock market.
For the logistic regression on the banknote dataset, run the following experiment.
Add this (required):
use scaling on your data set
apply logistic regression and look at its coefficients (lr.coef_)
look at the highest coefficient → the corresponding feature is the most important
Q: is this feature the same as the one obtained by "feature elimination"?
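A minimal sketch of this required experiment, assuming the standard layout of the UCI data file (four feature columns followed by the class label, no header row); the column names f1..f4 and the mirror URL path are our own conventions, not prescribed by the assignment:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# load the banknote authentication data (assumed UCI mirror path)
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00267/data_banknote_authentication.txt")
df = pd.read_csv(url, header=None, names=["f1", "f2", "f3", "f4", "class"])

X = df[["f1", "f2", "f3", "f4"]].values
y = df["class"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

# scale the features, fitting the scaler on the training half only
scaler = StandardScaler().fit(X_train)
lr = LogisticRegression().fit(scaler.transform(X_train), y_train)

# one coefficient per feature; the largest one flags the most important feature
print(dict(zip(["f1", "f2", "f3", "f4"], lr.coef_[0])))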
In this assignment, we will implement k-NN and logistic regression classifiers to detect "fake" banknotes and analyze the comparative importance of features in predicting accuracy.
For the dataset, we use the "banknote authentication dataset" from the Machine Learning Repository at UCI: https://archive.ics.uci.edu/ml/datasets/banknote+authentication
Dataset Description: From the website: "This dataset contains 1,372 examples of both fake and real banknotes. Data were extracted from images that were taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400x400 pixels. Due to the object lens and distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were gained. A Wavelet Transform tool was used to extract features from the images."
There are 4 continuous attributes (features) and a class:
1. f1 - variance of wavelet transformed image
2. f2 - skewness of wavelet transformed image
3. f3 - curtosis of wavelet transformed image
4. f4 - entropy of image
5. class (integer)
In other words, assume that you have a machine that examines a banknote and computes 4 attributes (step 1). Then each banknote is examined by a much more expensive machine and/or by human expert(s) and classified as fake or real (step 2). Step 2 is very time-consuming and expensive. You want to build a classifier that would give you good results after step 1 only.
We assume that class 0 are good banknotes. We will use color ”green” or ”+” for legitimate banknotes. Class 1 are assumed to be fake banknotes and we will use color ”red” or ”−” for counterfeit banknotes. These are ”true” labels.
Question 1:
1. load the data into a Pandas dataframe and add a column "color". For class 0 it should contain "green" and for class 1 it should contain "red"
2. for each class and for each feature f1, f2, f3, f4, compute its mean μ() and standard deviation σ(). Round the results to 2 decimal places and summarize them in a table as shown below:
3. examine your table. Are there any obvious patterns in the distribution of banknotes in each class?
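A possible sketch for Question 1, reusing the dataframe df loaded in the earlier snippet:

# add the color column: class 0 -> "green", class 1 -> "red"
df["color"] = df["class"].map({0: "green", 1: "red"})

# per-class mean and standard deviation of each feature, rounded to 2 decimals
stats = df.groupby("class")[["f1", "f2", "f3", "f4"]].agg(["mean", "std"]).round(2)
print(stats)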
Question 2:
1. split your dataset X into training (Xtrain) and testing (Xtest) parts (50/50 split). Using "pairplot" from the seaborn package, plot pairwise relationships in Xtrain separately for class 0 and class 1. Save your results into 2 pdf files "good bills.pdf" and "fake bills.pdf"
2. visually examine your results. Come up with three simple comparisons that you think may be sufficient to detect a fake bill. For example, your classifier may look like this:
# assume you are examining a bill
# with features f_1, f_2, f_3 and f_4;
# your rule may look like this:
if (f_1 > 4) and (f_2 > 8) and (f_4 < 25):
    x = "good"
else:
    x = "fake"
3. apply your simple classifier to Xtest and compute predicted class labels
4. compare your predicted class labels with the true labels in Xtest and compute the following:
(a) TP - true positives (your predicted label is + and the true label is +)
(b) FP - false positives (your predicted label is + but the true label is −)
(c) TN - true negatives (your predicted label is − and the true label is −)
(d) FN - false negatives (your predicted label is − but the true label is +)
(e) TPR = TP/(TP + FN) - true positive rate. This is the fraction of positive labels that you predicted correctly. This is also called sensitivity, recall or hit rate.
(f) TNR = TN/(TN + FP) - true negative rate. This is the fraction of negative labels that you predicted correctly. This is also called specificity or selectivity.
5. summarize your findings in the table as shown below:
6. does your simple classifier give you higher accuracy on identifying "fake" bills or "real" bills? Is your accuracy better than 50% ("coin" flipping)?
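One way to sketch Question 2, reusing df from the first snippet; the threshold rule is a placeholder you would replace with cutoffs read off your own pairplots:

import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split

# 50/50 split of the dataframe rows
train, test = train_test_split(df, test_size=0.5, random_state=1)

# pairwise relationship plots per class, saved as pdf
feats = ["f1", "f2", "f3", "f4"]
sns.pairplot(train[train["class"] == 0], vars=feats).savefig("good bills.pdf")
sns.pairplot(train[train["class"] == 1], vars=feats).savefig("fake bills.pdf")

# illustrative three-comparison rule: predict class 0 ("good") or class 1 ("fake")
pred = np.where((test["f1"] > 0) & (test["f2"] > 0) & (test["f4"] < 0), 0, 1)
true = test["class"].values

# confusion counts, treating good bills (class 0) as the positive class "+"
TP = np.sum((pred == 0) & (true == 0))
FP = np.sum((pred == 0) & (true == 1))
TN = np.sum((pred == 1) & (true == 1))
FN = np.sum((pred == 1) & (true == 0))
accuracy = (TP + TN) / len(true)
TPR = TP / (TP + FN)   # sensitivity / recall
TNR = TN / (TN + FP)   # specificity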
Question 3 (use a k-NN classifier from the sklearn library)
1. take k = 3, 5, 7, 9, 11. For each k, generate Xtrain and Xtest using a 50/50 split as before. Train your k-NN classifier on Xtrain and compute its accuracy on Xtest
2. plot a graph showing the accuracy. On the x-axis you plot k and on the y-axis you plot accuracy. What is the optimal value k∗ of k?
3. use the optimal value k∗ to compute the performance measures and summarize them in the table:
TP | FP | TN | FN | accuracy | TPR | TNR
4. is your k-NN classifier better than your simple classifier for any of the measures from the previous table?
5. consider a bill x that contains the last 4 digits of your BUID as feature values. What is the class label predicted for this bill by your simple classifier? What is the label for this bill predicted by k-NN using the best k∗?
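A sketch for Question 3; the BUID digits below are placeholders for your own:

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = df[["f1", "f2", "f3", "f4"]].values
y = df["class"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

ks = [3, 5, 7, 9, 11]
accs = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accs.append(knn.score(X_test, y_test))

# accuracy vs k; the optimal k* is read off this curve
plt.plot(ks, accs, marker="o")
plt.xlabel("k")
plt.ylabel("accuracy")
plt.savefig("knn_accuracy.pdf")

k_star = ks[accs.index(max(accs))]

# label for a bill made of the last 4 digits of your BUID (placeholder digits)
buid_bill = [[1, 2, 3, 4]]
best_knn = KNeighborsClassifier(n_neighbors=k_star).fit(X_train, y_train)
print(best_knn.predict(buid_bill))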
Question 4: One of the fundamental questions in machine learning is "feature selection". We try to come up with the smallest number of features that still retains good accuracy. The natural question is whether all of the features are important or whether some can be dropped.
1. take your best value k∗. For each of the four features f1, ..., f4, generate new Xtrain and Xtest and drop that feature from both Xtrain and Xtest. Train your classifier on the "truncated" Xtrain and predict labels on Xtest using just the 3 remaining features. You will repeat this for 4 cases: (1) just f1 is missing, (2) just f2 is missing, (3) just f3 is missing and (4) just f4 is missing. Compute the accuracy for each of these scenarios (a sketch follows this list).
2. did accuracy increase in any of the 4 cases compared with the accuracy when all 4 features are used?
3. which feature, when removed, contributed the most to the loss of accuracy?
4. which feature, when removed, contributed the least to the loss of accuracy?
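A generic sketch of the feature-elimination loop; it is written so that the same loop also serves Question 6 by swapping in LogisticRegression for the k-NN classifier:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

features = ["f1", "f2", "f3", "f4"]
for dropped in features:
    kept = [f for f in features if f != dropped]
    X_train, X_test, y_train, y_test = train_test_split(
        df[kept].values, df["class"].values, test_size=0.5, random_state=1)
    # use LogisticRegression() here for Question 6
    clf = KNeighborsClassifier(n_neighbors=k_star).fit(X_train, y_train)
    print("without", dropped, ": accuracy =", round(clf.score(X_test, y_test), 4))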
Question 5 (use a logistic regression classifier from the sklearn library)
1. use a 50/50 split to generate new Xtrain and Xtest. Train your logistic regression classifier on Xtrain and compute its accuracy on Xtest
2. summarize your performance measures in the table:
TP | FP | TN | FN | accuracy | TPR | TNR
3. is your logistic regression better than your simple classifier for any of the measures from the previous table?
4. is your logistic regression better than your k-NN classifier (using the best k∗) for any of the measures from the previous table?
5. consider a bill x that contains the last 4 digits of your BUID as feature values. What is the class label predicted for this bill x by logistic regression? Is it the same label as predicted by k-NN?
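A compact sketch for Question 5 (max_iter is raised as a precaution on unscaled features; the confusion-count code from Question 2 applies unchanged):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[["f1", "f2", "f3", "f4"]].values, df["class"].values,
    test_size=0.5, random_state=1)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", lr.score(X_test, y_test))

# rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, lr.predict(X_test)))

# label for the BUID bill (placeholder digits as before)
print(lr.predict([[1, 2, 3, 4]]))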
Question 6: We will investigate the change in accuracy when removing one feature. This is similar to Question 4, but now we use logistic regression.
1. for each of the four features f1, ..., f4, generate new Xtrain and Xtest and drop that feature from both Xtrain and Xtest. Train your logistic regression classifier on the "truncated" Xtrain and predict labels on the "truncated" Xtest using just the 3 remaining features. You will repeat this for 4 cases: (1) just f1 is missing, (2) just f2 is missing, (3) just f3 is missing and (4) just f4 is missing. Compute the accuracy for each of these scenarios (the sketch after Question 4 applies here with LogisticRegression).
2. did accuracy increase in any of the 4 cases compared with the accuracy when all 4 features are used?
3. which feature, when removed, contributed the most to the loss of accuracy?
4. which feature, when removed, contributed the least to the loss of accuracy?
5. is the relative significance of the features the same as you obtained using k-NN?
In this assignment, we will consider a number of variations of k-NN. We assume that we have N points P1, P2, ..., PN with "green" or "red" labels (classes) in the training set Xtrain. For simplicity, we will confine ourselves to 2-dimensional space (just like your trading labels) and discuss the proposed variations of k-NN using geometric intuition. We will think of features as coordinates - a point Pi has coordinates (xi, yi). Suppose we want to classify a point A with coordinates (a1, a2).
We will consider the following methods (method names are ”unofficial”) to assign a label to this point A.
1. k-NN with Manhattan Distance. In sklearn, the default metric is Euclidean, corresponding to the Minkowski distance with p = 2. Recall that for any two points P1 = (x1, y1) and P2 = (x2, y2) and parameter p > 0, the p-Minkowski distance |P1, P2|_p (or Lp norm) is defined as
|P1, P2|_p = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)
If p = 2 we have the Euclidean (L2) norm.
If p = 1 we have the Manhattan (street, or L1) norm.
The parameter p is one of the parameters that can be specified (just like the number of neighbors k).
2. k-NN with Minkowski p = 1.5. Intuitively, this is between Manhattan and Euclidean.
3. Nearest Centroid: For each class, compute the corresponding "mean" point (i.e. "center of gravity", or centroid) in the training set Xtrain. Let μ(Xtrain_green) and μ(Xtrain_red) be the centroids for each class. For any point A, assign the label of the nearest centroid.
4. Domain Transformation: We map 2-dimensional representation of our points into 3-dimensional space using the following quadratic transformation:
and apply k-NN in the new 3-dimensional space.
5. k-Predicted Neighbors: Find the k nearest neighbors B1, ..., Bk (from the training set) for point A. For each such neighbor Bi, ignore its true label and compute a predicted label based on its own k nearest neighbors from Xtrain. Compute the label for A using the majority of the predicted labels for B1, ..., Bk (as opposed to the majority of the true labels for B1, ..., Bk as in standard k-NN).
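Hedged sketches of how these variants could be set up; the first three map directly onto sklearn estimators, while k-Predicted Neighbors is hand-rolled below as one reading of the definition (labels are assumed encoded as the integers 0 and 1, and X_train, y_train, X_test are the usual train/test arrays):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# 1-2. Minkowski metrics: p=1 is Manhattan, p=1.5 sits between Manhattan and Euclidean
knn_manhattan = KNeighborsClassifier(n_neighbors=5, p=1)
knn_p15 = KNeighborsClassifier(n_neighbors=5, p=1.5)

# 3. nearest centroid
nc = NearestCentroid()

# 5. k-Predicted Neighbors: relabel each of A's k neighbors by its own k-NN vote first
def k_predicted_neighbors(X_train, y_train, X_test, k):
    base = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # predicted (not true) label of every training point; note that predict()
    # on the training set counts each point among its own neighbors
    y_pred_train = base.predict(X_train)
    # majority vote over the predicted labels of each test point's k neighbors
    _, idx = base.kneighbors(X_test, n_neighbors=k)
    votes = y_pred_train[idx]          # shape (n_test, k)
    return np.array([np.bincount(row).argmax() for row in votes])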
QUESTIONS
Question 1 (Manhattan distance p = 1)
take k = 3, 5, 7, 9, 11. For each value of k compute the accuracy of this classifier. On the x-axis you plot k and on the y-axis you plot accuracy. What is the optimal value of k for year 1?
use the optimal value of k from year 1 to predict labels for year 2. What is your accuracy?
using the optimal value for k from year 1, compute the confusion matrix for year 2
is this value of k different from the one you obtained using regular k-NN?
what is the true positive rate (sensitivity or recall) and the true negative rate (specificity) for year 2?
implement a trading strategy based on your labels for year 2 and compare the performance with the ”buy-and-hold” strategy. Which strategy results in a larger amount at the end of the year?
how does this method compare with regular k-NN with the Euclidean distance? Any improvement?
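Every question in this part asks for the same trading comparison; here is one hedged sketch (weekly_returns for year 2 and the predicted weekly labels come from your own stock data, and the $100 starting amount is an assumption):

def final_amount(labels, weekly_returns, start=100.0):
    # invest only in weeks labeled "green"; stay in cash otherwise
    amount = start
    for label, r in zip(labels, weekly_returns):
        if label == "green":
            amount *= (1.0 + r)
    return amount

# buy-and-hold stays invested every week of year 2:
# buy_and_hold = final_amount(["green"] * len(weekly_returns), weekly_returns)
# strategy = final_amount(predicted_labels, weekly_returns)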
Question 2 (Minkowski distance p = 1.5)
take k = 3, 5, 7, 9, 11. For each value of k compute the accuracy of this classifier. On the x-axis you plot k and on the y-axis you plot accuracy. What is the optimal value of k for year 1?
use the optimal value of k from year 1 to predict labels for year 2. What is your accuracy?
using the optimal value for k from year 1, compute the confusion matrix for year 2
is this value of k different from the one you obtained using regular k-NN?
what is the true positive rate (sensitivity or recall) and the true negative rate (specificity) for year 2?
implement a trading strategy based on your labels for year 2 and compare the performance with the ”buy-and-hold” strategy. Which strategy results in a larger amount at the end of the year?
how does this method compare with regular k-NN with Euclidean distance? Any improvement?
Note: For questions 3-7 below, use the Euclidean distance!
Question 3 (Nearest Centroid)
for each label, compute the average and median distance to the "green" and "red" centroids for the points in the training set. We can think of this distance as the average radius of a sphere centered at the centroid. Which sphere is larger (for both the average and median distances)?
what is the true positive rate (sensitivity or recall) and the true negative rate (specificity) for year 2?
implement a trading strategy based on your labels for year 2 and compare the performance with the ”buy-and-hold” strategy. Which strategy results in a larger amount at the end of the year?
how does this method compare with regular k-NN? Any improvement?
Question 4 (Domain Transformation)
take k = 3, 5, 7, 9, 11. For each value of k compute the accuracy of this classifier. On the x-axis you plot k and on the y-axis you plot accuracy. What is the optimal value of k for year 1?
use the optimal value of k from year 1 to predict labels for year 2. What is your accuracy?
using the optimal value for k from year 1, compute the confusion matrix for year 2
is this value of k different from the one you obtained using regular k-NN?
what is the true positive rate (sensitivity or recall) and the true negative rate (specificity) for year 2?
implement a trading strategy based on your labels for year 2 and compare the performance with the ”buy-and-hold” strategy. Which strategy results in a larger amount at the end of the year?
how does this method compare with regular k-NN? Any improvement?
Question 5 (k-Predicted Neighbors)
take k = 3, 5, 7, 9, 11. For each value of k compute the accuracy of this classifier. On the x-axis you plot k and on the y-axis you plot accuracy. What is the optimal value of k for year 1?
use the optimal value of k from year 1 to predict labels for year 2. What is your accuracy?
using the optimal value for k from year 1, compute the confusion matrix for year 2
is this value of k different from the one you obtained using regular k-NN?
what is the true positive rate (sensitivity or recall) and the true negative rate (specificity) for year 2?
implement a trading strategy based on your labels for year 2 and compare the performance with the ”buy-and-hold” strategy. Which strategy results in a larger amount at the end of the year?
how does this method compare with regular k-NN? Any improvement?
Question 6 (k-Hyperplanes)
take k = 3, 5, 7, 9, 11. For each value of k compute the accuracy of this classifier. On the x-axis you plot k and on the y-axis you plot accuracy. What is the optimal value of k for year 1?
what is the true positive rate (sensitivity or recall) and the true negative rate (specificity) for year 2?
implement a trading strategy based on your labels for year 2 and compare the performance with the "buy-and-hold" strategy. Which strategy results in a larger amount at the end of the year?
how does this method compare with regular k-NN? Any improvement?
Question 7.
Summarize the results for regular k-NN and its variations in the table below. Round Accuracy and Amount to integers. Color the largest values in the Accuracy and Amount columns green and the lowest values red. Discuss your findings.
Implement a logistic regression classifier. As before, use the year 1 labels as the training set and predict the year 2 labels. For each week, your feature set is (μ, σ) for that week. Use your labels from year 1 (you will have 52 weekly labels per year) to train your classifier and predict labels for year 2.
Questions:
what is the equation for logistic regression that your classifier found from the year 1 data?
what is the accuracy for year 2?
compute the confusion matrix for year 2
what is the true positive rate (sensitivity or recall) and the true negative rate (specificity) for year 2?
implement a trading strategy based on your labels for year 2 and compare the performance with the ”buy-and-hold” strategy. Which strategy results in a larger amount at the end of the year?
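To read off the fitted equation asked for above, the weights and intercept can be pulled from the trained model; a sketch assuming X_year1 holds the 52 weekly (μ, σ) pairs and y_year1 the 0/1-encoded year 1 labels:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_year1, y_year1)
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
# fitted model: P(label encoded as 1) = 1 / (1 + exp(-(w1*mu + w2*sigma + b)))
print("decision boundary:", w1, "* mu +", w2, "* sigma +", b, "= 0")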