
In this assignment you will create and evaluate a Naive Bayesian classifier

Computer Science

In this assignment you will create and evaluate a Naive Bayesian classifier. Your classifier will learn to assign sentiment to movie reviews.

Make sure you have carefully read Chapter 4 in Jurafsky and Martin and watched the associated videos from Jurafsky and Manning before going any further. See the assigned reading page for links.

You will be writing two programs in the language of your choice. The first (naive-bayes-classifier) will learn a Naive Bayes classifier from training data and then use that classifier to assign sentiment to the reviews in a set of test data. The second (naive-bayes-eval) will evaluate the accuracy of your classifier on that test data.

All of the data you need for this assignment is found in our Google Drive in directory PA-Data in a file named PA3-Pang-Lee.zip. When you unzip this you will find training data in sentiment-train.txt, test data in sentiment-test.txt, and the gold standard correct classifications in sentiment-gold.txt. The training and test files are formatted such that each line consists of a single movie review. The first field in both files is the review id, which is followed by the correct gold-standard sentiment classification (in the training data only). The correct gold-standard sentiment classifications for the test data are found in sentiment-gold.txt.

Your program naive-bayes-classifier must learn a Naive Bayes classifier using (at a minimum) unigram features. This is also known as the bag-of-words feature set. Make sure your program computes and uses the conditional probabilities that make up the definition of the Naive Bayes classifier (see Chapter 4 of Jurafsky and Martin for those details). You must include a command line argument to set a frequency cutoff for your unigram features. You may also, at your option, include other features, although this is not required. If your additional features require some kind of value to be set to find them (like the frequency cutoff for unigrams), then please set that via the command line.
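As one way to approach the unigram feature set, building the vocabulary with a frequency cutoff might look like the sketch below. The input format assumed here (review id, then label, then the review text, whitespace-separated) follows the file description above, but check it against the actual data before relying on it:

```python
from collections import Counter

def build_vocabulary(training_lines, cutoff):
    """Count unigrams across all training reviews and keep only
    those that occur at least `cutoff` times."""
    counts = Counter()
    for line in training_lines:
        # Assumed line format: review_id label word word word ...
        tokens = line.split()[2:]
        counts.update(tokens)
    return {w for w, c in counts.items() if c >= cutoff}

# Toy example (not the real data):
lines = ["r1 1 good good film", "r2 0 bad film"]
vocab = build_vocabulary(lines, 2)
# "good" and "film" each appear twice; "bad" appears once and is cut
```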

 

Your program naive-bayes-classifier will have at least 3 command line arguments: your unigram frequency cutoff (N), the training file (sentiment-train.txt), and the test file (sentiment-test.txt). This program will learn a Naive Bayes classifier from the training data and then assign sentiment values to all of the reviews in the test data (which will be stored in an output file named naive-bayes-answers.txt).

Below are examples of how your programs should be run using the Linux command line. You do not need to use Linux, but your programs should run in the same way, in particular taking all file names and parameter values from the command line. Do not hard code file names or parameter values in your source code.

 

» naive-bayes-classifier 2 sentiment-train.txt sentiment-test.txt > naive-bayes-answers.txt

 

The value of 2 indicates that unigrams that occur 2 or more times will be included as features. You should experiment with various values of this parameter and report results using the one that results in the highest accuracy. Please make it clear in your comments or output which value of N you are reporting results for.

 

You should also report results for a simple sanity check for your Naive Bayes classifier. Suppose there are no features used in your classifier. In this case a Naive Bayes classifier should still work: it will rely on p(class) and act like a majority classifier. You can cause this situation for your classifier by setting N to a very high value (like 10,000), since no unigrams will occur in your training data that many times.
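To see why this sanity check works, note that with an empty feature set the likelihood term is an empty product (equal to 1), so the decision reduces to the class priors alone. A toy sketch (function name and priors are illustrative, not from the assignment):

```python
def predict_no_features(prior_pos, prior_neg):
    # With no surviving features, p(Features|class) = 1 for both classes
    # (an empty product), so argmax over p(class) * p(Features|class)
    # reduces to argmax over the priors: a majority-class classifier.
    return 1 if prior_pos >= prior_neg else 0

# If, say, 55% of training reviews are positive, every test review
# is labelled 1, and accuracy on a balanced test set is near 50%.
label = predict_no_features(0.55, 0.45)
```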

 

Make sure that naive-bayes-classifier does not access the test data until after your classifier is learned, otherwise there is a risk your features or classifier could include information from the test data. This is considered “cheating” in machine learning, since our goal is to evaluate the classifier on data it has never seen before.

 

The output file naive-bayes-answers.txt should be a plain text file that is formatted such that there is one output line for each review in the test file. Each line should include the review id of the movie in the test data, the probability of the features given that the class is positive p(Features | positive), the probability of the features given that the class is negative p(Features | negative), and the class assigned by your classifier to this review. Note that the probability p(Features | Class) should be computed using equation 4.7 from the Jurafsky and Martin text. p(Features | Class) is also known as the likelihood in the Naive Bayes classifier. Each line of output should follow this format:

» review_id p(Features | positive)  p(Features | negative) class

 

For example :

 

» cv666_tok_13320.txt 0.000034 0.0000001 1

 

This line means that the features in review cv666_tok_13320.txt have a probability of 0.000034 if the review is positive, a probability of 0.0000001 if the review is negative, and that this review was assigned to the positive (1) class by our classifier. Note that these values are made up to illustrate the format, and so your values may be of a very different magnitude (smaller or larger).
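One way to compute such a likelihood value is sketched below. Equation 4.7 in Jurafsky and Martin defines the likelihood as a product of per-word conditionals; the add-one (Laplace) smoothing and the log-space computation here are common choices, not requirements stated in the assignment:

```python
import math

def likelihood(tokens, word_counts, total_count, vocab):
    """p(Features | class) as a product of p(w | class) over the
    review's tokens, with add-one (Laplace) smoothing over the
    vocabulary. Computed in log space to avoid underflow, then
    exponentiated for the small probabilities the output asks for."""
    log_p = 0.0
    for w in tokens:
        if w not in vocab:          # skip words outside the feature set
            continue
        p = (word_counts.get(w, 0) + 1) / (total_count + len(vocab))
        log_p += math.log(p)
    return math.exp(log_p)

# Toy model: a two-word vocabulary and made-up class counts
vocab = {"good", "bad"}
p_pos = likelihood(["good", "good"], {"good": 3, "bad": 1}, 4, vocab)
p_neg = likelihood(["good", "good"], {"good": 1, "bad": 3}, 4, vocab)
label = 1 if p_pos > p_neg else 0   # class priors omitted in this toy
```

For very long reviews the exponentiated value can underflow to 0.0; comparing the log values directly for classification avoids that.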

 

Your second program, naive-bayes-eval, will have two input files: your answer file (naive-bayes-answers.txt) and the gold standard answers (sentiment-gold.txt). This program will compare each of your system answers to the gold standard, and should output the review id, the gold answer, and the system answer (on a single line). Then, your program should report the accuracy, precision, and recall of your classifier, and the number of true positives, false positives, true negatives, and false negatives. These values should be written one per line at the end of this file, which should be called naive-bayes-answers-scored.txt.

 

» naive-bayes-eval sentiment-gold.txt naive-bayes-answers.txt > naive-bayes-answers-scored.txt
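The counts and metrics naive-bayes-eval must report can be tallied as in this sketch, which treats class 1 as the positive class (the function name and the toy label lists are illustrative only):

```python
def score(gold, predicted):
    """Compare gold and system labels (parallel lists of 0/1) and
    return the counts and metrics the assignment asks for."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    tn = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 0)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    accuracy = (tp + tn) / len(gold)
    # Guard against division by zero when a class is never predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, tn, fn, accuracy, precision, recall

tp, fp, tn, fn, acc, prec, rec = score([1, 1, 0, 0], [1, 0, 1, 0])
# tp=1, fp=1, tn=1, fn=1 -> accuracy 0.5, precision 0.5, recall 0.5
```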

 

Both of your programs should be documented according to the standards of the programming assignment rubric. Remember that each program should have an overall introductory comment at the start and then detailed comments throughout the source code, as described in the programming assignment grading rubric. You will have a variety of decisions to make in the design of this program; please make sure to clearly explain them in your source code comments. Remember, assume that the reader of your program is not too familiar with the language you have used.

 

 

Please do not submit screenshots or cut-and-paste from terminal windows. Instead, export your source code to PDF and print your output to PDF.

Your source code should have line numbers. Both source code and output should be on a white background.

 

You may use code from libraries, but do not use any libraries or pre-existing code that are specific to NLP or Machine Learning, in particular those that carry out Naive Bayesian classification or classifier evaluation. Any regular expressions you use for text normalization, pre-processing, tokenization, and sentence boundary detection must be of your own creation.
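A self-written tokenizer can be quite simple; the pattern below is only an illustration of the kind of thing you might write yourself, not a required or recommended design:

```python
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters (apostrophes
    allowed, so contractions stay whole); punctuation and digits are
    discarded. A deliberately simple, hand-written pattern."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("It's a GREAT movie, 10/10!")
# -> ["it's", "a", "great", "movie"]
```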

 

If you find yourself searching for NLP or Machine Learning specific code, you should stop yourself and allow your own ideas to develop. If you are copying code specific to this assignment that you or a member of your team didn't write, please stop yourself. All of your NLP and Machine Learning code should be original to you and your team.

 

The specific things I will be looking for when grading the functionality of your program are:

 

» 1 point — naive-bayes-classifier uses unigrams with a frequency cutoff

 

» 1 point — naive-bayes-classifier reports probabilities that are plausible and correspond to predicted class

 

» 1 point — naive-bayes-classifier correctly handles sanity case with N set to 10,000 (accuracy should be approximately 50%) (this should happen naturally and not be handled via a special case)

 

» 1 point — naive-bayes-eval reports accuracy, precision, recall, and counts of true positives, false positives, true negatives, and false negatives.

 

» 1 point — accuracy reaches at least 60% for the best N result.

 
