1) Download the file SearchFoundations.zip. This Java project [1] (add the folders in the ZIP file to the src/ directory of a new Java project in your favorite IDE) contains the foundations for the large-scale search engine that you will be developing in this course. In this homework you will familiarize yourself with the given source code, and then finish a small part of the missing functionality to complete a basic single-term search engine using a term-document matrix. Begin by reviewing the following classes, which serve as the basic interfaces for this assignment:

• In documents:
  - Document: represents an abstract document with an integer ID and a string title, which can produce a stream of text representing the document's content.
    * FileDocument: extends Document to represent a document built from a single file.
    * TextFileDocument: implements FileDocument by loading the full contents of a simple text file as the contents of the document.
  - DocumentCorpus: abstracts a collection of documents, without specifying where those documents come from or what kinds of documents they are.
    * DirectoryCorpus: a corpus where the documents are loaded from a directory on the local file system. Each document is derived from FileDocument, and the corpus can be configured with instructions on which files to load and how to construct Document objects from them.

• In indexes:
  - Posting: a simple wrapper around an integer document ID.
  - Index: an interface defining the operations of a search engine index. For now, an index provides two simple methods: List<Posting> getPostings(String term), retrieving the postings list for a given term; and List<String> getVocabulary(), returning a sorted list of the index's complete vocabulary.
    * Note the use of List, not ArrayList: the public (code modules that use the Index) should not care whether the postings are stored in an array list, a linked list, or some other implementation. However, the order of the postings does matter; in fact, we will count on the postings being in increasing order when we do AND/OR merges later. List is therefore the appropriate return type from this method.
    * Also note that the index is given a term, not a token, and can assume that the term is in its final processed state.
  - TermDocumentIndex: an implementation of Index using a boolean term-document matrix to store postings. Has some TODO items for you.

• In text:
  - TokenProcessor: an interface defining the capabilities of a token processor, a component that transforms a token (a literal string read from a document) into a term (the processed, normalized equivalent of that string).
  - TokenStream: an interface defining a stream of tokens that is read from the content of a Document.
  - BasicTokenProcessor: an implementation of TokenProcessor that removes all non-alphanumeric characters from a token and lowercases it.
  - EnglishTokenStream: an implementation of TokenStream that breaks a document's content into tokens using whitespace characters.
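
To make the pieces above concrete, here is a minimal, self-contained sketch of how whitespace tokenization, token processing, and a boolean term-document matrix can fit together. It does not use the SearchFoundations classes. The names TermDocumentMatrixSketch, tokenize, and processToken are hypothetical, documents are plain strings whose position in the list serves as the document ID, postings are plain integers rather than Posting objects, and "non-alphanumeric" is assumed to mean anything outside A-Z, a-z, and 0-9.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

public class TermDocumentMatrixSketch {
    private final List<String> vocabulary; // sorted list of distinct terms
    private final boolean[][] matrix;      // matrix[termRow][docId] is true if the doc contains the term

    /** Splits content on whitespace, as EnglishTokenStream is described as doing. */
    public static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Turns a token into a term: drops non-alphanumeric characters (ASCII assumed), then lowercases. */
    public static String processToken(String token) {
        return token.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    /** Builds the matrix; the index of each string in the list is used as its document ID. */
    public TermDocumentMatrixSketch(List<String> documents) {
        TreeSet<String> sortedTerms = new TreeSet<>();
        List<List<String>> termsPerDocument = new ArrayList<>();
        for (String document : documents) {
            List<String> terms = new ArrayList<>();
            for (String token : tokenize(document)) {
                String term = processToken(token);
                if (!term.isEmpty()) {
                    terms.add(term);
                    sortedTerms.add(term);
                }
            }
            termsPerDocument.add(terms);
        }
        vocabulary = new ArrayList<>(sortedTerms);
        matrix = new boolean[vocabulary.size()][documents.size()];
        for (int docId = 0; docId < termsPerDocument.size(); docId++) {
            for (String term : termsPerDocument.get(docId)) {
                // Every term is in the vocabulary, so binarySearch returns its row.
                matrix[Collections.binarySearch(vocabulary, term)][docId] = true;
            }
        }
    }

    /** Returns the IDs of documents containing the term, in increasing order. */
    public List<Integer> getPostings(String term) {
        List<Integer> postings = new ArrayList<>();
        int row = Collections.binarySearch(vocabulary, term);
        if (row < 0) {
            return postings; // term is not in the vocabulary
        }
        for (int docId = 0; docId < matrix[row].length; docId++) {
            if (matrix[row][docId]) {
                postings.add(docId);
            }
        }
        return postings;
    }

    /** Returns the sorted vocabulary as a read-only list. */
    public List<String> getVocabulary() {
        return Collections.unmodifiableList(vocabulary);
    }

    public static void main(String[] args) {
        TermDocumentMatrixSketch index = new TermDocumentMatrixSketch(List.of(
                "Call me Ishmael.",
                "The whale, the Whale!",
                "A ship and a WHALE."));
        System.out.println(index.getVocabulary());
        System.out.println(index.getPostings("whale")); // prints [1, 2]
    }
}
```

Scanning a matrix row from left to right yields document IDs in increasing order without any sorting, which is the ordering the Index notes above rely on for later AND/OR merges.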

[1] There are alternate downloads for Python and C#.

You are now ready to start the assignment.

1. If you haven't already, create a new Java project in your favorite IDE, then extract the files from SearchFoundations.zip into the src/ directory of that project.

2. Download the file MobyDick10Chapters.zip from BeachBoard. This file contains the first 10 chapters of Herman Melville's Moby Dick, each chapter separated into its own .txt file. Extract the ZIP file to the root of your project's directory.

3. Run TermDocumentIndexer.java. The program should print the names of the 10 files as it opens and indexes them into a TermDocumentIndex using a BasicTokenProcessor.

4. Finish TermDocumentIndex.getPostings(). Read the TODO notes in that method and complete it so that it returns a list of postings for the given term instead of an empty list.

5. Run TermDocumentIndexer.java again. You should now see the names of the documents that contain the term whale. Verify that those documents do contain that term.

6. Finish TermDocumentIndexer.main(). Read the TODO notes to expand the application so that the user is prompted for a term to search for, instead of hard-coding the term whale. Keep prompting in a loop until the user enters "quit".
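
The prompt loop in step 6 could take roughly the following shape. This is only a sketch under stated assumptions, not the project's actual code: QueryLoopSketch and runQueryLoop are hypothetical names, and a plain map from term to document titles stands in for the real index and corpus. The user's input is normalized the same way BasicTokenProcessor is described (ASCII assumed), since the index expects a term rather than a raw token.

```java
import java.io.InputStream;
import java.io.PrintStream;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;

public class QueryLoopSketch {
    /**
     * Prompts for terms until the user enters "quit" or the input ends.
     * titlesByTerm is a hypothetical stand-in for an index plus corpus lookup.
     * Returns the number of queries answered.
     */
    public static int runQueryLoop(InputStream in, PrintStream out,
                                   Map<String, List<String>> titlesByTerm) {
        Scanner scanner = new Scanner(in);
        int answered = 0;
        while (true) {
            out.print("Enter a term to search for (or \"quit\"): ");
            if (!scanner.hasNextLine()) {
                break; // input ended without "quit"
            }
            String input = scanner.nextLine().trim();
            if (input.equals("quit")) {
                break;
            }
            // Normalize the raw input into a term before looking it up.
            String term = input.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
            List<String> titles = titlesByTerm.getOrDefault(term, List.of());
            if (titles.isEmpty()) {
                out.println("No documents contain \"" + term + "\".");
            } else {
                for (String title : titles) {
                    out.println("Document " + title);
                }
            }
            answered++;
        }
        return answered;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        Map<String, List<String>> sample = Map.of(
                "whale", List.of("chapter1.txt", "chapter3.txt"));
        runQueryLoop(System.in, System.out, sample);
    }
}
```

Passing the input and output streams as parameters keeps the loop easy to exercise with canned input. In the real application the map lookup would be replaced by a call to the index's getPostings and a lookup of each posting's document in the corpus.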
