1) Download the file SearchFoundations.zip. This Java project¹ (add the folders in the ZIP file to the src/ directory of a new Java project in your favorite IDE) contains the foundations for the large-scale search engine that you will be developing in this course. In this homework you will familiarize yourself with the given source code, and then finish a small part of the missing functionality to complete a basic single-term search engine using a term-document matrix. Begin by reviewing the following classes, which serve as the basic interfaces for this assignment:
• In documents:
- Document: represents an abstract document, with an integer ID, a string title, and can produce a stream of text representing the document's content.
* FileDocument: extends Document to represent a document built from a single file.
* TextFileDocument: implements FileDocument by loading the full contents of a simple text file as the contents of the document.
- DocumentCorpus: abstracts a collection of documents, without specifying where those documents come from or what kinds of documents they are.
* DirectoryCorpus: a corpus where the documents are loaded from a directory on the local file system. Each document is derived from FileDocument, and the corpus can be configured with instructions on which files to load and how to construct Document objects from them.
• In indexes:
- Posting: a simple wrapper around an integer document ID.
- Index: an interface defining the operations of a search engine index. For now, an index provides two simple methods: List<Posting> getPostings(String term), retrieving the postings list for a given term; and List<String> getVocabulary(), returning a sorted list of the index's complete vocabulary.
* Note the use of List, not ArrayList: the public (code modules that use the Index) should not care if the postings are stored in an array list or a linked list or some other implementation; it is unimportant to them. However, the order of the postings does indeed matter; in fact, we will count on the postings being in increasing order when we do AND/OR merges later. List is therefore the appropriate return type from this method.
* Also note that the index is given a term, not a token, and can assume that the term is in its final processed state.
- TermDocumentIndex: an implementation of Index using a boolean term-document matrix to store postings. Has some TODO items for you.
• In text:
- TokenProcessor: an interface defining the capabilities of a token processor, a component that transforms a token (a literal string read from a document) into a term (the processed, normalized equivalent of that string).
- TokenStream: an interface defining a stream of tokens that is read from the content of a Document.
- BasicTokenProcessor: an implementation of TokenProcessor that removes all non-alphanumeric characters and lowercases a token.
- EnglishTokenStream: an implementation of TokenStream that breaks a document's content into tokens using whitespace characters.
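The way these pieces fit together can be sketched in a single self-contained class. This is only an illustration under stated assumptions, not the code in SearchFoundations.zip: the class name TermDocumentMatrixSketch and its methods are hypothetical, documents are plain strings with 0-based IDs, postings are bare Integer IDs rather than Posting objects, and the ASCII-only regular expression is a guess at what "non-alphanumeric" means. It shows whitespace tokenization, token-to-term normalization, a boolean term-document matrix, postings returned in increasing document-ID order, and an AND merge that relies on that order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical stand-in for the course classes; every name here is illustrative.
final class TermDocumentMatrixSketch {
    private final List<String> vocabulary; // sorted, so a term's row is found by binary search
    private final boolean[][] matrix;      // matrix[row][docId] is true if that document contains the term

    // Token -> term: strip non-alphanumeric (ASCII, by assumption) characters, then lowercase.
    static String processToken(String token) {
        return token.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // Split content on whitespace, normalize each token, and drop tokens that become empty.
    static List<String> terms(String content) {
        List<String> result = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            String term = processToken(token);
            if (!term.isEmpty()) {
                result.add(term);
            }
        }
        return result;
    }

    TermDocumentMatrixSketch(List<String> documents) {
        // First pass: collect the sorted vocabulary so the matrix dimensions are known.
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : documents) {
            vocab.addAll(terms(doc));
        }
        vocabulary = new ArrayList<>(vocab);
        matrix = new boolean[vocabulary.size()][documents.size()];
        // Second pass: mark each (term, document) pair that occurs.
        for (int docId = 0; docId < documents.size(); docId++) {
            for (String term : terms(documents.get(docId))) {
                matrix[Collections.binarySearch(vocabulary, term)][docId] = true;
            }
        }
    }

    // Walk the term's row left to right, so IDs come out in increasing order.
    List<Integer> getPostings(String term) {
        List<Integer> postings = new ArrayList<>();
        int row = Collections.binarySearch(vocabulary, term);
        if (row < 0) {
            return postings; // unknown term: empty postings list
        }
        for (int docId = 0; docId < matrix[row].length; docId++) {
            if (matrix[row][docId]) {
                postings.add(docId);
            }
        }
        return postings;
    }

    List<String> getVocabulary() {
        return Collections.unmodifiableList(vocabulary);
    }

    // AND merge of two postings lists; correct only because both are in increasing order.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) { i++; }
            else { j++; }
        }
        return out;
    }

    public static void main(String[] args) {
        TermDocumentMatrixSketch index = new TermDocumentMatrixSketch(Arrays.asList(
                "Call me Ishmael.", "The whale, the WHALE!", "Ishmael saw a whale."));
        System.out.println(index.getVocabulary());        // [a, call, ishmael, me, saw, the, whale]
        System.out.println(index.getPostings("whale"));   // [1, 2]
        System.out.println(intersect(index.getPostings("whale"),
                index.getPostings("ishmael")));           // [2]
    }
}
```

Callers of getPostings and intersect only ever see List, which is the point made in the note on List versus ArrayList: the backing implementation could change without affecting them, while the increasing order is what the merge depends on.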
¹ There are alternate downloads for Python and C#.
You are now ready to start the assignment.
1. If you haven't already, create a new Java project in your favorite IDE, then extract the files from SearchFoundations.zip into the src/ directory of that project.
2. Download the file MobyDick10Chapters.zip from BeachBoard. This file contains the first 10 chapters of Herman Melville's Moby Dick, each chapter separated into its own .txt file. Extract the ZIP file to the root of your project's directory.
3. Run TermDocumentIndexer.java³. The program should print the names of the 10 files as it opens and indexes them into a TermDocumentIndex using a BasicTokenProcessor.
4. Finish TermDocumentIndex.getPostings(). Read the TODO notes in that method and complete it so that a list of postings for the given term is returned, instead of an empty list.
5. Run TermDocumentIndexer.java again. You should now see the names of the documents that contain the term whale. Verify that those documents do contain that term.
6. Finish TermDocumentIndexer.main(). Read the TODO notes to expand the application so that the user is prompted for a term to search for, instead of hard-coding the term whale. Loop the main until the user enters "quit".
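The prompt-and-loop behavior described in step 6 might look roughly like the following sketch. It is an assumption-laden illustration, not the provided main(): QueryLoopSketch, run, and the search function are hypothetical names, and the lookup is a stub standing in for a call such as index.getPostings(...) followed by fetching each document's title from the corpus.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;
import java.util.function.Function;

// Hypothetical sketch of an interactive single-term query loop.
final class QueryLoopSketch {
    // Reads one term per line until "quit" or end of input; returns how many queries were answered.
    static int run(Scanner in, PrintStream out, Function<String, List<String>> search) {
        int answered = 0;
        while (true) {
            out.print("Enter a term to search for (or \"quit\"): ");
            if (!in.hasNextLine()) {
                break; // input closed without an explicit "quit"
            }
            String term = in.nextLine().trim();
            if (term.equals("quit")) {
                break;
            }
            List<String> titles = search.apply(term);
            if (titles.isEmpty()) {
                out.println("No documents contain \"" + term + "\".");
            }
            for (String title : titles) {
                out.println(title);
            }
            answered++;
        }
        return answered;
    }

    public static void main(String[] args) {
        // Stub lookup (assumption): a real program would query the index here.
        Function<String, List<String>> search = term ->
                term.equals("whale")
                        ? Arrays.asList("Document A")
                        : Collections.<String>emptyList();
        run(new Scanner(System.in), System.out, search);
    }
}
```

Because the index expects a term in its final processed state, a real version would likely pass the user's input through the same token processor used at indexing time before looking it up.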
³ TermDocumentIndexer.py in Python