Lab 8h Documents

Purpose This lab aims to provide practice with a practical and fun problem involving documents.

To begin, download the file corpus.rkt (with right-click > "Save As"...) and at the top of your file put the line

(require "corpus.rkt")

This gives access to the following document constants:

RAVEN, The Raven by Edgar Allan Poe
AMERICAN-PIE, American Pie by Don McLean
SCARLET-LETTER, the fifth chapter of The Scarlet Letter by Nathaniel Hawthorne
TEN-PLAGUES, the chapters in The Bible related to the ten plagues of Egypt
MATTHIAS, the Wikipedia entry on Matthias Felleisen

It also gives access to the constant CORPUS, a list of all of the above constants. Feel free to use any and all of these, but the files are quite large, so we recommend testing on documents that you define.

Becoming Google

; A Corpus is a [Listof Document]
; A Document is a [Listof String] where every element is an English word
; A Query is a [Listof String] where every element is an English word

Now let’s think what makes a document relevant when someone searches the word "lackadaisical". How often it appears in the document is a good indicator of how relevant it is. In a document about the etymology of the word, for example, the word would appear quite often. In a novel, it may appear once or twice, but someone interested in the word "lackadaisical" would almost definitely not be interested in finding that novel. Thus, term frequency is a good indicator of how relevant a document is to that term.

Exercise 1 Design the function term-frequency, that takes a Document and a String and returns what percentage of the words in the document are equal to the given string. Use count.

; A Query is a [Listof String] where every element is an English word, all lower-case

Now let’s think about what makes a document irrelevant when someone searches for the query (list "a" "storm" "of" "swords"). "storm" and "swords" are certainly unique words and won’t appear in just any document, but "a" and "of" aren’t and would. Thus, how many times a word appears in any document is inversely proportional to how relevant it is to a query.

Exercise 2 Design the function idf, which is short for inverse-document-frequency, that takes a String and a Corpus and outputs the total number of documents divided by the number of documents the given word appears in + 1 (to avoid division by zero).

Exercise 3 Design the function tf-idf, that takes a Query, a Document and the Corpus it belongs to and outputs the sum of the product of each search term’s term-frequency in that document and its inverse-document-frequency.

Exercise 4 Design the function search-engine, that given a Corpus, will output a function that takes a Query and outputs the documents sorted in descending order by the tf-idf applied that query, document, and corpus. Use sort.

Yay! You are Google!

Exercise 5 Challenge: sort uses a function that compares two elements to each other. It has been proven that all comaprison-based sorting functions will compare at least one element x to other elements more than once.
In your search-engine function, documents were compared to each other based on their tf-idf value. For some document, this value was computed more than once. That’s bad!
The problem we’re going to solve is this: sometimes, we want to sort a list of values based on some (possibly slow) computation applied to each one. We definitely don’t want to do this computation on the same element more than once.
Define sort/compute-once, which takes a [List X] lox, an [X -> Y] f, and a [Y Y -> Boolean] y-comp, and sorts lox by y-comp after f is applied to its elements. f should only be computed once on each element, the function should still return a list identical to lox (but with the order changed), and it should use sort.

Mimicking

Now, given a piece of someone’s writing, we’re going to mimick it and produce a sentence (or something resembling a sentence) that they feasibly could have written. To give an intuition as to how we’ll do this, we’re going to use the frequency of how often certain words start sentences, end sentences, and how often a particular word follows another.

; A [Mapping X Y] is a [List [Pair X Y]] and associates
; elements of type X with elements of type Y
; invariant: For any [Mapping X Y] m,
; (map pair-x m) must be a [Set X], e.g.
; every pair in the list has a unique pair-x
(define-struct pair (x y))

; A [Counter X] is a [Mapping X PositiveInt]
; And represents a multiset of elements

; Example, which represents a bag of marbles
; with 2 green marbles and 5 red
(define MARBLE-BAG (list (make-pair 'green 2) (make-pair 'red 5)))

Note that mappings are a ubiquitous data type in computer science. The following two functions are two possible ways to interact with them that are specifically convenient for this lab, but every language will have pre-defined functions on mappings that operate differently in accordance with the style of the language.

Exercise 6 Design the function get, which given a [Mapping X Y] and an X will output the associated Y. If no such pair exists, the function should throw an error.

Exercise 7 Design the function update-mapping, which given a [Mapping X Y] m, an X x, a [Y -> Y] updater, and a Y backup, will output a new [Mapping X Y] identitcal to m, except if x was already in the map, updater will be applied to its associated value, otherwise it will now be associated with backup.

Exercise 8 Design the function add-to-counter, which given an X and a [Counter X], will add 1 to the X’s previously associated value (which, if it wasn’t in the mapping before, was 0).

Exercise 9 Design the function grab-random, which will grab a random element from a non-empty counter. The likelyhood of what element will be chosen is in accordance with the counts of the elements. For example, there is a 2/7th’s chance (grab-random MARBLE-BAG) will be 'green.

These are all the helper functions we’ll need that will allow us to model and imitate someone’s writing style. Of course, we now need to properly define what a writing style is.

; A WritingStyle is a [Mapping String [Counter String]]
; and represents, when an author writes a certain word,
; the distribution of the words that follow it
; Words that end sentences have "." in their counter
; and "" maps to words that start sentences

; Example:
(list
(make-pair "frog" (list (make-pair "." 1)))
(make-pair "" (list (make-pair "cat" 1)))
(make-pair "cat" (list (make-pair "frog" 1) (make-pair "log" 1)))
(make-pair "log" (list (make-pair "cat" 1))))
; represents the writing style of the sentence "cat log cat frog"

; A StructuredDocument is a [Listof Sentence]
; A Sentence is a [Listof String] where every String is an English word

Exercise 10 Design the function add-to-ws, which given a WritingStyle ws, and two Strings first-word and second-word. It will update ws to reflect that first-word was followed by second-word one more time than previously represented in ws.

Exercise 11 Design the function update-ws, which given a Writing Style and a [Listof String], updates the writing style to reflect all consecutive pairs of words with add-to-ws.

Exercise 12 Design the function style, which given a StructuredDocument will output a WritingStyle. At the beginning of every sentence in the structured document, add the empty string to signify the start of a sentence. At the end of every sentence, add the string "." to indicate the end.

Exercise 13 Design the function imitate, which given a WritingStyle, will output a Sentence indicative of the writing style.
The function should be defined with a local function that keeps track of some current-string. When that string is ".", the function should return the empty list. When it is some other string, it should cons that string onto the result of the local function called with a randomly selected word that follows current-string according to the writing style.
The local function should initially be called with the empty string, and the rest of the result should be taken so it does not include the empty string.

(define imitate-style (compose imitate style))

Feel free to use the above function on any of the structured documents provided by "corpus.rkt", such as:

STRUCTURED-RAVEN
STRUCTURED-AMERICAN-PIE
STRUCTURED-SCARLET-LETTER
STRUCTURED-TEN-PLAGUES
STRUCTURED-MATTHIAS

which are the strucuted equivalent of the documents mentioned earlier.

← prev up next →

	General Information
	Course Contract
	Texts
	Advice
	Office Hours, EMail
	Syllabus
	Labs
	Problem Sets
	The Style

	Lab 1h BSL in Dr Racket
	Lab 2h Structures
	Lab 3h The World
	Lab 4h Lists
	Lab 5h Abstraction and natural number recursion
	Lab 6h Loops and Lambda
	Lab 7h Fun With Higher-Order Functions
	Lab 8h Documents
	Lab 9h Generative Recursion and Accumulators
	Lab 10h Chat Client
	Lab 11h Type Inference