CSG339 06F: Homework 01

Created: Thu 14 Sep 2006
Last modified:

Assigned: Thu 14 Sep 2006
Due: Thu 21 Sep 2006

Problem 1 (20 points)

Courtesy James Allan http://www-ciir.cs.umass.edu/~allan/

Run the following two queries on Google (http://google.com) and on Yahoo (http://yahoo.com) (do not run the explanation; just run the query). You will judge 10 documents for relevance to the query. The explanation will help you decide whether or not something is relevant.

1. gardening wet soil conditions

What special considerations must be made when planting a garden in very wet soil conditions? Are there plants that will not work and/or that will work particularly well? Are there any special techniques that will help reduce the amount of moisture? Only pages that deal with gardening are relevant.

2. oil vs. propane furnace

Looking for pages that list the tradeoffs for using a propane furnace rather than a fuel oil furnace. The pages should provide comparison and are only relevant if they talk about both. They can talk about other types of heating (e.g., electric), provided they talk about oil and propane. Propane is also called LP or LPG (liquid propane [gas]).

When judging pages for relevance, please note the following:

A page is relevant if it is on the appropriate topic. It is not relevant only because it has a link to a page that has relevant content.
Google indents some of the items on its ranked list to indicate that they come from the same domain (roughly) as the preceding entry. Include those entries. There should always be 10 entries per page.
If a page is not found when you follow it, use the cached version. If there is no cached version, make your judgment from the title and snippet. If you cannot make a call with that information, mark it as non-relevant.

To decide which 10 documents to judge, use the last digit of your student ID number

0 to 6 - do the first page (results 1-10)

7 to 9 - do the second page (results 11-20)

Use the following format (one result per line) :

G-or-Y query-number doc's-rank R-or-N doc's-title

---------------------------

where G-or-Y indicates whether you ran this on Google or Yahoo, query-number is 1 or 2 from above, doc's-rank is the number from 1 to 20 (or more, if appropriate) of this document's rank, R-or-N is R if the page is relevant and N otherwise, and doc's-title is the title of the document (to help us verify that results make sense). Since you are judging 10 pages from each of two queries on each of two search engines, the file you email should have 40 non-blank lines.

Problem 2 (20 points)

Courtesy James Allan http://www-ciir.cs.umass.edu/~allan/

A. The following list of R's and N's represents relevant (R) and non-relevant (N) documents in a ranked list of 50 documents. The top of the ranked list is on the left of the list, so that represents the most highly weighted document, the one that the system believes is most likely to be relevant. The list runs across the page to the right and then starts on the next line. This list shows 8 relevant documents. Assume that there are an additional two (2) relevant documents that were not retrieved by this system.

RRNNNRNNNN RNNNRNNNNR RNNNNNNNRN

NNNNNNNNNN NNNNNNNNNN

Based on that list, calculate the following measures:

1. Average precision

2. Precision at 50% recall

3. Precision at 33% recall (interpolated)

4. R-precision

B. Now, Imagine another system retrieves the following ranked list for the same query.

RRRRNNNNNN NNNNNNNNNN NNNNNNNNNN

NNNNNNNNNN NNNNNNNNNR

Repeat parts (A.1), (A.2) and (A.3) for the above ranked list. Compare the two ranked lists on the basis of these 3 metrics that you have computed--i.e., if you were given only these 3 numbers (Mean Average Precision, Precision at 50% recall and Precision at 33% recall) what can you determine about the relative performance of the two systems in general.

C. Plot a recall/precision graph for the above two systems. Generate both an uninter-polated and an interpolated graph (probably as two graphs to make the four plots easier to see). What do the graphs tell you about the system in A and the one in B?

Problem 3 (20 points)

Define 'perfect-retrieval' as the list of ranked documents where

- all the relevant documents are retrieved and

- every relevant document is ranked higher than any non-relevant one.

A.Prove that at a list demonstrates perfect-retrieval if and only if there exists a cutoff=c such that the list has PREC(c)=1 and RECALL(c)=1

B.Consider a particular cutoff c=10

B1. Give example of a list that has PREC(10)=1 and RECALL(10)<1

B2. Give example of a list that has PREC(10)<1 and RECALL(10)=1

C. Prove that a list demonstrates perfect-retrieval if and only if R_PREC=1

D. Prove that a list demonstrates perfect-retrieval if and only if AveragePrecision=1

Problem 4 (20 points)

Consider a typical precision-recall curve starting at A (RECALL=0,PREC=1) and ending at B(RECALL=1,PREC=0)as shown in the plot (1) below.

A.Every point in the precision-recall curve corresponds to a particular rank (or cutoff). Intersect the curve with the main diagonal given by equation y=x in point C (rp,rp). If the query in question has R relevant documents, which rank does this intersection point corresponds to?

B.The quantity rp also represents an important value. Explain which one and why.

C. As shown in plot(2), we will add the segments [AC] and [BC]. We are interested in the area under the precision-recall bluecurve and we can approximate it with the area under the redlines ACB. Compute this approximation area as a function of rp.

D. Explain how the average precision and R-precision measures are thus related .

Problem 5 (5 points)

Courtesy Huahai Yang http://www.albany.edu/~hy973732/

Which of the documents in table 4 will be retrieved given the Boolean query below?

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

Table 1 Term-document weight in Boolean model

	Chaucer	Milton	Shakespeare	Swift
D1	0	0	0	0
D2	0	0	0	1
D3	0	0	1	0
D4	0	0	1	1
D5	0	1	0	0
D6	0	1	0	1
D7	0	1	1	0
D8	0	1	1	1
D9	1	0	0	0
D10	1	0	0	1
D11	1	0	1	0
D12	1	0	1	1
D13	1	1	0	0
D14	1	1	0	1
D15	1	1	1	0
D16	1	1	1	1

Problem 6 (15 points)

Courtesy Huahai Yang http://www.albany.edu/~hy973732/

Given a query Q and a collection of documents A, B and C shown in table 1, rank the similarity of A, B, C with Q using cosine metrics.

The numbers in the table are the weights between each term and document.

Table 2 Term-Document Weights

	Cat	Food	Fancy
Q	3	4	1
A	2	1	0
B	1	3	1
C	0	2	2