Lemur CGI details

Courtesy James Allan http://www-ciir.cs.umass.edu/~allan/
Courtesy Hema Raghavan http://ciir.cs.umass.edu/~hema/

Created: Mon 2 May 2005
Last modified: 

The Lemur CGI Interface is available from: http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?


Q: Why is http://kerf.ccs.neu.edu/ not working?

IMPORTANT: http://kerf.ccs.neu.edu/ is not accessible from "community ports" (we are working on this). Use the NAT address http://10.0.0.204/ instead. However, this may be a permanent problem; as a result, if you want your code to work both from community ports and from other network outlets (including outside the CCIS network) you should:

The instructions about accessing this interface are given below.

To send a command to the CGI, access the above URL followed by a "?", which is followed by the command. The command has the form: Name=value&Name=value&....&Name=value. The Name=value pairs are arguments to the CGI script. They are processed left to right. I'll explain why this is important in the section marked IMPORTANT TIPS below. The CGI program returns a web page that contains a header (everything before the BODY tag), the results, and a footer (everything after the HR tag). The header and footer can be ignored. They are the same for every command, and their only purpose is to identify the software.
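The command format and the page layout described above can be sketched in Python as follows. This is a minimal sketch, not part of the CGI itself: the helper names build_url and strip_header_footer are hypothetical, and treating the last HR tag as the start of the footer is an assumption based on the description above.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi"


def build_url(commands, base_url=BASE_URL):
    """Join ordered (name, value) pairs into a CGI URL.

    A list of tuples is used (not a dict) so that the left-to-right
    order of the commands is explicit. "?" is kept unescaped so that
    a value such as "?" (as in d=?) is sent literally.
    """
    return base_url + "?" + urlencode(commands, safe="?")


def strip_header_footer(page):
    """Return the results: the text after the BODY tag and before the HR tag.

    Assumption: the footer starts at the last HR tag in the page.
    """
    body = re.search(r"<body[^>]*>", page, flags=re.IGNORECASE)
    start = body.end() if body else 0
    hr_tags = list(re.finditer(r"<hr[^>]*>", page, flags=re.IGNORECASE))
    if hr_tags and hr_tags[-1].start() >= start:
        end = hr_tags[-1].start()
    else:
        end = len(page)
    return page[start:end].strip()
```

For example, build_url([("d", "0"), ("c", "star")]) yields the URL ending in "?d=0&c=star". The returned page text (fetched with any HTTP client) can then be passed to strip_header_footer to keep only the results.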

1. To get help on the syntax use the "?h" command as follows http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?h

a. For example, to obtain corpus statistics for the word "stars", the command is http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?c=stars The 'c' command returns the properties of the word following 'c=' (that is, what the stem of the word is and whether or not it is a stopword) and the corpus statistics for that word. The 't' command returns the properties of the word and the corpus statistics of the stem of the word. Use the 't' command only for stemmed databases, and the 'c' command for unstemmed databases, if you want real statistics; otherwise you will get odd effects (read through point 'd' below to understand how to specify databases to the CGI). The 'm' command returns the stem of the word.

b. Documents have both internal and external IDs. External IDs like WSJ890803-0148 are typically a combination of source and date information. To retrieve a document by its external ID, use the "?e=" command. For example, http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?e=AP890101-0001 will return the document AP890101-0001 in SGML format. To retrieve a document in SGML format using the internal ID, use the "?i=" command, for example: http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?i=1

c. Use "?v=" to get the inverted list of a term: http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?v=star For the 'v' command you need to specify the stemmed form of the word (if the index is stemmed). There is a corresponding 'x' command, which stems the word and returns the inverted list of the stemmed word.

d. The database currently consists of documents from TREC disks 1 and 2. Information on the corpora and on TREC can be obtained from the TREC website. Queries and relevance judgements can also be obtained from the TREC website. For each of the above queries, you can specify an index of choice using the "d=" command. http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?d=? lists the available databases or indices. Index 0 has no stopping or stemming done to it. Index 1 has no stopping, but stemming is done using the Krovetz stemmer. Index 2 has no stemming, but stop words are removed using a stop-word list. Index 3 has both stopping and stemming done to it. All 4 indices refer to the same corpus, which is described next. To query the word "star" in index 0 use: http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?d=0&c=star

To get information on the various available databases use http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?d=? This will give you corpus statistics and information about whether the database was stemmed or stopped.
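The choice of command in points a through d depends on whether the selected index is stemmed. The sketch below encodes the four indices listed above and picks 'c' or 't', and 'v' or 'x', accordingly. The function names and the INDEXES table are hypothetical helpers written for illustration; only the command letters and index numbers come from the description above.

```python
from urllib.parse import urlencode

BASE_URL = "http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi"

# Index properties as listed in point d: index -> (stopped, stemmed).
INDEXES = {
    0: (False, False),
    1: (False, True),
    2: (True, False),
    3: (True, True),
}


def _url(pairs):
    # The database id is always placed first, because commands are
    # processed left to right.
    return BASE_URL + "?" + urlencode(pairs)


def term_stats_url(word, index):
    """Use 't' on stemmed indices and 'c' on unstemmed ones."""
    _, stemmed = INDEXES[index]
    command = "t" if stemmed else "c"
    return _url([("d", index), (command, word)])


def inverted_list_url(word, index):
    """Use 'x' (stems the word first) on stemmed indices, 'v' otherwise."""
    _, stemmed = INDEXES[index]
    command = "x" if stemmed else "v"
    return _url([("d", index), (command, word)])


def document_url(doc_id, index):
    """Use 'i' for an integer internal ID and 'e' for an external ID string."""
    command = "i" if isinstance(doc_id, int) else "e"
    return _url([("d", index), (command, doc_id)])
```

For example, term_stats_url("star", 0) reproduces the URL given in point d, ending in "?d=0&c=star".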

2. IMPORTANT TIPS:

  1. Commands are processed left to right. Hence, parameters like database IDs or the length of the ranked list must be specified before the query. For example, if you specify http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?n=5&q=star, you get back the top 5 documents for the query "star". However, http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?q=star&n=5 has little meaning: because the CGI processes commands from left to right, the length of the ranked list would be set only after "q=star" has been processed and its results returned to the client. Similarly, http://kerf.ccs.neu.edu/vip/lemurcgi/lemur.cgi?d=0&c=star returns the statistics of the word "star" in database 0.
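One way to guard against the ordering mistake described above is to reorder the pairs on the client before building the URL. The sketch below is a hypothetical helper, and it assumes that 'd' and 'n' are the only setting-style names; the text names these two as examples, so the list may be incomplete.

```python
# Names that configure later commands, per the tip above: database id
# and ranked-list length. Treating exactly these two as settings is an
# assumption.
SETTINGS = ("d", "n")


def order_commands(pairs):
    """Move setting pairs ahead of all other pairs.

    Relative order is preserved within the settings and within the
    remaining commands.
    """
    settings = [p for p in pairs if p[0] in SETTINGS]
    actions = [p for p in pairs if p[0] not in SETTINGS]
    return settings + actions


def to_query(pairs):
    """Render (name, value) pairs as Name=value&Name=value."""
    return "&".join(f"{name}={value}" for name, value in pairs)
```

With this, to_query(order_commands([("q", "star"), ("n", "5")])) gives "n=5&q=star", the form that returns the top 5 documents.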