d1: for english language model retrieval have a relevance model while vectors space model retrieval don't

d2: R-precision measure is relevant to average precision measure

d3: most efficient retrieval models are language model and vector space model

d4: english is the most efficient language

d5: retrieval efficiency is measured by average precision

TERM – DOCUMENT TABLE (ignoring stopwords)

 

                  d1    d2    d3    d4    d5       term in corpus

english           1                 1                 2    

language          1           1     1                 3          

model             3           3                       6

retrieval         2           1           1           4

relevance         1     1                 1           3

vector            1           1                       2    

space             1           1                       2

R                       1                             1

most                          1     1                 2

efficient                           1     1           2

measure                 2                 1           3

average                 1                 1           2

precision               2                 1           3

 

DOC LENGTH        10    7     8     4     6           35

==============================================================================

T=35

D=5

U=13

avg_doc_length= 35/5=7
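
The corpus statistics above can be re-derived from the term-document table. A minimal sketch, assuming the table is entered by hand as a nested dict (the names COUNTS, doc_length and corpus_freq are illustrative, not part of the notes):

```python
# Term counts copied from the term-document table (a missing key means 0).
COUNTS = {
    "english":   {"d1": 1, "d4": 1},
    "language":  {"d1": 1, "d3": 1, "d4": 1},
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "relevance": {"d1": 1, "d2": 1, "d5": 1},
    "vector":    {"d1": 1, "d3": 1},
    "space":     {"d1": 1, "d3": 1},
    "R":         {"d2": 1},
    "most":      {"d3": 1, "d4": 1},
    "efficient": {"d4": 1, "d5": 1},
    "measure":   {"d2": 2, "d5": 1},
    "average":   {"d2": 1, "d5": 1},
    "precision": {"d2": 2, "d5": 1},
}
DOCS = ["d1", "d2", "d3", "d4", "d5"]

# Document length = column sum; corpus frequency of a term = row sum.
doc_length = {d: sum(row.get(d, 0) for row in COUNTS.values()) for d in DOCS}
corpus_freq = {t: sum(row.values()) for t, row in COUNTS.items()}

T = sum(doc_length.values())   # total term occurrences in the corpus
D = len(DOCS)                  # number of documents
U = len(COUNTS)                # number of unique terms
avg_doc_length = T / D

print(doc_length)
print(f"T={T} D={D} U={U} avg_doc_length={avg_doc_length}")
```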

==============================================================================

QUERY

"efficient retrieval model efficient"

 

==============================================================================

RAWTF, binary QUERY  WEIGHTS

                  d1    d2    d3    d4    d5    QUERY

model             3           3                 1

retrieval         2           1           1     1

efficient                           1     1     2

                  5     0     4     1     2
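
Raw-TF scoring is just a sum of document term counts over the query terms. A small sketch, assuming binary query weights mean "each distinct query term counts once" and TF query weights mean "a repeated query term counts once per occurrence" (the function name raw_tf_score is hypothetical):

```python
# Rows of the term-document table for the three query terms (missing key = 0).
TF = {
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "efficient": {"d4": 1, "d5": 1},
}
DOCS = ["d1", "d2", "d3", "d4", "d5"]
QUERY = "efficient retrieval model efficient".split()

def raw_tf_score(doc, binary_query=True):
    """Sum of raw document term frequencies over the query terms."""
    # Binary weights: each distinct query term contributes once.
    # TF weights: a term repeated in the query contributes once per occurrence.
    terms = set(QUERY) if binary_query else QUERY
    return sum(TF[t].get(doc, 0) for t in terms)

binary_scores = {d: raw_tf_score(d) for d in DOCS}
tf_query_scores = {d: raw_tf_score(d, binary_query=False) for d in DOCS}
print(binary_scores)
print(tf_query_scores)
```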

==============================================================================

ROBERTSON TF, K=1

binary QUERY  WEIGHTS

                  d1    d2    d3    d4    d5    QUERY

model             3/4         3/4               1

retrieval         2/3         1/2         1/2   1

efficient                           1/2   1/2   2

                  17/12 0     5/4   1/2   1
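
The Robertson TF entries are tf/(tf+K). A sketch with exact fractions so the sums can be compared to the table directly (the names are illustrative; binary query weights are assumed, as in the heading):

```python
from fractions import Fraction

TF = {
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "efficient": {"d4": 1, "d5": 1},
}
DOCS = ["d1", "d2", "d3", "d4", "d5"]
QUERY_TERMS = ["model", "retrieval", "efficient"]  # binary: each distinct term once

def robertson_tf(tf, k=1):
    """Saturating term frequency tf / (tf + k); k is an integer so Fraction stays exact."""
    return Fraction(tf, tf + k)

scores = {
    d: sum(robertson_tf(TF[t].get(d, 0)) for t in QUERY_TERMS)
    for d in DOCS
}
for d in DOCS:
    print(d, scores[d])
```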

==============================================================================

OKAPITF, K=0.5

binary QUERY  WEIGHTS

                  d1    d2    d3    d4    d5    QUERY

model             0.53        0.57              1

retrieval         0.43        0.31        0.36  1

efficient                           0.42  0.36  2

                  0.96  0     0.88  0.42  0.72

==============================================================================

OKAPITF, TF QUERY 

                  d1    d2    d3    d4    d5    QUERY

model             0.53        0.57              1

retrieval         0.43        0.31        0.36  1

efficient                           0.42  0.36  2

                  0.96  0     0.88  0.84  1.08
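
The notes do not spell out the Okapi TF formula. The table entries are consistent with tf / (tf + K + 1.5 * doc_length / avg_doc_length), so the sketch below ASSUMES that form; treat it as an inference from the numbers, not as the definition used by the author. Because the table adds up entries already cut to two decimals, the last digit of a sum can differ slightly from the unrounded value.

```python
TF = {
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "efficient": {"d4": 1, "d5": 1},
}
DOC_LENGTH = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
AVG_LEN = 7.0
QUERY_TF = {"model": 1, "retrieval": 1, "efficient": 2}

def okapi_tf(tf, doc_len, k=0.5, avg_len=AVG_LEN):
    """ASSUMED Okapi TF: tf / (tf + k + 1.5 * doc_len / avg_len)."""
    if tf == 0:
        return 0.0
    return tf / (tf + k + 1.5 * doc_len / avg_len)

def score(doc, binary_query):
    """Sum of query weight * Okapi TF over the distinct query terms."""
    total = 0.0
    for term, qtf in QUERY_TF.items():
        weight = 1 if binary_query else qtf
        total += weight * okapi_tf(TF[term].get(doc, 0), DOC_LENGTH[doc])
    return total

for d in DOC_LENGTH:
    print(d, round(score(d, True), 3), round(score(d, False), 3))
```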

==============================================================================

IDF WEIGHTS (log base 2, idf = log(D/df))

english           log(5/2)=1.32

language          log(5/3)=0.73

model             log(5/2)=1.32

retrieval         log(5/3)=0.73

relevance         log(5/3)=0.73

vector            log(5/2)=1.32

space             log(5/2)=1.32

R                 log(5/1)=2.32

most              log(5/2)=1.32

efficient         log(5/2)=1.32

measure           log(5/2)=1.32

average           log(5/2)=1.32

precision         log(5/2)=1.32
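
The IDF values can be reproduced from the document frequencies. A sketch assuming idf = log2(D / df), which matches the listed numbers (for example log2(5/1) = 2.32); the two-decimal values in the list appear to be truncated rather than rounded:

```python
import math

# Term counts from the term-document table (missing key = 0).
COUNTS = {
    "english":   {"d1": 1, "d4": 1},
    "language":  {"d1": 1, "d3": 1, "d4": 1},
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "relevance": {"d1": 1, "d2": 1, "d5": 1},
    "vector":    {"d1": 1, "d3": 1},
    "space":     {"d1": 1, "d3": 1},
    "R":         {"d2": 1},
    "most":      {"d3": 1, "d4": 1},
    "efficient": {"d4": 1, "d5": 1},
    "measure":   {"d2": 2, "d5": 1},
    "average":   {"d2": 1, "d5": 1},
    "precision": {"d2": 2, "d5": 1},
}
D = 5  # number of documents

# Document frequency = number of documents in which the term occurs.
df = {term: len(row) for term, row in COUNTS.items()}
idf = {term: math.log2(D / n) for term, n in df.items()}

for term in COUNTS:
    print(f"{term:<10} df={df[term]}  idf={idf[term]:.3f}")
```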

==============================================================================

OKAPITF*IDF, TF QUERY 

            d1          d2    d3          d4          d5          QUERY

model       0.53*1.32   0     0.57*1.32   0           0           1

retrieval   0.43*0.73   0     0.31*0.73   0           0.36*0.73   1

efficient   0           0     0           0.42*1.32   0.36*1.32   2

            1.01        0     0.97        1.10        1.21
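
Combining the two weights gives score(d) = sum over query terms of query_tf * OkapiTF * IDF. The sketch below again ASSUMES the Okapi form tf / (tf + 0.5 + 1.5 * doc_length / avg_doc_length) and base-2 IDF; it works with unrounded values, so results can differ from the table in the second decimal.

```python
import math

TF = {
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "efficient": {"d4": 1, "d5": 1},
}
DOC_LENGTH = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
AVG_LEN = 7.0
D = 5
QUERY_TF = {"model": 1, "retrieval": 1, "efficient": 2}

def okapi_tf(tf, doc_len, k=0.5):
    """ASSUMED Okapi TF form, inferred from the table values."""
    return 0.0 if tf == 0 else tf / (tf + k + 1.5 * doc_len / AVG_LEN)

def idf(term):
    """Base-2 inverse document frequency."""
    return math.log2(D / len(TF[term]))

def score(doc):
    return sum(
        qtf * okapi_tf(TF[t].get(doc, 0), DOC_LENGTH[doc]) * idf(t)
        for t, qtf in QUERY_TF.items()
    )

scores = {d: score(d) for d in DOC_LENGTH}
ranking = sorted(scores, key=scores.get, reverse=True)
print({d: round(s, 3) for d, s in scores.items()})
print(ranking)
```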

==============================================================================

COSINE SIMILARITY

==============================================================================

LANG MODEL MAX-LIKELIHOOD ESTIMATE, document model

                  d1    d2    d3    d4    d5   

english           1/10              1/4                    

language          1/10        1/8   1/4                          

model             3/10        3/8                    

retrieval         2/10        1/8         1/6        

relevance         1/10  1/7               1/6        

vector            1/10        1/8                          

space             1/10        1/8                    

R                       1/7                          

most                          1/8   1/4              

efficient                           1/4   1/6        

measure                 2/7               1/6        

average                 1/7               1/6        

precision               2/7               1/6        

 

QUERY-LIKELIHOOD  0     0     0     0     0                
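
Every query likelihood is 0 here because each document is missing at least one query term, and a single zero probability wipes out the product. A minimal sketch of the unsmoothed estimate (names are illustrative):

```python
from fractions import Fraction

TF = {
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "efficient": {"d4": 1, "d5": 1},
}
DOC_LENGTH = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
QUERY = "efficient retrieval model efficient".split()

def ml_prob(term, doc):
    """Maximum-likelihood estimate P(term | doc) = tf / doc length."""
    return Fraction(TF[term].get(doc, 0), DOC_LENGTH[doc])

def query_likelihood(doc):
    """Product of P(term | doc) over all query term occurrences."""
    p = Fraction(1)
    for term in QUERY:
        p *= ml_prob(term, doc)
    return p

likelihoods = {d: query_likelihood(d) for d in DOC_LENGTH}
print(likelihoods)
```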

==============================================================================

LANG MODEL MAX-LIKELIHOOD+LAPLACE ESTIMATE, document model

                  d1          d2          d3          d4          d5   

english           2/23        1/20        1/21        2/17        1/19             

language          2/23        1/20        2/21        2/17        1/19                   

model             4/23        1/20        4/21        1/17        1/19       

retrieval         3/23        1/20        2/21        1/17        2/19       

relevance         2/23        2/20        1/21        1/17        2/19       

vector            2/23        1/20        2/21        1/17        1/19             

space             2/23        1/20        2/21        1/17        1/19       

R                 1/23        2/20        1/21        1/17        1/19

most              1/23        1/20        2/21        2/17        1/19       

efficient         1/23        1/20        1/21        2/17        2/19       

measure           1/23        3/20        1/21        1/17        2/19       

average           1/23        2/20        1/21        1/17        2/19       

precision         1/23        3/20        1/21        1/17        2/19       

 

QUERY-LIKELIHOOD  12/23^4     1/20^4      8/21^4      4/17^4      8/19^4
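
With Laplace (add-one) smoothing every entry becomes (tf + 1) / (doc_length + U), so no query likelihood is zero. A sketch using exact fractions; the query is treated as four term occurrences, so "efficient" contributes twice:

```python
from fractions import Fraction

TF = {
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "efficient": {"d4": 1, "d5": 1},
}
DOC_LENGTH = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
U = 13  # vocabulary size (unique terms in the corpus)
QUERY = "efficient retrieval model efficient".split()

def laplace_prob(term, doc):
    """Add-one smoothed estimate (tf + 1) / (doc length + U)."""
    return Fraction(TF[term].get(doc, 0) + 1, DOC_LENGTH[doc] + U)

def query_likelihood(doc):
    """Product over all four query term occurrences."""
    p = Fraction(1)
    for term in QUERY:
        p *= laplace_prob(term, doc)
    return p

likelihoods = {d: query_likelihood(d) for d in DOC_LENGTH}
for d, p in likelihoods.items():
    print(d, p, float(p))
```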

==============================================================================

LANG MODEL MAX-LIKELIHOOD ESTIMATE, JELINEK-MERCER SMOOTHING document model

How to get the background probability? Two ways; the discussion below uses the second one (B).

                  A. corpus ML estimate         B. average doc ML probabs

english           2/35                          (1/10 + 1/4)/5 = 0.07

language          3/35                          (1/10 + 1/8 + 1/4)/5 =0.09

model             6/35                          (3/10 + 3/8)/5 =0.13

retrieval         4/35                          (2/10 + 1/8 + 1/6)/5 = 0.1

relevance         3/35

vector            2/35

space             2/35

R                 1/35

most              2/35

efficient         2/35                          (1/4 + 1/6)/5= 0.08

measure           3/35

average           2/35

precision         3/35

 

USE L = lambda = 0.8 (weight on the background model; the document model gets 1 - lambda = 0.2)

                  d1                     

english           .2*1/10 + .8*0.07

language          .2*1/10 + .8*0.09

model             .2*3/10 + .8*0.13 = 0.16           

retrieval         .2*2/10 + .8*0.1  = 0.12           

relevance         .2*1/10 + .8*(1/10 + 1/7 + 1/6)/5

vector            .2*1/10 + .8*(1/10 + 1/8)/5

space             .2*1/10 + .8*(1/10 + 1/8)/5

R                                  

most                               

efficient         0.8 * 0.08        = 0.06     

measure                            

average                            

precision                          

 

QUERY-LIKELIHOOD  for d1: 0.16 * 0.12 * 0.06^2             
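
A sketch of the Jelinek-Mercer mixture for the query terms, assuming background option B (the average of the per-document ML probabilities) and lambda = 0.8 as the weight on the background. It keeps unrounded values, so the numbers differ a little from the two-decimal figures used above (for example 0.168 instead of 0.16 for "model" in d1). Function names are illustrative.

```python
TF = {
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "efficient": {"d4": 1, "d5": 1},
}
DOC_LENGTH = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
DOCS = list(DOC_LENGTH)
QUERY = "efficient retrieval model efficient".split()
LAMBDA = 0.8  # weight on the background model

def ml_prob(term, doc):
    return TF[term].get(doc, 0) / DOC_LENGTH[doc]

def background(term):
    """Option B: average of the document ML probabilities."""
    return sum(ml_prob(term, d) for d in DOCS) / len(DOCS)

def jm_prob(term, doc):
    """(1 - lambda) * P_ml(term | doc) + lambda * P_background(term)."""
    return (1 - LAMBDA) * ml_prob(term, doc) + LAMBDA * background(term)

def query_likelihood(doc):
    p = 1.0
    for term in QUERY:
        p *= jm_prob(term, doc)
    return p

for d in DOCS:
    print(d, f"{query_likelihood(d):.3e}")
```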

 

==============================================================================

LANG MODEL MAX-LIKELIHOOD ESTIMATE, WITTEN BELL SMOOTHING document model

                  A. corpus ML estimate         B. average doc ML probabs

english           2/35                          (1/10 + 1/4)/5 = 0.07

language          3/35                          (1/10 + 1/8 + 1/4)/5 =0.09

model             6/35                          (3/10 + 3/8)/5 =0.13

retrieval         4/35                          (2/10+1/8 +1/6)/5=0.1

relevance         3/35

vector            2/35

space             2/35

R                 1/35

most              2/35

efficient         2/35                          (1/4 + 1/6)/5= 0.08

measure           3/35

average           2/35

precision         3/35

 

                  d1         

N                 10         (length of d1)

V                 7          (unique terms in d1)

english           10/17*1/10 + 7/17*0.07

language          10/17*1/10 + 7/17*0.09

model             10/17*3/10 + 7/17*0.13  =0.23      

retrieval         10/17*2/10 + 7/17*0.1   =0.16      

relevance         10/17*1/10 + 7/17*(1/10 + 1/7 + 1/6)/5

vector            10/17*1/10 + 7/17*(1/10 + 1/8)/5

space             10/17*1/10 + 7/17*(1/10 + 1/8)/5

R                                  

most                               

efficient         7/17 * 0.08       =0.03

measure                            

average                            

precision                          

 

QUERY-LIKELIHOOD  for d1: 0.23 * 0.16 * 0.03^2             
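
In the Witten-Bell computation above, the mixing weights depend on the document: N/(N+V) on the document model and V/(N+V) on the background, with N the document length and V its number of unique terms. A sketch under the same assumption as before (background option B, unrounded values; names are illustrative):

```python
COUNTS = {
    "english":   {"d1": 1, "d4": 1},
    "language":  {"d1": 1, "d3": 1, "d4": 1},
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "relevance": {"d1": 1, "d2": 1, "d5": 1},
    "vector":    {"d1": 1, "d3": 1},
    "space":     {"d1": 1, "d3": 1},
    "R":         {"d2": 1},
    "most":      {"d3": 1, "d4": 1},
    "efficient": {"d4": 1, "d5": 1},
    "measure":   {"d2": 2, "d5": 1},
    "average":   {"d2": 1, "d5": 1},
    "precision": {"d2": 2, "d5": 1},
}
DOCS = ["d1", "d2", "d3", "d4", "d5"]
QUERY = "efficient retrieval model efficient".split()

# N = document length, V = number of unique terms in the document.
N = {d: sum(row.get(d, 0) for row in COUNTS.values()) for d in DOCS}
V = {d: sum(1 for row in COUNTS.values() if d in row) for d in DOCS}

def ml_prob(term, doc):
    return COUNTS[term].get(doc, 0) / N[doc]

def background(term):
    """Option B: average of the document ML probabilities."""
    return sum(ml_prob(term, d) for d in DOCS) / len(DOCS)

def witten_bell_prob(term, doc):
    n, v = N[doc], V[doc]
    return n / (n + v) * ml_prob(term, doc) + v / (n + v) * background(term)

def query_likelihood(doc):
    p = 1.0
    for term in QUERY:
        p *= witten_bell_prob(term, doc)
    return p

for d in DOCS:
    print(d, N[d], V[d], f"{query_likelihood(d):.3e}")
```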

                 

 

==============================================================================

LANG MODEL MAX-LIKELIHOOD+LAPLACE , query model

QUERY MODEL

english           1/17

language          1/17

model             2/17       

retrieval         2/17       

relevance         1/17

vector            1/17

space             1/17

R                 1/17

most              1/17

efficient         3/17

measure           1/17

average           1/17

precision         1/17

 

                  d1    d2    d3    d4    d5       term in corpus

english           1                 1                 2    

language          1           1     1                 3          

model             3           3                       6

retrieval         2           1           1           4

relevance         1     1                 1           3

vector            1           1                       2    

space             1           1                       2

R                       1                             1

most                          1     1                 2

efficient                           1     1           2

measure                 2                 1           3

average                 1                 1           2

precision               2                 1           3

           

DOC-LIKELIHOOD for d1: 1/17 * 1/17 * (2/17)^3 * (2/17)^2 * 1/17 * 1/17 * 1/17                        

PROBLEM: the score must be normalized by doc length; otherwise long docs get a very low score.
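
The length problem can be seen numerically. The sketch below computes the document likelihood under the Laplace-smoothed query model and then one possible normalization, the geometric mean per term occurrence (likelihood raised to 1/doc_length). That particular normalization is an ASSUMPTION chosen for illustration; the notes only say that some length normalization is needed.

```python
from fractions import Fraction

COUNTS = {
    "english":   {"d1": 1, "d4": 1},
    "language":  {"d1": 1, "d3": 1, "d4": 1},
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "relevance": {"d1": 1, "d2": 1, "d5": 1},
    "vector":    {"d1": 1, "d3": 1},
    "space":     {"d1": 1, "d3": 1},
    "R":         {"d2": 1},
    "most":      {"d3": 1, "d4": 1},
    "efficient": {"d4": 1, "d5": 1},
    "measure":   {"d2": 2, "d5": 1},
    "average":   {"d2": 1, "d5": 1},
    "precision": {"d2": 2, "d5": 1},
}
DOCS = ["d1", "d2", "d3", "d4", "d5"]
QUERY_TF = {"efficient": 2, "retrieval": 1, "model": 1}
QUERY_LEN = sum(QUERY_TF.values())  # 4
U = len(COUNTS)                     # 13

def query_model(term):
    """Laplace-smoothed P(term | query) = (query tf + 1) / (query length + U)."""
    return Fraction(QUERY_TF.get(term, 0) + 1, QUERY_LEN + U)

def doc_likelihood(doc):
    """Probability that the query model generates every term occurrence in doc."""
    p = Fraction(1)
    for term, row in COUNTS.items():
        p *= query_model(term) ** row.get(doc, 0)
    return p

def normalized_likelihood(doc):
    """ASSUMED normalization: geometric mean per term occurrence."""
    length = sum(row.get(doc, 0) for row in COUNTS.values())
    return float(doc_likelihood(doc)) ** (1.0 / length)

for d in DOCS:
    print(d, float(doc_likelihood(d)), round(normalized_likelihood(d), 4))
```

On this corpus the raw likelihood of d1 (10 terms) is far below that of d4 (4 terms), while the per-term geometric mean puts d1 ahead.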

==============================================================================

KL DISTANCE model comparison. LAPLACE SMOOTHING

            d1          d2          d3          d4          d5    QUERY

english     2/23        1/20        1/21        2/17        1/19  1/17 

language    2/23        1/20        2/21        2/17        1/19  1/17 

model       4/23        1/20        4/21        1/17        1/19  2/17

retrieval   3/23        1/20        2/21        1/17        2/19  2/17

relevance   2/23        2/20        1/21        1/17        2/19  1/17

vector      2/23        1/20        2/21        1/17        1/19  1/17

space       2/23        1/20        2/21        1/17        1/19  1/17

R           1/23        2/20        1/21        1/17        1/19  1/17

most        1/23        1/20        2/21        2/17        1/19  1/17

efficient   1/23        1/20        1/21        2/17        2/19  3/17

measure     1/23        3/20        1/21        1/17        2/19  1/17

average     1/23        2/20        1/21        1/17        2/19  1/17

precision   1/23        3/20        1/21        1/17        2/19  1/17

 

FOR d1 : distance = - 1/17*log(2/23) - 1/17*log(2/23) - 2/17*log(4/23) - 2/17*log(3/23).....

 

FOR d2 : distance = - 1/17*log(1/20) - 1/17*log(1/20) - 2/17*log(1/20) - 2/17*log(1/20).....

 

FOR d3 : distance = - 1/17*log(1/21) - 1/17*log(2/21) - 2/17*log(4/21) - 2/17*log(2/21).....
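
The distances written out above are sums of -q(w) * log d(w) over the vocabulary, which is the cross-entropy of the query model against the document model. It differs from the KL divergence only by the entropy of the query model, a constant across documents, so both give the same ranking. A sketch, assuming base-2 logs (the base is not stated in the notes and does not affect the ranking):

```python
import math

COUNTS = {
    "english":   {"d1": 1, "d4": 1},
    "language":  {"d1": 1, "d3": 1, "d4": 1},
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "relevance": {"d1": 1, "d2": 1, "d5": 1},
    "vector":    {"d1": 1, "d3": 1},
    "space":     {"d1": 1, "d3": 1},
    "R":         {"d2": 1},
    "most":      {"d3": 1, "d4": 1},
    "efficient": {"d4": 1, "d5": 1},
    "measure":   {"d2": 2, "d5": 1},
    "average":   {"d2": 1, "d5": 1},
    "precision": {"d2": 2, "d5": 1},
}
DOCS = ["d1", "d2", "d3", "d4", "d5"]
QUERY_TF = {"efficient": 2, "retrieval": 1, "model": 1}
U = len(COUNTS)

def laplace_model(tf_by_term):
    """Add-one smoothed distribution over the whole vocabulary."""
    total = sum(tf_by_term.values())
    return {t: (tf_by_term.get(t, 0) + 1) / (total + U) for t in COUNTS}

def doc_model(doc):
    return laplace_model({t: row[doc] for t, row in COUNTS.items() if doc in row})

def cross_entropy(q, p):
    return -sum(q[t] * math.log2(p[t]) for t in q)

def kl_divergence(q, p):
    return sum(q[t] * math.log2(q[t] / p[t]) for t in q)

query = laplace_model(QUERY_TF)
ce = {d: cross_entropy(query, doc_model(d)) for d in DOCS}
kl = {d: kl_divergence(query, doc_model(d)) for d in DOCS}

# Smaller distance = document model closer to the query model.
for d in sorted(DOCS, key=kl.get):
    print(d, round(ce[d], 4), round(kl[d], 4))
```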

==============================================================================

 

 
