d1: for english language model retrieval has a relevance model while vector space model retrieval doesn't
d2: R-precision measure is relevant to average precision measure
d3: most efficient retrieval models are language model and vector space model
d4: english is the most efficient language
d5: retrieval efficiency is measured by average precision
TERM-DOCUMENT TABLE (ignoring stopwords)

term         d1  d2  d3  d4  d5   in corpus
english       1           1           2
language      1       1   1           3
model         3       3               6
retrieval     2       1       1       4
relevance     1   1           1       3
vector        1       1               2
space         1       1               2
R                 1                   1
most                  1   1           2
efficient                 1   1       2
measure           2           1       3
average           1           1       2
precision         2           1       3
DOC LENGTH   10   7   8   4   6      35
==============================================================================
T = 35   (total term occurrences in the corpus)
D = 5    (number of documents)
U = 13   (unique terms)
avg_doc_length = 35/5 = 7
==============================================================================
QUERY: "efficient retrieval model efficient"
==============================================================================
RAWTF, binary QUERY WEIGHTS

            d1  d2  d3  d4  d5   QUERY
model        3       3            1
retrieval    2       1       1    1
efficient                1   1    1
SCORE        5   0   4   1   2
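These raw-TF scores can be rechecked with a short Python sketch (term counts taken from the table above; the names are just for this sketch):

```python
# Raw-TF retrieval score with binary query weights:
# score(d) = sum of tf(t, d) over the distinct query terms.
tf = {  # term frequencies from the term-document table
    "model":     {"d1": 3, "d3": 3},
    "retrieval": {"d1": 2, "d3": 1, "d5": 1},
    "efficient": {"d4": 1, "d5": 1},
}
query_terms = ["model", "retrieval", "efficient"]  # binary query weights

scores = {d: sum(tf[t].get(d, 0) for t in query_terms)
          for d in ("d1", "d2", "d3", "d4", "d5")}
print(scores)  # d1=5, d2=0, d3=4, d4=1, d5=2
```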
==============================================================================
ROBERTSON TF, K=1, binary QUERY WEIGHTS

robertson_tf = tf / (tf + K)

            d1    d2   d3   d4   d5   QUERY
model       3/4        3/4             1
retrieval   2/3        1/2       1/2   1
efficient                   1/2  1/2   1
SCORE      17/12   0   5/4  1/2   1
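A quick check of the Robertson TF scores in exact fractions (a sketch, not part of the notes):

```python
from fractions import Fraction as F

def robertson_tf(tf, K=1):
    # Robertson TF: tf / (tf + K)
    return F(tf, tf + K)

doc_tf = {  # term frequencies from the term-document table
    "d1": {"model": 3, "retrieval": 2},
    "d2": {},
    "d3": {"model": 3, "retrieval": 1},
    "d4": {"efficient": 1},
    "d5": {"retrieval": 1, "efficient": 1},
}
query_terms = ["model", "retrieval", "efficient"]  # binary query weights

scores = {d: sum(robertson_tf(tfs[t]) for t in query_terms if t in tfs)
          for d, tfs in doc_tf.items()}
# scores: d1 = 17/12, d2 = 0, d3 = 5/4, d4 = 1/2, d5 = 1
```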
==============================================================================
OKAPITF, K=0.5, binary QUERY WEIGHTS

okapi_tf = tf / (tf + K + 1.5 * doc_length / avg_doc_length)

            d1     d2    d3     d4     d5     QUERY
model       0.53         0.57                  1
retrieval   0.43         0.31          0.36    1
efficient                       0.42   0.36    1
SCORE       0.96    0    0.88   0.42   0.72
==============================================================================
OKAPITF, TF QUERY WEIGHTS

            d1     d2    d3     d4     d5     QUERY
model       0.53         0.57                  1
retrieval   0.43         0.31          0.36    1
efficient                       0.42   0.36    2
SCORE       0.96    0    0.88   0.84   1.08
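The Okapi TF scores follow from tf / (tf + 0.5 + 1.5 * doclen/avgdoclen); a sketch that recomputes the TF-weighted scores (tiny differences against the table come from the table rounding each cell to two decimals first):

```python
def okapi_tf(tf, doc_len, avg_len=7.0, K=0.5):
    # Okapi TF: tf / (tf + K + 1.5 * doc_len / avg_len)
    return tf / (tf + K + 1.5 * doc_len / avg_len)

doc_len = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
doc_tf = {
    "d1": {"model": 3, "retrieval": 2},
    "d2": {},
    "d3": {"model": 3, "retrieval": 1},
    "d4": {"efficient": 1},
    "d5": {"retrieval": 1, "efficient": 1},
}
query_tf = {"model": 1, "retrieval": 1, "efficient": 2}  # raw query TF

scores = {d: sum(q * okapi_tf(tfs.get(t, 0), doc_len[d])
                 for t, q in query_tf.items())
          for d, tfs in doc_tf.items()}
# roughly: d1 ~ 0.96, d2 = 0, d3 ~ 0.89, d4 ~ 0.85, d5 ~ 1.08
```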
==============================================================================
IDF WEIGHTS  (logs are base 2: idf = log(D/df))

english     log(5/2) = 1.32
language    log(5/3) = 0.73
model       log(5/2) = 1.32
retrieval   log(5/3) = 0.73
relevance   log(5/3) = 0.73
vector      log(5/2) = 1.32
space       log(5/2) = 1.32
R           log(5/1) = 2.32
most        log(5/2) = 1.32
efficient   log(5/2) = 1.32
measure     log(5/2) = 1.32
average     log(5/2) = 1.32
precision   log(5/2) = 1.32
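The IDF column can be reproduced with base-2 logs of D/df (note 0.73 is a truncation of log2(5/3), which is closer to 0.737):

```python
import math

D = 5  # number of documents
df = {  # document frequencies from the term-document table
    "english": 2, "language": 3, "model": 2, "retrieval": 3,
    "relevance": 3, "vector": 2, "space": 2, "R": 1, "most": 2,
    "efficient": 2, "measure": 2, "average": 2, "precision": 2,
}
idf = {t: math.log2(D / n) for t, n in df.items()}
# e.g. idf["model"] ~ 1.32, idf["retrieval"] ~ 0.737, idf["R"] ~ 2.32
```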
==============================================================================
OKAPITF*IDF, TF QUERY WEIGHTS

            d1          d2    d3          d4          d5          QUERY
model       0.53*1.32         0.57*1.32                            1
retrieval   0.43*0.73         0.31*0.73               0.36*0.73    1
efficient                                 0.42*1.32   0.36*1.32    2
SCORE       1.01         0    0.97        1.10        1.21
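A sketch combining the two factors; because the notes round each factor to two decimals before multiplying, the table shows 1.01, 0.97, 1.10, 1.21, while full precision gives roughly 1.02, 0.99, 1.12, 1.21:

```python
import math

def okapi_tf(tf, doc_len, avg_len=7.0, K=0.5):
    # Okapi TF: tf / (tf + K + 1.5 * doc_len / avg_len)
    return tf / (tf + K + 1.5 * doc_len / avg_len)

idf = {"model": math.log2(5 / 2), "retrieval": math.log2(5 / 3),
       "efficient": math.log2(5 / 2)}
doc_len = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
doc_tf = {"d1": {"model": 3, "retrieval": 2},
          "d2": {},
          "d3": {"model": 3, "retrieval": 1},
          "d4": {"efficient": 1},
          "d5": {"retrieval": 1, "efficient": 1}}
query_tf = {"model": 1, "retrieval": 1, "efficient": 2}

scores = {d: sum(q * okapi_tf(tfs.get(t, 0), doc_len[d]) * idf[t]
                 for t, q in query_tf.items())
          for d, tfs in doc_tf.items()}
```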
==============================================================================
COSINE SIMILARITY
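The notes leave this section blank; one plausible filling-in (an assumption, not taken from the notes) is cosine similarity between raw-TF vectors of the query and a document:

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|) over the union of their terms
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in set(u) | set(v))
    def norm(w):
        return math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

q = {"model": 1, "retrieval": 1, "efficient": 2}   # raw query TF
d1 = {"english": 1, "language": 1, "model": 3, "retrieval": 2,
      "relevance": 1, "vector": 1, "space": 1}     # raw TF for d1
# cosine(q, d1) = 5 / (sqrt(6) * sqrt(18)) ~ 0.48
```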
==============================================================================
LANG MODEL: MAX-LIKELIHOOD ESTIMATE, document model

            d1     d2    d3    d4    d5
english     1/10               1/4
language    1/10         1/8   1/4
model       3/10         3/8
retrieval   2/10         1/8         1/6
relevance   1/10   1/7               1/6
vector      1/10         1/8
space       1/10         1/8
R                  1/7
most                     1/8   1/4
efficient                      1/4   1/6
measure            2/7               1/6
average            1/7               1/6
precision          2/7               1/6
QUERY-LIKELIHOOD    0     0     0     0     0

(Every document is missing at least one query term, so the unsmoothed
query-likelihood is 0 for all five documents; this is what motivates smoothing.)
==============================================================================
LANG MODEL: MAX-LIKELIHOOD + LAPLACE ESTIMATE, document model

p(t|d) = (tf + 1) / (doc_length + U)

            d1     d2     d3     d4     d5
english     2/23   1/20   1/21   2/17   1/19
language    2/23   1/20   2/21   2/17   1/19
model       4/23   1/20   4/21   1/17   1/19
retrieval   3/23   1/20   2/21   1/17   2/19
relevance   2/23   2/20   1/21   1/17   2/19
vector      2/23   1/20   2/21   1/17   1/19
space       2/23   1/20   2/21   1/17   1/19
R           1/23   2/20   1/21   1/17   1/19
most        1/23   1/20   2/21   2/17   1/19
efficient   1/23   1/20   1/21   2/17   2/19
measure     1/23   3/20   1/21   1/17   2/19
average     1/23   2/20   1/21   1/17   2/19
precision   1/23   3/20   1/21   1/17   2/19
QUERY-LIKELIHOOD   12/23^4   1/20^4   8/21^4   4/17^4   8/19^4
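The Laplace query-likelihoods come from multiplying the smoothed probabilities over the four query tokens (so "efficient" counts twice); a check in exact arithmetic:

```python
import math
from fractions import Fraction as F

U = 13  # unique terms in the corpus
doc_len = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
doc_tf = {"d1": {"model": 3, "retrieval": 2},
          "d2": {},
          "d3": {"model": 3, "retrieval": 1},
          "d4": {"efficient": 1},
          "d5": {"retrieval": 1, "efficient": 1}}
query = ["efficient", "retrieval", "model", "efficient"]

def p_laplace(t, d):
    # add-one smoothing: (tf + 1) / (doc_length + U)
    return F(doc_tf[d].get(t, 0) + 1, doc_len[d] + U)

ql = {d: math.prod(p_laplace(t, d) for t in query) for d in doc_len}
# ql: d1 = 12/23^4, d2 = 1/20^4, d3 = 8/21^4, d4 = 4/17^4, d5 = 8/19^4
```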
==============================================================================
LANG MODEL: MAX-LIKELIHOOD ESTIMATE, JELINEK-MERCER SMOOTHING, document model

How do we get the background probability? Two ways; the discussion WILL USE THE
SECOND WAY.
    A. corpus ML estimate        B. average of the doc ML probabilities

            A           B
english     2/35        (1/10 + 1/4)/5 = 0.07
language    3/35        (1/10 + 1/8 + 1/4)/5 = 0.09
model       6/35        (3/10 + 3/8)/5 = 0.13
retrieval   4/35        (2/10 + 1/8 + 1/6)/5 = 0.1
relevance   3/35
vector      2/35
space       2/35
R           1/35
most        2/35
efficient   2/35        (1/4 + 1/6)/5 = 0.08
measure     3/35
average     2/35
precision   3/35

USE L = lambda = 0.8, i.e. p(t|d) = (1-L)*p_ml(t|d) + L*p_background(t)

d1
english     .2*1/10 + .8*0.07
language    .2*1/10 + .8*0.09
model       .2*3/10 + .8*0.13 = 0.16
retrieval   .2*2/10 + .8*0.1 = 0.12
relevance   .2*1/10 + .8*
vector      .2*1/10 + .8*
space       .2*1/10 + .8*
R
most
efficient   0.8 * 0.08 = 0.06
measure
average
precision
QUERY-LIKELIHOOD for d1: 0.16 * 0.12 * 0.06^2
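A numeric check of the d1 query-likelihood under Jelinek-Mercer, using the rounded background values 0.13, 0.1, 0.08 from the table (so the product differs slightly from 0.16 * 0.12 * 0.06^2, which rounds each factor first):

```python
L = 0.8  # smoothing weight on the background model
p_ml_d1 = {"model": 3 / 10, "retrieval": 2 / 10, "efficient": 0.0}
p_bg = {"model": 0.13, "retrieval": 0.1, "efficient": 0.08}  # rounded, way B

# Jelinek-Mercer: p(t|d) = (1 - L) * p_ml(t|d) + L * p_bg(t)
p = {t: (1 - L) * p_ml_d1[t] + L * p_bg[t] for t in p_ml_d1}
ql_d1 = p["model"] * p["retrieval"] * p["efficient"] ** 2
```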
==============================================================================
LANG MODEL: MAX-LIKELIHOOD ESTIMATE, WITTEN-BELL SMOOTHING, document model

Background probabilities as in the Jelinek-Mercer section above (way B: average
of the doc ML probabilities; e.g. model 0.13, retrieval 0.1, efficient 0.08).

d1:   N = 10 (tokens)   V = 7 (distinct terms)
p(t|d) = N/(N+V) * p_ml(t|d) + V/(N+V) * p_background(t)

english     10/17*1/10 + 7/17*0.07
language    10/17*1/10 + 7/17*0.09
model       10/17*3/10 + 7/17*0.13 = 0.23
retrieval   10/17*2/10 + 7/17*0.1 = 0.16
relevance   10/17*1/10 + 7/17*
vector      10/17*1/10 + 7/17*
space       10/17*1/10 + 7/17*
R
most
efficient   7/17 * 0.08 = 0.03
measure
average
precision
QUERY-LIKELIHOOD for d1: 0.23 * 0.16 * 0.03^2
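The same check for Witten-Bell on d1, again with the rounded background values (so the product only approximately matches 0.23 * 0.16 * 0.03^2):

```python
N, V = 10, 7  # d1: 10 tokens, 7 distinct terms
p_ml_d1 = {"model": 3 / 10, "retrieval": 2 / 10, "efficient": 0.0}
p_bg = {"model": 0.13, "retrieval": 0.1, "efficient": 0.08}  # rounded, way B

# Witten-Bell: p(t|d) = N/(N+V) * p_ml(t|d) + V/(N+V) * p_bg(t)
p = {t: N / (N + V) * p_ml_d1[t] + V / (N + V) * p_bg[t] for t in p_ml_d1}
ql_d1 = p["model"] * p["retrieval"] * p["efficient"] ** 2
```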
==============================================================================
LANG MODEL: MAX-LIKELIHOOD + LAPLACE, query model

QUERY MODEL  (query length 4; denominator 4 + 13 = 17)
english     1/17
language    1/17
model       2/17
retrieval   2/17
relevance   1/17
vector      1/17
space       1/17
R           1/17
most        1/17
efficient   3/17
measure     1/17
average     1/17
precision   1/17

Term counts per document (as in the TERM-DOCUMENT TABLE):
            d1  d2  d3  d4  d5
english      1           1
language     1       1   1
model        3       3
retrieval    2       1       1
relevance    1   1           1
vector       1       1
space        1       1
R                1
most                 1   1
efficient                1   1
measure          2           1
average          1           1
precision        2           1

DOC-LIKELIHOOD for d1:
    1/17 * 1/17 * (2/17)^3 * (2/17)^2 * 1/17 * 1/17 * 1/17

PROBLEM: this likelihood must be normalized for document length; otherwise long
documents get very low scores.
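The d1 doc-likelihood under the Laplace query model, in exact arithmetic (a sketch, using the counts above):

```python
import math
from fractions import Fraction as F

# Laplace query model: (tf in query + 1) / (4 + 13) = (tf + 1)/17
q_tf = {"model": 1, "retrieval": 1, "efficient": 2}  # other terms: tf = 0

def p_q(t):
    return F(q_tf.get(t, 0) + 1, 17)

d1_tf = {"english": 1, "language": 1, "model": 3, "retrieval": 2,
         "relevance": 1, "vector": 1, "space": 1}
doc_likelihood_d1 = math.prod(p_q(t) ** n for t, n in d1_tf.items())
# = (1/17)^5 * (2/17)^3 * (2/17)^2 = 32/17^10
```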
==============================================================================
KL DISTANCE model comparison, LAPLACE SMOOTHING

            d1     d2     d3     d4     d5     QUERY
english     2/23   1/20   1/21   2/17   1/19   1/17
language    2/23   1/20   2/21   2/17   1/19   1/17
model       4/23   1/20   4/21   1/17   1/19   2/17
retrieval   3/23   1/20   2/21   1/17   2/19   2/17
relevance   2/23   2/20   1/21   1/17   2/19   1/17
vector      2/23   1/20   2/21   1/17   1/19   1/17
space       2/23   1/20   2/21   1/17   1/19   1/17
R           1/23   2/20   1/21   1/17   1/19   1/17
most        1/23   1/20   2/21   2/17   1/19   1/17
efficient   1/23   1/20   1/21   2/17   2/19   3/17
measure     1/23   3/20   1/21   1/17   2/19   1/17
average     1/23   2/20   1/21   1/17   2/19   1/17
precision   1/23   3/20   1/21   1/17   2/19   1/17

FOR d1: distance = - 1/17*log(2/23) - 1/17*log(2/23) - 2/17*log(4/23) - 2/17*log(3/23) ...
FOR d2: distance = - 1/17*log(1/20) - 1/17*log(1/20) - 2/17*log(1/20) - 2/17*log(1/20) ...
FOR d3: distance = - 1/17*log(1/21) - 1/17*log(2/21) - 2/17*log(4/21) - 2/17*log(2/21) ...
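Carrying these sums over all 13 terms ranks the documents (smaller distance is a better match; the log base does not affect the ordering). A sketch:

```python
import math

terms = ["english", "language", "model", "retrieval", "relevance",
         "vector", "space", "R", "most", "efficient", "measure",
         "average", "precision"]
counts = {  # raw counts from the term-document table (zeros omitted)
    "d1": {"english": 1, "language": 1, "model": 3, "retrieval": 2,
           "relevance": 1, "vector": 1, "space": 1},
    "d2": {"relevance": 1, "R": 1, "measure": 2, "average": 1, "precision": 2},
    "d3": {"language": 1, "model": 3, "retrieval": 1, "vector": 1,
           "space": 1, "most": 1},
    "d4": {"english": 1, "language": 1, "most": 1, "efficient": 1},
    "d5": {"retrieval": 1, "relevance": 1, "efficient": 1, "measure": 1,
           "average": 1, "precision": 1},
}
q_tf, U = {"model": 1, "retrieval": 1, "efficient": 2}, 13

def laplace_model(tf_map, length):
    # add-one smoothed distribution over the 13-term vocabulary
    return {t: (tf_map.get(t, 0) + 1) / (length + U) for t in terms}

q_model = laplace_model(q_tf, 4)
dist = {}
for d, c in counts.items():
    d_model = laplace_model(c, sum(c.values()))
    # cross-entropy: equals KL(query || doc) up to the constant query entropy
    dist[d] = -sum(q_model[t] * math.log2(d_model[t]) for t in terms)
# smaller distance = better match; here d5 < d4 < d3 < d1 < d2
```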
==============================================================================