In this section, we will conduct a range of experiments with the
truncated models of [5], which we discussed in great
detail above.
Since our focus is the thresholding problem, we use an off-the-shelf
retrieval system: the vector-space model of Apache's
.
More information about the collection, topics, and evaluation measures can be found in the overview paper in this volume, and at the TREC Legal web-site.
For TREC Legal 2007 and 2008 we created the following runs:

- This run is the run labeled … in [4].
- This run is the basis for the official submissions labeled …, …, and ….
- This run corresponds to our official submission labeled ….
We first discuss the overall quality of the rankings, and then the main
topic of this paper: estimating the cut-off K.
| Run | P@5 | Recall@B | F1@R |
|---|---|---|---|
| Legal07 | 0.3302 | 0.1548 | 0.1328 |
| Legal08 | 0.4846 | 0.2036 | 0.1709 |
| highest | 0.5923 | 0.2779 | 0.2173 |
| median | 0.4154 | 0.2036 | 0.1709 |
| lowest | 0.0538 | 0.0729 | 0.0694 |
The top half of Table 2 shows several measures on the two underlying
rankings, Legal07 and Legal08. We show precision at 5 (all top-5 results
were judged by TREC); estimated recall at B; and the F1 of the estimated
precision and recall at R (i.e. the estimated number of relevant
documents).
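The three measures can be made concrete with a small sketch. The toy ranking, the Boolean-run size B, and the estimated R below are all hypothetical stand-ins, not the collection's actual values:

```python
# Sketch of the three measures in Table 2, on a toy judged ranking.
# "rels" holds binary relevance by rank; B and R are assumed values.

def precision_at(rels, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(rels[:k]) / k

def recall_at(rels, k, total_relevant):
    """Fraction of all relevant documents found in the top k."""
    return sum(rels[:k]) / total_relevant

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Hypothetical data: 1 = relevant, 0 = non-relevant.
rels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
B = 8   # size of the reference Boolean result set (assumed)
R = 5   # estimated number of relevant documents (assumed)

p5 = precision_at(rels, 5)                                   # 0.6
r_at_b = recall_at(rels, B, R)                               # 0.8
f1_at_r = f1(precision_at(rels, R), recall_at(rels, R, R))   # 0.6
print(p5, r_at_b, f1_at_r)
```

Note that F1 at R equals precision at R here, since precision and recall coincide when the cut-off equals the number of relevant documents.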
To determine the quality of our rankings in comparison to other
systems, we show the highest, lowest, and median performance of all
submissions in the bottom half of Table 2. As it turns out, Legal08
obtains exactly the median performance for Recall@B and F1@R when using
all relevant documents in evaluation. Both rankings fare somewhat
better than the median at P@5 and in evaluating with the highly
relevant documents only.
It is clear that our rankings are far from optimal in comparison with
the other submissions. On the negative side, this limits the
performance of the s-d method.
On the plus side, it makes our rankings good representatives of the
median-quality ranking.
| Run | Truncation | F1@K (2007) | F1@K (2008) |
|---|---|---|---|
| sd original | None | - | 0.0681 |
| B | Theoretical | 0.0984 | 0.1361 |
| A | Technical | 0.1011 | 0.1284 |
| highest | - | - | 0.1848 |
| median | - | - | 0.0974 |
| lowest | - | - | 0.0051 |
All runs with the improved version of the s-d method lead to significantly better results. The B runs use the theoretical truncation of Section 3.3.1, whereas the A runs use the technical truncation of Section 3.3.2. For 2007, the technically truncated A model is superior to the theoretically truncated B model. For 2008, the technically truncated A model lags somewhat behind the theoretically truncated B model. In comparison with the `old' non-truncated model, corresponding to our official TREC 2008 submission, both truncated models obtain significantly better results.
We also show the highest, lowest, and median performance over the 23
submissions to TREC Legal 2008 (recall that the thresholding task is
new at TREC 2008, so there is no comparable data for 2007). Note that
the actual value of F1@K is a result of both the quality of the
underlying ranking and choosing the right threshold. As seen earlier,
our ranking has the median Recall@B and F1@R. With the estimated
threshold of the s-d model, the F1@K is 0.1374, well above the median
score of 0.0974.
There is still ample room for improvement. The F1@R in Table 2 is
0.1328 for 2007 and 0.1709 for 2008, and we obtain 75-80% of these
scores. Obviously, R is not known in an operational system, and F1@R
serves as a soft upper bound on performance.
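The threshold-selection step itself can be sketched as follows: given fitted score distributions for relevant and non-relevant documents, expected precision and recall follow at every rank, and K is set to maximize expected F1. The sketch below assumes a normal (relevant) / exponential (non-relevant) mixture with made-up parameters; the fitting procedure and the truncated variants of Sections 3.3.1 and 3.3.2 are not reproduced here:

```python
import math

# Sketch of s-d-style threshold selection: convert each score into a
# probability of relevance under an assumed mixture, accumulate expected
# precision/recall down the ranking, and keep the rank maximizing
# expected F1. All distribution choices and parameters are illustrative.

def relevance_probs(scores, lam, mu, sigma, prior_rel):
    """P(relevant | score) under a normal(rel)/exponential(non-rel) mixture."""
    probs = []
    for s in scores:
        p_rel = prior_rel * math.exp(-0.5 * ((s - mu) / sigma) ** 2) \
                / (sigma * math.sqrt(2 * math.pi))
        p_non = (1 - prior_rel) * lam * math.exp(-lam * s)
        probs.append(p_rel / (p_rel + p_non))
    return probs

def best_cutoff(scores, lam=5.0, mu=0.7, sigma=0.15, prior_rel=0.1):
    """Return (K, expected F1 at K) maximizing expected F1."""
    probs = relevance_probs(scores, lam, mu, sigma, prior_rel)
    total_rel = sum(probs)          # expected number of relevant docs (R)
    best_k, best_f1, found = 0, 0.0, 0.0
    for k, p in enumerate(probs, start=1):
        found += p                  # expected relevant docs in top k
        prec, rec = found / k, found / total_rel
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1

# Toy descending score list (hypothetical data).
scores = [0.9 - 0.001 * i for i in range(800)]
print(best_cutoff(scores))
```

The key design point is that K is chosen from the model's own precision and recall estimates, so a badly fitted mixture misplaces K even on a good ranking.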
Figure 6 shows the F1@K scores of the Legal 2008 B run, plotted against
the "ceiling" of F1 at the estimated R. We will look in detail at some
of the topics from the 2007 and 2008 B runs:
Figure 7 compares the prediction of the s-d model with the official
evaluation's estimated precision, recall, and F1. Before discussing
each of the topics in detail, an immediate observation is that the
estimated (non-interpolated) precision is strikingly different from the
monotonically declining "ideal" precision curves.
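This behavior follows from how the official precision is estimated: only a sample of documents is judged, and each judged document is weighted by the inverse of its inclusion probability. A minimal Horvitz-Thompson-style sketch, with an invented sample design rather than TREC's actual sampling scheme:

```python
# Sketch of sampling-based ("estimated") precision: sparse judgments,
# each weighted by 1/inclusion_probability. The sample below is
# hypothetical; deep ranks are judged with low probability, so one deep
# relevant hit carries a large weight and the curve can rise again.

def estimated_precision(judgments, k):
    """Estimate precision at rank k from sparse, weighted judgments.

    judgments: dict rank -> (is_relevant, inclusion_probability),
    covering only the sampled ranks.
    """
    est_relevant = sum(1.0 / prob
                       for rank, (rel, prob) in judgments.items()
                       if rank <= k and rel)
    return est_relevant / k

# Hypothetical sparse sample.
sample = {1: (True, 1.0), 3: (False, 1.0), 500: (True, 0.01)}
print(estimated_precision(sample, 400))   # 0.0025
print(estimated_precision(sample, 1000))  # 0.101 -- not monotone
```

Here the estimate at rank 1000 exceeds the estimate at rank 400, which is exactly the non-monotonic shape visible in the estimated curves.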
For Topic 73 (Legal 2007), the estimated R exceeds the length of the
ranking, and the optimal cut-off corresponds to the last found relevant
document at rank 22,091. The s-d model is clearly aiming too low and
estimates R at 2,720 and K at 2,593.
Topic 105 (Legal 2008) has an R of 34,424, well within the length of
the ranking, and the s-d model estimates an R of 36,503, near to the
real R, and an estimated K of 28,952. The divergence in the prediction
of K may be explained, in part, by the fact that the optimal cut-off
always corresponds to a point where a relevant document is retrieved,
and judged documents are very sparse down at this rank.
Topic 124 (Legal 2008) has an R of 20,083, and the s-d model predicts
an R of 51,231 and a K of 43,597. Here, the R is overestimated but the
K is very close to the optimal cut-off.
Topic 145 (Legal 2008) has an R of 91,790, very close to the length of
the ranking. The s-d model predicts an R of 87,060 and a K of 91,590,
both relatively close to the official evaluation, especially when
bearing in mind that the optimal cut-off is again at the last relevant
document in the whole ranking.
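The optimal cut-off referred to in these comparisons can be computed by scanning the judged ranking and keeping the rank with the highest F1; since F1 only improves when a relevant document is found, the optimum always lands on a rank holding a relevant document. A small sketch on invented judgments:

```python
# Sketch of computing the optimal cut-off from a judged ranking: track
# precision and recall down the list and keep the rank maximizing F1.
# The optimum always falls on a rank where a relevant document appears,
# possibly the very last one, as for Topics 73 and 145.

def optimal_cutoff(rels, total_relevant):
    """Return (K_opt, F1 at K_opt) over a list of binary judgments."""
    best_k, best_f1, found = 0, 0.0, 0
    for k, rel in enumerate(rels, start=1):
        found += rel
        if found == 0:
            continue
        prec, rec = found / k, found / total_relevant
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1

# Hypothetical judgments: here K_opt lands on the last relevant document.
rels = [0, 1, 1, 1, 0, 1]
k_opt, f1_opt = optimal_cutoff(rels, total_relevant=4)
print(k_opt, f1_opt)  # K_opt = 6, F1 = 0.8
```

Run against the full judged rankings of the topics above, this is the quantity the s-d model's K is being compared to.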