In this section, we conduct a range of experiments with the truncated models of [5], discussed in detail above. Since our focus is the thresholding problem, we use an off-the-shelf retrieval system: the vector-space model of Apache's Lucene.
More information about the collection, topics, and evaluation measures can be found in the overview paper in this volume, and at the TREC Legal web-site.
For TREC Legal 2007 and 2008 we created the following runs:
- This run is the run labeled in [4].
- This run is the basis for the official submissions labeled -, -, and -.
- This run corresponds to our official submission labeled -.
We first discuss the overall quality of the rankings, and then turn to the main topic of this paper: estimating the cut-off K.
Run | P@5 | recall@R | F1@R
---|---|---|---
Legal07 | 0.3302 | 0.1548 | 0.1328
Legal08 | 0.4846 | 0.2036 | 0.1709
highest | 0.5923 | 0.2779 | 0.2173
median | 0.4154 | 0.2036 | 0.1709
lowest | 0.0538 | 0.0729 | 0.0694
The top half of Table 2 shows several measures on the two underlying rankings, Legal07 and Legal08. We show precision at 5 (all top-5 results were judged by TREC); estimated recall at R; and the F1 of the estimated precision and recall at R, where R is the estimated number of relevant documents.
To determine the quality of our rankings in comparison to other systems, we show the highest, lowest, and median performance of all submissions in the bottom half of Table 2. As it turns out, Legal08 obtains exactly the median performance for recall@R and F1@R when using all relevant documents in evaluation. Both rankings fare somewhat better than the median when evaluating with the highly relevant documents only. It is clear that our rankings are far from optimal in comparison with the other submissions. On the negative side, this limits the performance of the s-d method. On the plus side, it makes our rankings good representatives of a median-quality ranking.
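For concreteness, the F1 values in Table 2 are the harmonic mean of the (estimated) precision and recall at the cut-off. A minimal sketch follows; the precision value used below is back-solved from the Legal08 row purely for illustration and is not reported in the table:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustration against the Legal08 row of Table 2: with recall@R = 0.2036,
# a precision@R of about 0.147 (back-solved, hypothetical) reproduces the
# listed F1@R of roughly 0.1709.
print(round(f1(0.1472, 0.2036), 4))
```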
Run | Truncation | 2007 | 2008
---|---|---|---
s-d original | None | - | 0.0681
B | Theoretical | 0.0984 | 0.1361
A | Technical | 0.1011 | 0.1284
highest | - | - | 0.1848
median | - | - | 0.0974
lowest | - | - | 0.0051
All runs with the improved version of the s-d method lead to significantly better results. The B runs use the theoretical truncation of Section 3.3.1, whereas the A runs use the technical truncation of Section 3.3.2. For 2007, the technically truncated model A is superior to the theoretically truncated model B; for 2008, model A lags somewhat behind model B. In comparison with the `old' non-truncated model, corresponding to our official TREC 2008 submission, both truncated models obtain significantly better results.
We also show the highest, lowest, and median performance over the 23 submissions to TREC Legal 2008 (recall that the thresholding task is new at TREC 2008, so there is no comparable data for 2007). Note that the actual value of F1@K results from both the quality of the underlying ranking and choosing the right threshold. As seen earlier, our ranking has the median recall@R and F1@R. With the estimated threshold of the s-d model, the F1@K is 0.1374, well above the median score of 0.0974.
There is still ample room for improvement. The F1@R in Table 2 is 0.1328 for 2007 and 0.1709 for 2008, and we obtain 75-80% of these scores. Obviously, R is not known in an operational system, and F1@R serves as a soft upper bound on performance.
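Once a model supplies estimated precision and recall at every candidate rank, the thresholding step itself reduces to maximizing the estimated F1 over cut-offs K. A minimal sketch with made-up curves (this is only the final maximization, not the s-d estimation of [5] itself):

```python
def best_cutoff(est_precision, est_recall):
    """Return (K, F1) maximizing estimated F1 over cut-offs K = 1..n.

    est_precision[i], est_recall[i]: model estimates at cut-off i + 1.
    """
    best_k, best_f1 = 0, 0.0
    for k, (p, r) in enumerate(zip(est_precision, est_recall), start=1):
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1

# Toy curves: precision falls and recall rises as K grows.
p = [0.9, 0.7, 0.5, 0.4, 0.3]
r = [0.1, 0.3, 0.45, 0.5, 0.52]
k, f = best_cutoff(p, r)  # K = 3 maximizes estimated F1 here
```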
Figure 6 shows the F1 scores of the Legal 2008 B run, plotted against the "ceiling" of F1 at the estimated R. We will look in detail at some of the topics from the 2007 and 2008 B runs:
Figure 7 compares the prediction of the s-d model with the official evaluation's estimated precision, recall, and F1. Before discussing each of the topics in detail, an immediate observation is that the estimated (non-interpolated) precision is strikingly different from the monotonically declining "ideal" precision curves.
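The monotone "ideal" curves arise from interpolation, which assigns each rank the maximum precision observed at that rank or any deeper rank; the raw estimated precision carries no such constraint. A small sketch of that envelope, on toy numbers:

```python
def interpolate(precisions):
    """Monotone ('ideal') precision envelope: each point becomes the
    maximum precision observed at that rank or any deeper rank."""
    out, running_max = [], 0.0
    for p in reversed(precisions):      # scan from the deepest rank up
        running_max = max(running_max, p)
        out.append(running_max)
    return out[::-1]

raw = [0.5, 0.8, 0.4, 0.6, 0.3]         # toy non-monotone precision curve
smooth = interpolate(raw)               # [0.8, 0.8, 0.6, 0.6, 0.3]
```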
For Topic 73 (Legal 2007), the estimated R exceeds the length of the ranking, and the optimal cut-off corresponds to the last found relevant document at rank 22,091. The s-d model is clearly aiming too low and estimates R at 2,720 and K at 2,593.
Topic 105 (Legal 2008) has an R of 34,424, well within the length of the ranking, and the s-d model estimates an R of 36,503, near the real R, and an estimated K of 28,952. The divergence in the prediction of K may be explained, in part, by the fact that the optimal cut-off always corresponds to a point where a relevant document is retrieved, and judged documents are very sparse down at this rank.
Topic 124 (Legal 2008) has an R of 20,083; the s-d model predicts an R of 51,231 and a K of 43,597. Here, R is overestimated but K is very close to the optimal cut-off. Topic 145 (Legal 2008) has an optimal cut-off of 91,790, very close to the length of the ranking. The s-d model predicts an R of 87,060 and a K of 91,590, both relatively close to the official evaluation, especially bearing in mind that the optimal cut-off is again at the last relevant document in the whole ranking.
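The optimal cut-offs quoted above are, in simplified terms, the rank maximizing the true F1. A sketch assuming exhaustive binary judgments (the official evaluation instead works with sampling-based estimates); it also illustrates why the optimum always lands on a relevant document, since F1 can only rise when the next document retrieved is relevant:

```python
def optimal_cutoff(rels, total_relevant):
    """Cut-off maximizing F1, given binary relevance per rank (1 = relevant).

    F1 can only increase when the next document is relevant, so the
    optimum always falls on a rank holding a relevant document.
    """
    best_k, best_f1, found = 0, 0.0, 0
    for k, rel in enumerate(rels, start=1):
        found += rel
        p = found / k                   # precision at cut-off k
        r = found / total_relevant      # recall at cut-off k
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        if f > best_f1:
            best_k, best_f1 = k, f
    return best_k, best_f1

rels = [1, 1, 0, 1, 0, 0, 0, 1]         # toy ranking; 5 relevant in collection
k_opt, f1 = optimal_cutoff(rels, total_relevant=5)   # optimum at rank 4
```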
avi (dot) arampatzis (at) gmail