The normal-exponential mixture has worked best when some relevance judgments are available to indicate the form of the component densities [8,3,22]. In filtering or classification, some training data are usually available, although often biased. In the current task, however, no relevance information is available.
A method that recovers the component densities without any relevance judgments, using the Expectation-Maximization (EM) algorithm, was introduced in the context of fusion [14]. In order to deal with the biased training data in filtering, the EM method was later adapted and applied to thresholding tasks [1]. Nevertheless, EM was found to be "messy" and sensitive to its initial parameter settings [1,14]. We will improve upon this estimation method in Section 4.3.
For practical reasons, rankings are usually truncated at some rank. Even what is usually considered a full ranking is in fact only that subset of the collection containing the documents with at least one term in common with the query.
This fact has been largely ignored in all previous research using the standard model, even though it may greatly affect the estimation. For example, the rankings of TREC Legal 2007 and 2008 were truncated at 25,000 and 100,000 documents respectively. This results in a left-truncation of the score distributions which, at least in the case of the 2007 data, is significant. For 2007 it was estimated that there were more than 25,000 relevant documents for 13 of the 43 Ad Hoc topics, and the median system was still achieving non-negligible precision at ranks near the truncation point.
Additionally, considering that the exponential may not be a good model for the whole distribution of the non-relevant scores but only for their high end, some imposed truncation may help achieve better fits. Consequently, all estimation should take place at the top of the ranking and then be extrapolated to the whole collection. The truncated models of [5] require changes to the estimation formulas.
Let us assume that the ranking is truncated at some truncation score. For both truncated models, we need to estimate a normal truncated on both sides and a shifted exponential that is additionally right-truncated, with the upper truncation point possibly being the maximum possible score. Thus, the formulas that should be used are Equations 9 and 10, with the lower truncation point set to the truncation score.
Given the fraction of relevant documents in the truncated ranking, the total number of relevant documents is calculated by extrapolating the truncated normal outside its estimation range, appropriately per model, in order to account for the remaining relevant documents.
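As a hedged sketch of this extrapolation idea only (not necessarily the exact per-model formula), assume hypothetical names: rho for the fraction of relevant documents in the truncated ranking, n_top for the number of retained scores, mu and sigma for the fitted normal parameters, and s_t, s_max for the estimation range.

```python
from scipy.stats import norm

def extrapolate_relevant(rho, n_top, mu, sigma, s_t, s_max):
    """Illustrative extrapolation of the total number of relevant documents."""
    # Relevant documents actually observed in the truncated ranking.
    r_observed = rho * n_top
    # Probability mass of the full normal inside the estimation range [s_t, s_max].
    mass_inside = norm.cdf(s_max, mu, sigma) - norm.cdf(s_t, mu, sigma)
    # Scale up to account for the relevant mass outside the range (guarding degenerate fits).
    return r_observed / max(mass_inside, 1e-12)
```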
In estimating the technically truncated model, if there are any scores equal to the lower or the upper truncation point they should be removed from the data set; these belong to the discontinuous legs of the densities given in Section 3.3.2. In this case, the corresponding document counts should be decremented accordingly. In practice, while scores equal to the lower truncation point should not exist in the top of the ranking due to the down-truncation, some scores equal to the upper point may very well be in the data. Removing these during estimation is a simplifying approximation with an insignificant impact when the relevant documents are many and the bulk of their score distribution is below the upper truncation point, as is the case in the current experimental setup. As we will see next, while we do not use these scores during fitting, we take them into account during goodness-of-fit testing; using multiple such fitting/testing rounds reduces the impact of the approximation.
Our scores have a limited resolution; LUCENE rounds or truncates the output scores, destroying information. In order to smooth out the effect of rounding in the data, we add to each datum point a random offset drawn uniformly between zero and the score resolution.
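A minimal sketch of this de-rounding step, assuming a hypothetical RESOLUTION constant equal to the precision of the reported scores:

```python
import numpy as np

RESOLUTION = 0.001  # assumed score resolution; set to the actual output precision

def smooth_rounding(scores, resolution=RESOLUTION, rng=None):
    """Add uniform noise in [0, resolution) to undo the effect of score rounding."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    return scores + rng.uniform(0.0, resolution, size=scores.shape)
```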
Beyond using all available scores, and in order to speed up the calculations, we also tried stratified down-sampling, keeping only 1 out of every 2, 3, or 10 scores. Before any down-sampling, all datum points were smoothed by replacing them with their average value over a surrounding window of 2, 3, or 10 points, respectively.
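A sketch of the smoothing-plus-stratified-down-sampling step as we understand it (an illustration, not the exact implementation): each window of k consecutive scores, in rank order, is replaced by its average and one value is kept per window.

```python
import numpy as np

def smooth_and_downsample(scores, k):
    """Average consecutive windows of k scores and keep one value per window."""
    scores = np.asarray(scores, dtype=float)
    n = (len(scores) // k) * k           # drop the incomplete trailing window
    windows = scores[:n].reshape(-1, k)  # one row per stratum of k consecutive scores
    return windows.mean(axis=1)          # one smoothed value per stratum
```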
In order to obtain better exponential fits we may further left-truncate the rankings at the mode of the observed distribution. We bin the scores (as described in Section A.1), find the bin with the most scores, and if that is not the leftmost bin then we remove all scores in previous bins.
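A sketch of this left-truncation at the mode, with the binning reduced to a plain equal-width histogram for illustration (the actual binning procedure is the one described in Section A.1):

```python
import numpy as np

def truncate_at_mode(scores, n_bins):
    """Drop all scores falling in bins to the left of the most populated bin."""
    scores = np.asarray(scores, dtype=float)
    counts, edges = np.histogram(scores, bins=n_bins)
    mode_bin = int(np.argmax(counts))
    if mode_bin == 0:                    # mode already in the leftmost bin: keep everything
        return scores
    return scores[scores >= edges[mode_bin]]
```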
EM is an iterative procedure which converges locally [9]. Finding a global fit depends largely on the initial settings of the parameters.
We tried numerous initial settings, but no setting seemed universal. While some settings helped some fits considerably, they had a negative impact on others. Without any indication of the form, location, and weighting of the component densities, the best fits overall were obtained with randomized initial values, which also preserves the generality of the approach:
where the quantities involved are the maximum score datum, the mean and the variance of the score data, an arbitrarily small constant which we set equal to the bin width (see Appendix A.1), and another constant which we explain below.
Assuming that no information is available about the expected number of relevant documents, not much can be done for the mixing weight, so it is randomized over its whole supported range. Next, we assume that the right-truncation of the exponential is insignificant, which seems to be the case in our current experimental set-up.
If there were no relevant documents, the mean of the observed scores would equal the mean of the exponential; from this we deduce a minimum decay rate for the exponential. Although, in general, there is no reason why the exponential cannot fall more slowly than this, from an IR perspective it should not, or the mean of the non-relevant scores would get higher than the mean of the observed scores.
The given range is suitable for a full normal; it should be expanded on both sides for a truncated one, because the mean of the corresponding full normal can lie below or above the observed score range. Further, the range could be restricted based on the hypothesis that, for good systems, relevant documents should on average score higher than non-relevant ones. We have not worked out these improvements.
The variance of the initial exponential follows from its mean. Assuming that the random variables corresponding to the normal and the exponential are uncorrelated, the variance of the normal can be obtained by subtracting the exponential's variance from the observed variance; depending on how the exponential is initialized, this difference can become non-positive, so we take the maximum of it and the small constant. For an insignificantly truncated normal the remaining constant is close to one, while in general it should be larger, because the variance of the corresponding full normal is larger than what is observed in the truncated data; we found only a narrow range of values for it to be usable.
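To make this concrete, here is one hedged sketch of randomized starting values in the spirit described above; the exact ranges, the bin-width constant eps, and the variance-inflation constant c are assumptions, not the paper's precise settings.

```python
import numpy as np

def init_em(scores, eps, c=1.2, rng=None):
    """Illustrative randomized EM initialization (not the paper's exact scheme).

    eps : small constant, e.g. the histogram bin width
    c   : variance-inflation constant for the (possibly truncated) normal
    """
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(scores, dtype=float)
    s_max, s_mean, s_var = s.max(), s.mean(), s.var()

    lam = rng.uniform(0.0, 1.0)           # mixing weight over its whole supported range
    theta = rng.uniform(eps, s_mean)      # exponential mean, at most the observed mean
    mu = rng.uniform(s_mean, s_max)       # normal mean somewhere in the upper score range (assumed)
    # Variance left for the normal: observed variance minus the exponential's variance,
    # kept positive via the small constant and inflated by c to allow for truncation.
    var_norm = c * max(s_var - theta ** 2, eps)
    sigma = float(np.sqrt(var_norm))
    return lam, mu, sigma, theta
```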
For the observed scores, and for neither truncated nor shifted normal and exponential densities (i.e. for the original model), the update equations are the standard EM updates for the two-component mixture. The posterior probability of relevance given a score is obtained by Bayes' rule, and the mixture density is given by Equation 3.1.
We initialize the parameters as described above and iterate the update equations until the absolute differences between the old and new parameter values all fall below their respective tolerances. In this way we target an accuracy of 0.1% for scores and 1 in 1,000 for documents. We also tried a target accuracy of 0.5% and 5 in 1,000, but it did not seem sufficient.
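The following is a generic sketch of EM for a normal-exponential mixture under the usual derivation, with tolerance defaults chosen in the spirit of the accuracy targets above; parameter names follow the hypothetical initialization sketch.

```python
import numpy as np
from scipy.stats import norm, expon

def em_normal_exponential(s, lam, mu, sigma, theta,
                          tol_score=1e-3, tol_lam=1e-3, max_iter=100):
    """Generic EM for a mixture of a normal (relevant) and an exponential (non-relevant).

    lam   : mixing weight of the normal component
    theta : mean of the exponential component (rate = 1/theta)
    """
    s = np.asarray(s, dtype=float)
    for _ in range(max_iter):
        # E-step: posterior probability of relevance by Bayes' rule.
        p_rel = lam * norm.pdf(s, mu, sigma)
        p_non = (1.0 - lam) * expon.pdf(s, scale=theta)
        resp = p_rel / (p_rel + p_non + 1e-300)

        # M-step: weighted maximum-likelihood updates.
        lam_new = resp.mean()
        mu_new = np.average(s, weights=resp)
        sigma_new = np.sqrt(np.average((s - mu_new) ** 2, weights=resp))
        theta_new = np.average(s, weights=1.0 - resp)

        converged = (abs(mu_new - mu) < tol_score and
                     abs(sigma_new - sigma) < tol_score and
                     abs(theta_new - theta) < tol_score and
                     abs(lam_new - lam) < tol_lam)
        lam, mu, sigma, theta = lam_new, mu_new, sigma_new, theta_new
        if converged:
            break
    return lam, mu, sigma, theta
```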
If we use the truncated densities (Equations 9 and 10) in the above update equations, the mean and variance calculated at each iteration would be the expected value and variance of the truncated normal, not the parameters of the underlying full normal that we are looking for. Similarly, the calculated exponential mean would be equal to the expected value of the shifted truncated exponential. Instead of looking for new EM equations, we rather correct the calculated values to the right ones using simple approximations.
Using Equation 26, at the end of each iteration we correct the calculated normal and exponential parameters, again using the values from the previous iteration.
These simple approximations work, but they sometimes seem to increase the number of iterations needed for convergence, depending on the accuracy targeted. Rarely, and for high accuracies only, the approximations may handicap EM convergence, with the intended accuracy not reached even after 1,000 iterations. Generally, convergence happens in 10 to 50 iterations depending on the number of scores (more data, slower convergence), and even with the approximation EM produces considerably better fits than when using the non-truncated densities. To avoid getting locked in a non-converging loop, despite its rarity, we cap the number of iterations at 100. The end-differences we have seen between the observed and expected numbers of documents due to these approximations have always been less than 4 in 100,000.
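As a concrete illustration of the correction just described, the following sketch inverts the standard truncated-normal moment formulas using the previous iteration's parameters; it is a generic stand-in and not necessarily identical to Equation 26.

```python
from scipy.stats import norm

def correct_truncated_normal(mean_trunc, var_trunc, mu_prev, sigma_prev, a, b):
    """Approximately recover full-normal (mu, sigma) from moments of the normal truncated to [a, b]."""
    alpha = (a - mu_prev) / sigma_prev
    beta = (b - mu_prev) / sigma_prev
    Z = norm.cdf(beta) - norm.cdf(alpha)
    d = (norm.pdf(alpha) - norm.pdf(beta)) / Z
    # E[truncated] = mu + sigma * d        =>  mu ~ mean_trunc - sigma_prev * d
    mu = mean_trunc - sigma_prev * d
    # Var[truncated] = sigma^2 * (1 + (alpha*phi(alpha) - beta*phi(beta))/Z - d^2)
    shrink = 1.0 + (alpha * norm.pdf(alpha) - beta * norm.pdf(beta)) / Z - d ** 2
    sigma = (var_trunc / max(shrink, 1e-12)) ** 0.5
    return mu, sigma
```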
We initialize and run EM as described above. After EM stops, we apply the goodness-of-fit test to the observed data and the recovered mixture (see Appendix A). If the null hypothesis is rejected, we randomize the initial values again and repeat EM, for up to 100 times or until the null hypothesis can no longer be rejected. If it is rejected in all 100 runs, we just keep the best fit found. We run EM at least 10 times, even if we cannot reject the null hypothesis earlier. Perhaps a maximum of 100 EM runs is overkill, but we found significant sensitivity to the initial conditions.
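A sketch of the outer fitting loop, assuming (purely for illustration) that the goodness-of-fit test of Appendix A is a chi-square test on binned scores; init_em and em_normal_exponential are the hypothetical helpers sketched earlier.

```python
import numpy as np
from scipy.stats import norm, expon, chi2

def fit_with_restarts(scores, eps, n_bins, min_runs=10, max_runs=100, alpha=0.05):
    """Repeat EM from random starts until the goodness-of-fit test stops rejecting the null hypothesis."""
    scores = np.asarray(scores, dtype=float)
    best = None
    for run in range(max_runs):
        params = em_normal_exponential(scores, *init_em(scores, eps))
        lam, mu, sigma, theta = params

        # Chi-square goodness of fit: observed vs expected counts per bin.
        counts, edges = np.histogram(scores, bins=n_bins)
        cdf = lambda x: lam * norm.cdf(x, mu, sigma) + (1 - lam) * expon.cdf(x, scale=theta)
        expected = len(scores) * np.diff([cdf(e) for e in edges])
        stat = np.sum((counts - expected) ** 2 / np.maximum(expected, 1e-12))
        dof = max(n_bins - 1 - 4, 1)          # four estimated parameters
        p_value = chi2.sf(stat, dof)

        if best is None or p_value > best[0]:
            best = (p_value, params)           # keep the best fit found so far
        if run + 1 >= min_runs and p_value > alpha:
            break                              # null hypothesis can no longer be rejected
    return best
```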
Some fits, irrespective of their quality, can be rejected on IR grounds. Firstly, the estimated number of relevant documents cannot exceed the total number of documents; moreover, since each fit also accounts for some non-relevant documents, we can tighten this inequality somewhat (Equation 15).
run | median #bins | #bins near cap (200) | null hyp. not rejected | anomaly in observed range | median anomaly rank | anomaly significant | comments
2007-default | 56.5 | 4 (8%) | 2 (4%) | 33 (66%) | 29 | 5 (10%) | no smth or sampling |
2007-A | 37 | 1 (2%) | 32 (64%) | 40 (80%) | 34 | 5 (10%) | smth + 1/3 strat. sampl. |
2007-B | 36 | 1 (2%) | 30 (60%) | 32 (64%) | 61.5 | 1 (2%) | smth + 1/3 strat. sampl. |
2008-default | 93 | 6 (13%) | 0 (0%) | 29 (64%) | 89 | 0 (0%) | no smth or sampling |
2008-A | 63 | 1 (2%) | 5 (11%) | 30 (67%) | 98 | 0 (0%) | smth + 1/3 strat. sampl. |
2008-B | 66 | 4 (9%) | 9 (20%) | 31 (69%) | 45 | 1 (2%) | smth +1/3 strat. sampl. |
We are still experimenting with such conditions, and we have not applied them for producing any of the end-results reported in this paper.
Down-sampling has the effect of eliminating some of the right tails, leading to fewer bins when binning the data. Moreover, the fewer the scores, the fewer EM iterations and runs are needed for a good fit (data not shown). Down-sampling the scores also helps support the null hypothesis. At 1-out-of-3 stratified sampling, the null hypothesis cannot be rejected at a significance level of 0.05 for 60-64% of the 2007 topics and for 20% of the 2008 topics. Non-stratified down-sampling with 0.1 probability raises this to 42% for the 2008 topics. Extreme down-sampling, keeping only around 1,000 to 5,000 scores, supports the null hypothesis in almost all fits.
Consequently, the number of scores and bins plays a big role in the quality of the fits according to the test; there is a positive correlation between the median number of bins and the percentage of rejected null hypotheses. This effect does not seem to be the result of information loss due to down-sampling; we still get more support for the null hypothesis when reducing the number of scores by down-truncating the rankings instead of down-sampling them. This is an unexpected result; we rather expected that the increased number of scores and bins would be accounted for by the increased degrees-of-freedom parameter of the corresponding distributions. Irrespective of sampling and binning, however, all fits look reasonably good to the eye.
In all runs, for a small fraction of topics (2-13%) the optimum number of bins is near our capped value of 200. For most of these topics, when looking for the optimal number of bins within the allowed range (candidate numbers are tried with a step of 5%), the binning method does not converge. This means there is no optimal binning: the algorithm identifies the discrete structure of the data as a more salient feature than the overall shape of the density function. Figure 4 demonstrates this.
Since the scores are already randomized to account for rounding (Section 4.2), the discrete structure of the data is not a result of rounding but rather comes from the retrieval model itself. Internal statistics are usually based on document and word counts; when these are low, the statistics are "rough", introducing the discretization effect.
Concerning the theoretical anomaly of the normal-exponential mixture, we investigate the number of fits presenting the anomaly within the observed score range, i.e. at ranks deeper than rank 1. We see that the anomaly shows up in a large number of topics (64-80%). The impact of the non-convexity on the s-d method is that the method turns "blind" at the affected rank numbers, restricting the estimated optimal thresholds to deeper ranks. However, the median rank number down to which the problem exists is very low compared to the median estimated number of relevant documents (7,484 or 32,233), so such shallow optimal thresholds are unlikely on average and thresholding should not be affected. Consequently, the data suggest that the non-convexity should have an insignificant impact on s-d thresholding.
For a small number of topics (0-10%), the problem extends to much deeper ranks and the non-convexity should have a significant impact. Still, we argue that for a good fraction of such topics, such a deep anomaly indicates a fitting problem rather than a theoretical one. Figure 5 explains this further.
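The non-convexity itself follows from a well-known property of the normal-exponential mixture: the likelihood ratio of the two components, and hence the probability of relevance given the score, increases only up to the score mu + sigma^2/theta and decreases above it. A small sketch of locating the affected region for a fitted mixture, using the hypothetical parameter names from the earlier sketches:

```python
import numpy as np

def anomaly_region(scores, mu, sigma, theta):
    """Return the score above which P(rel|score) decreases, and how many
    observed documents (top ranks) fall inside that anomalous region."""
    s_star = mu + sigma ** 2 / theta            # turning point of the likelihood ratio
    ranks_affected = int(np.sum(np.asarray(scores) > s_star))
    return s_star, ranks_affected
```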
Since there are more data at lower scores, EM results in parameter estimates that fit the low scores better than the high scores. This is exactly the opposite of what is needed for IR purposes, where the top of the ranking is more important. It also introduces a weak but undesirable bias: the longer the input ranked list, the lower the estimates of the means of the normal and the exponential; this usually results in larger estimates of the number of relevant documents and of the optimal thresholds.
Trying to neutralize the bias without actually removing it, input ranking lengths are better chosen according to the expected number of relevant documents. This also makes sense for the estimation method irrespective of biases: we should not expect much when trying to estimate, for example, 100,000 relevant documents from only the top 1,000 scores. As a rule of thumb, we recommend input ranking lengths of around 1.5 times the expected number of relevant documents, with a minimum of 200. According to this recommendation, the 2007 rankings truncated at 25,000 are spot on, but the 100,000-document rankings of 2008 fall short by 20%.
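The rule of thumb in code form, trivial but stated for completeness:

```python
def recommended_ranking_length(expected_relevant, factor=1.5, minimum=200):
    """Input ranking length of about 1.5 times the expected number of
    relevant documents, with a minimum of 200."""
    return max(int(factor * expected_relevant), minimum)
```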
Recovering the mixture with EM has proven to be "tricky". However, with the improvements presented in this paper, we have reached rather stable behavior which produces usable fits.
EM's initial parameter settings can be tightened further, resulting in better estimates in fewer iterations and runs, but we have been rather conservative in order to preserve the generality of the approach.
As a result of how EM works--giving all data equal importance--a weak but undesirable ranking-length bias is present: the longer the input ranking, the larger the estimates. Although the problem can for now be neutralized by choosing the input lengths in accordance with the expected number of relevant documents, any future improvement of the estimation method should take into account that score data are of unequal importance: the data should be fitted better at their high end.
Whatever the estimation method, conditions for rejecting fits on IR grounds, such as those investigated in Section 4.3.5, seem to have the potential for considerable further improvements.