25 Jan

Significance Test is Significant

One of the common questions IR researchers ask is: ‘Is this new retrieval method better than the old one?’ Mostly, we turn to statistics for the answer, using a method called a ‘significance test’.

Given two sets of performance measurements (typically MAP or Precision@K), one for each system, we run a significance test and get the probability of seeing a difference at least as large as the observed one if both result sets came from the same distribution. If this probability is smaller than some predetermined value (e.g. 0.05), we conclude that there is a significant difference between the performance of these systems.
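To make this concrete, here is a minimal sketch of one such test, a paired randomization (sign-flip) test, in Python. The function name, the sample count and the per-query scores are all made up for illustration; it assumes both systems were evaluated on the same queries.

```python
import random


def paired_randomization_test(scores_a, scores_b, num_samples=10000, seed=0):
    """Two-sided paired randomization test on the mean per-query difference.

    Returns an estimated p-value. All names and defaults here are
    illustrative assumptions, not taken from any particular toolkit.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")

    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    extreme = 0
    for _ in range(num_samples):
        # Under the null hypothesis the two system labels are interchangeable
        # for each query, so every difference may flip sign with chance 1/2.
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        # Small tolerance guards against floating-point rounding.
        if abs(shuffled) >= observed - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimate away from an exact zero.
    return (extreme + 1) / (num_samples + 1)


# Hypothetical per-query average precision for ten queries.
old_system = [0.20, 0.35, 0.10, 0.50, 0.42, 0.31, 0.15, 0.27, 0.60, 0.38]
new_system = [0.28, 0.41, 0.19, 0.55, 0.50, 0.40, 0.22, 0.30, 0.71, 0.45]

p_value = paired_randomization_test(new_system, old_system)
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

The test only estimates the p-value by sampling, so more samples give a steadier estimate at the cost of run time.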

In statistical terms, the probability here is called the ‘p-value’, and the assumption that both sets are from the same distribution is called the ‘null hypothesis’, which we may hope to reject (especially if we devised this new method).

As you may guess, there are many significance tests used in IR, differentiated by the assumptions they make (the underlying distributions, and so on). According to a recent paper comparing these methods, the randomization test, the bootstrap test and the t-test give largely the same results, while the Wilcoxon and sign tests, which are simplified forms of the randomization test, give different results from the others and are therefore discouraged.
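One way to get a feel for why a simplified test can disagree with the others: the sign test only counts which system wins on each query and throws away how much it wins by. The sketch below is my own illustration of that point (function name and scores are made up), not a reproduction of the paper's experiments.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Exact two-sided sign test on paired scores; ties are dropped."""
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins + losses
    if n == 0:
        return 1.0  # no query separates the systems
    k = min(wins, losses)
    # Probability of k or fewer "minority" outcomes under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical scores: the baseline gets 0.30 on every query.
baseline = [0.30] * 10
# Same win/loss pattern (8 wins, 2 losses), very different magnitudes.
small_gains = [0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.29, 0.29]
large_gains = [0.60, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60, 0.29, 0.29]

p_small = sign_test(small_gains, baseline)
p_large = sign_test(large_gains, baseline)
print(p_small == p_large)  # the sign test cannot tell these two cases apart
```

A test that uses the actual differences, such as the randomization test or the t-test, would treat these two cases very differently.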

Tags: Essay, IR, Statistics