25 Jan

Significance Test is Significant

One of the common questions IR researchers ask is: ‘Is this new retrieval method better than the old one?’ Usually we turn to statistics for the answer, using a method called a ‘significance test’.

Given two sets of performance measurements (typically MAP or Precision@K), one for each system, we run a significance test and get the probability of observing a difference at least as large as the one we measured if both result sets came from the same distribution. If this probability is smaller than some predetermined value (e.g. 0.05), we conclude that there is a significant difference between the performance of the two systems.

In statistical terms, this probability is called the ‘p-value’, and the assumption that both sets come from the same distribution is called the ‘null hypothesis’, which we may hope to reject (especially if we devised the new method).

As you may guess, there are many significance tests used in IR, differentiated by the assumptions they make (the underlying distributions, and so on). A recent paper comparing these methods found that the randomization test, the bootstrap test and the t-test give largely the same results, while the Wilcoxon and sign tests, simplified forms of the randomization test, give results that differ from the others, and their use is therefore discouraged.
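To make the idea concrete, here is a minimal sketch of a paired randomization test over per-query scores. It is only my own illustration, not the procedure from the paper: the function name `randomization_test`, the sign-flipping scheme, the number of trials and the add-one smoothing are all assumptions chosen for simplicity.

```python
import random
from statistics import mean


def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization (permutation) test.

    scores_a and scores_b hold per-query scores (e.g. average precision)
    for the same queries, in the same order. Returns an estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the two system labels are exchangeable,
        # so each per-query difference may have its sign flipped.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            extreme += 1
    # Add-one smoothing (an assumption) keeps the estimate above zero.
    return (extreme + 1) / (trials + 1)
```

Calling `randomization_test(new_scores, old_scores)` with two equally long lists of per-query scores returns the estimated p-value, which can then be compared against the chosen threshold such as 0.05.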

Tags : Essay,IR,Statistics
21 Jan

Recommended Reading for IR Research Student

This is a survey article I found while taking an IR class last fall. Although its title seemed interesting, I couldn’t get a good grasp of it then, as I had little understanding of the IR field in general.

When I read it again a few days ago, I finished it with much greater interest. Not only did it provide a well-chosen reading list, it also gave me a sense of IR research trends, as seen through the papers that were published and well received.

Since my life as a grad student will revolve around papers, it seems worthwhile to summarize the lessons I learned.

What makes a ‘classic’ paper?

My first question was why this handful of papers was chosen from among the thousands of IR papers published so far. What made them so special?

Novelty

Any research paper should be ‘new’ in some way; that’s what makes it ‘research’. Yet many of these classic papers stand out for how far-reaching and useful their novelty is. Some asked questions people had never thought of before, others provided a whole new perspective on an existing problem, and still others applied an existing technique or theory to a new class of problems, or brought in knowledge from another field to solve an IR problem.

While the selected papers are top quality by most other criteria, some were selected despite obvious limitations in methodology or performance, which shows the value of bringing in new ideas and approaches.

Result

Since IR is a field rich in performance metrics (although few seem to know which one is best), work with improved performance is noteworthy. Based on my five months of observation at CIIR, there seem to be many cases in which a method with superior results comes first, by chance or even by mistake(!), and the theoretical justification follows.

Of course, provided that the performance improvement is consistent and significant, most of these ‘result’ papers later prove to be novel as well.

Methodology

If ‘novelty’ papers excel at finding a problem worth solving, other papers draw attention for how they solve their problems. Even without a groundbreaking idea or superior results, these papers are widely read because they teach valuable lessons, mostly about experiment design and interpretation. These ‘methodology’ papers should be especially valuable for students just entering the field.

Survey

As a topic becomes established as a field of research and results accumulate over time, it becomes increasingly difficult for individual researchers to keep up with the results of past research. That’s where survey papers are needed: they summarize most of the major discoveries in a single paper.

Which track should one pursue?

Given these criteria for good papers, researchers may ask themselves what their strategy should be, since most papers seem to be strong on one particular criterion, although a good number of papers do well on every one.

Here’s my crude suggestion. (Although I know I don’t know enough to make this kind of remark.)
  • Novelty Papers : If you are confident about your creative potential, that is, you tend to pinpoint things most people would not come up with. To be successful this way, you should read widely (even in related fields), which may give you a useful combination of ideas no one has thought of before.
  • Result Papers : If you’re good at tweaking a variety of retrieval parameters and settings, getting superior results may be easier for you. All you need to do then is find the theory that ‘explains’ your results.
  • Methodology Papers : If you have the experience and rigor to investigate a given issue better than most people, you may turn most problems you work on into quality research papers, even without good results(!).
  • Survey Papers : This kind of work is probably best left to guru-level researchers whose careers trace the advancement of the field itself.

Tags : Essay,IR
29 Nov

Having a Right Measure for IR

This might be my first post as an IR researcher. I just joined the Information Retrieval Lab at UMass, and I am having a busy time getting used to life in the USA while starting my career as a researcher.

While I have considered blogging a good pastime, I decided that I may even need to blog for the sake of my research. It may help me develop immature research ideas, learn how others think differently and see things from a more relaxed perspective; research should not feel like work if it is to be fruitful.

As a novice blogger, I started reading what others have written about research. An article about choosing the right measure drew my attention today. In most AI-like problems, having the right measure is critical, since otherwise you would be optimizing in the wrong direction. Conversely, once you have the right measure, you can improve your results and tell how well they work.

The author (Hal) says that the F-measure (the weighted harmonic mean of precision and recall) is desirable for classification problems where the problem space can be divided into answer vs. non-answer, in addition to the well-known rarity reason: accuracy is useless when one class (usually non-answer) makes up the majority. This assertion seems semantically correct, in the sense that precision and recall, the components of the F-measure, are defined assuming the division of the problem space the author suggests.
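The rarity argument is easy to check with a small sketch. The function names and the example counts below are my own illustration (not taken from Hal's post), and `beta` is the usual weight that trades recall against precision, with 1.0 giving the balanced F1.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if tp == 0:
        # No true positives: precision or recall is zero, so F is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def accuracy(tp, fp, fn, tn):
    """Fraction of all items that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical collection: 1000 items, only 10 of them are answers,
# and a lazy classifier labels everything as non-answer.
lazy_accuracy = accuracy(tp=0, fp=0, fn=10, tn=990)
lazy_f1 = f_measure(tp=0, fp=0, fn=10)
```

Here the lazy classifier scores an accuracy of 0.99 while its F-measure is 0.0, which is exactly why accuracy says little when the non-answer class dominates.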

But in another post, Chris Manning (an NLP textbook author!) restricts the usefulness of the F-measure to cases with no partial-match problems; e.g. using the F-measure for an NER task might be problematic.

Back in the IR field, I have started to think about the problem with IR’s dominant measure, MAP. It assumes binary relevance judgments, which is quite naive given the complex notion of relevance. As newer metrics such as NDCG become widely adopted by the IR community, this limitation of MAP will matter less.
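A small sketch shows what binary judgments throw away. The code below is my own illustration: it computes average precision (MAP is its mean over queries) and one common variant of NDCG that uses a gain of 2^grade - 1 and a log2 rank discount; other formulations exist, and the graded judgments are made up for the example.

```python
import math


def average_precision(relevant_flags):
    """AP for one ranked list of binary judgments (True = relevant).

    Assumes every relevant document appears somewhere in the list.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def dcg(gains):
    """Discounted cumulative gain with exponential gain, log2 discount."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Two hypothetical rankings of the same documents, graded 0 to 3.
ranking_x = [3, 1, 0]   # highly relevant document first
ranking_y = [1, 3, 0]   # marginally relevant document first
```

Once the grades are collapsed to relevant vs. non-relevant, both rankings look like `[True, True, False]` and get the same average precision, while `ndcg(ranking_x)` is higher than `ndcg(ranking_y)` because it rewards putting the highly relevant document first.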

Tags : IR,Essay
7 Oct

GuestBook

Thank you for visiting LiFiDeA.

Please click ‘comments’ to view guestbook and leave your messages.
