21 Aug

검색 로그(Query Log)로 할 수 있는 것

웹의 발전은 정보검색(IR)이라는 분야를 도서관학의 일개 전공분야에서 최신 기술과 막대한 인력, 자원이 투입되는 거대 산업으로 바꾸어 놓았습니다. 지금도 대부분의 사람에게 웹 검색은 일상사가 되었지만, 앞으로도 위키/블로그와 같은 데이터 생산 및 공유 기술의 개발, 그리고 그에 따른 데이터의 증가에 따라 웹 검색의 수요는 계속 늘어날 것입니다.

규모도 규모지만, 웹은 IR 연구자들에게 항상 새로운 과제를 제시합니다. 홈페이지 검색, 뉴스 검색, Q&A 검색, 블로그 검색 등은 웹이 없었다면 생기지도 않았을 IR의 세부 연구주제입니다. 새로운 형태의 웹 서비스는 고유의 검색 문제를 안고 태어난다고 해도 과언이 아닙니다. 문제만 준다면 조금 얄미울텐데, 다행히 대부분은 해결의 실마리도 따라옵니다. 웹 문서의 링크 구조를 활용하는 PageRank 알고리즘, 검색에 활용되기 시작한 태그 데이터 등이 그 사례입니다.

하지만 웹의 발전이 IR연구자에게 주는 가장 귀한 선물은 역시 검색 로그(query log)가 아닐까 합니다. 사용자가 서비스에 접속하여 어떤 질의어를 입력하는지, 그리고 어떤 문서를 클릭하는지를 그대로 기록한 검색 로그에는 연구자들이 궁금해하는 검색의 비밀이 고스란히 담겨있기 때문입니다. 더 좋은 것은 이런 데이터를 거의 무제한으로 비용 없이 얻을 수 있다는 점입니다.

문제는 검색 업체가 아닌 이상은 이 데이터에 접근할 수 없다는 점인데, 최근에 MSR에서 검색 로그 데이터를 선별된 연구자에게 제공하는 워크샵을 제안했습니다. MSN의 검색 로그 1500만건이 클릭 데이터와 같이 제공된다고 하니 흔치않은 기회인 것 같습니다.

그렇다면 검색 로그로 무엇을 할 수 있을까요? 가장 단순하게는 질의어 통계를 낼 수 있을 것입니다. 네이버나 다음에서 보여주는 ‘인기 검색어’등이 그 예입니다. 또한, 사용자 활동이 세션(접속->종료)별로 기록된다는 점을 감안하면 사용자가 먼저 입력한 질의어를 가지고 다음 질의어를 예측하는 모델을 만들 수 있을 것입니다. 질의어 맞춤법 교정(query correction)이나 관련 검색어 제안 등이 이러한 사례입니다. 이를 좀더 확장하면 사용자의 검색 실력을 감지하여 그에 따라 적절한 처리를 해줄 수도 있습니다.

올해 SIGIR에서 화제가 되었듯이, 질의어를 분류하고 군집화하여 적절한 처리를 하는 데에도 검색 로그는 필수적입니다. 트렉 등에서 제공하는 수십 수백건의 질의어를 갖고 의미있는 질의어 모델을 만드는 것은 상식적으로 불가능하겠죠. 또한 연구용 컬렉션(문서-질의어 모음)은 실제 검색 서비스에서 나온 데이터가 아니라는 한계도 있습니다.

각 검색에에 대해 클릭된 문서가 질의어에 관련하여 옳은(relevant) 문서라고 간주하면, 대용량 검색 로그를 가지고 검색 모델(retrieval model) 자체를 학습하는 것도 가능합니다. 실제 검색 로그는 검색 서비스를 개발하고 개선하는 데 기반이 되며, 검색 모델 학습(Learning to rank)이 웹 검색의 폭발적 확산과 함께 본격적으로 연구되기 시작한 것도 우연은 아닐 겁니다. 하지만 사용자가 목록에 있는 모든 문서를 검토하고 클릭을 하는 것은 아니며, 원래 검색 의도와 다른 클릭도 있을 수 있기 때문에, 이런 방식으로 검색 로그를 활용하기 위해서는 다양한 노이즈 제거 기법이 적용되어야 합니다.

이러한 검색 로그 분석에는 대용량 데이터가 사용되기 떄문에 MapReduce 등의 대용량 병렬 처리 환경이 필요합니다. 저는 최근에 연구실에 있는 검색 로그를 학교의 Hadoop Cluster에서 분석하고 있는데, Pig라는 환경이 MapReduce를 데이터베이스(RDBMS)처럼 편리하게 사용하도록 도와줍니다. 좀더 익숙해지면 이곳을 통해 소개하도록 하겠습니다.

Tags : IR,검색로그 Print Comments(2) Trackback
15 Jul

형태소분석기

Tags : IR Print Comments Trackback
12 Jul

Discriminative Models for Information Retrieval

PubDate(2004), PubPlace(SIGIR) Author(Nallapati) keyword(Discriminative Model,SVM,Maximum Entropy)

Summary

Applying discriminative model to IR task resulted in improved result for Homepage finding task that typically requires combining many arbitrary features.

Content

Background

  • People most have approached IR using generative classifier
    • LMIR : Assuming that each document forms its own class, it’s similar to classifying given query to best-matching document class using Naive-bayes classifier

Contribution

  • Problem for IR as classification
    • Out-of-vocabulary problem : query-word may not appear in document
      • Overcome by aggregate feature
    • Unbalance of relevance/non-relevance class
      • Overcome by undersampling nr class

Experiment

  • Application of SVM to

Future Work

Comment

Reference

Tags : Paper,IR Print Comments Trackback
10 Jul

On Ranking Techniques for Desktop Search

PubDate(2008), PubPlace(TIS) Author(Cohen,Domshlak,Zwerdling) keyword(Desktop search,Learning to rank,)

Summary

Desktop search experiment with known-item search task.

Content

Background

  • Stuff I’ve seen(SIS) : users sort result by last-update date more frequently than by IR ranking
    • The older the data, the less often it is used
  • Type of information stored in desktop
    • Ephemeral : reminder, short-lived
    • Working : related to ongoing work
    • Archive : long-term resource

Contribution

  • Novel Feature
    • Level : the distance of a file from uppermost directory
    • DirRank : The probability to open a file in specific directory is proportional to the number of files previous opened from this (and its sub) directory. (normalized by files in directory)
  • Selectivity ; combine values of content-similarity feature by the inverse of of the file number with non-zero feature value.
    • e.g. If user query was match with the filename field of 100 files, the score of name field is divided by 100.

Experiment

  • Queries are grouped by the no. of results returned
  • As more results are returned by each query, date-related features became more useful. (selectivity)
  • Combination of result by selectivity was proven to be as effective as learning-based methods

Future Work

Comment

Reference

Tags : Paper,IR Print Comments Trackback
10 Jul

Natural Language Processing and Information Retrieval

PubDate(1999), PubPlace() Author(Voorhees) keyword(NLP,IR)

Summary

NLP doesn’t help ad-hoc retrieval.

Content

Background

Contribution

  • Term-normalization may help
    • Term-mismatch is major performance degrading factor.
    • No evidence that ‘meaning’ representation can boost perf.
  • Queries are troublesome for NLP-IR.
  • With sufficient context, BoW model implicitly disambiguates query-term.
    • e.g. Using document similarity measure to build word sense classifier (Word occurrences in similar documents can be considered to have the same meaning)

Experiment

  • Will ‘concept indexing’ based on WordNet synset improve effectiveness?
    • Using extended VSM : linear interpolation of similarity btw. the query and several vectors of document representation
    • Disambiguation : nouns are disambiguated into a synset id
    • 3 subvectors : 1) words not disambiguated 2) synset ids of disambiguated nouns 3) stems of disambiguated nouns
  • Concept indexing hurt performance in most cases
    • More of term mismatches due to errors in word sense disambiguation
    • Some queries were helped
  • ‘101’ run where non-stemmed verbs and adjectives are not counted as match showed worse perf. than baseline—impact of term mismatch again

Future Work

Comment

Reference

Tags : Paper,IR Print Comments Trackback
12 Jul

Optimizing Search Engines using Clickthrough Data

PubDate(2002), PubPlace(SIGKDD) Author(Joachims) keyword(SVM,Clickthrough,Pairwise preference,Learning to rank,Metasearch)

Content

Background

  • LTR based on expert-judged relevance

Contribution

  • Generation pairwise pref. using clickthrough
  • Training this binary ordering using SVM
    • By minimizing rank correlation to optimal ranking via Kendall’s
  • Meta-search (Strive) engine to compare the learned ranking function with existing methods
    • To perform unbiased comparison of different rankings with clickthrough data, rank combination method that equally presents the links from each system.

Experiment

  • As more data was used for training, the algorithm showed lower rate of error.

Future Work

Comment

The author used the search engine himself to verify his result. I may need to build metasearch engine myself, which can be useful for a variety of tasks.

Reference

Tags : Paper,IR Print Comments Trackback
25 Jan

Significance Test is Significant

One of common questions IR researchers ask is: ‘Is this new retrieval method better than the old one?’ Mostly, we turn to Statistics for the answer, which is the method called ‘significance test’

Given two sets of performance measurements(typically MAP or Precision@K) for both systems, we run significance test and get the probability that the both result set is from the same distribution. If this probability is smaller than some predetermined value (e.g. 0.05), we know that there is no significant difference between the performance of these systems.

In Statistical terms, the probability here is called ‘P-value’ and the assumption that both set is from the same distribution is called ‘Null Hypothesis’, which we may hope to deny. (especially if we devised this new method)

As you may guess, there are many methods for significance test used for IR, differentiated by the assumptions they make—underlying distributions, and so on. According to recent paper in which these methods are compared, it is found that randomization test, bootstrapping test and t-test shows the same result, while Wilcoxon and sign test, simplified forms of randomization test, shows different result from others and therefore discouraged from the use.

Tags : Essay,IR,Statistics Print Comments Trackback
21 Jan

Recommended Reading for IR Research Student

This is the survey article I found while taking IR class last Fall. While this article seemed interesting from its title, I couldn’t get the good grasp of this one as I had little understanding of the IR field in general.

When I read this one again a few days ago, I could finish this one with greater interest. Not only did it provide me with a well-chosen reading list, but also it gave me a clue on IR research trends seen from the perspective of papers published and well-received.

Since my life as a grad student may circle around the papers, it should be worthwhile to summarize lessons I learned.

What makes a ‘classic’ paper?

My first curiosity was why these handful of papers were chosen among thousands of IR papers that came out to the world so far. What made them so special?

Novelty

Any research paper should be ‘new’ in some ways—that’s what makes it a ‘research’. Yet many of these classic papers are greater and more beneficial in their novelty. Some of them started to ask questions people have never even thought of before, some others provided a whole new perspective to an existing problem, still others applied existing technique and theory to a new venue of problems or brought in the knowledge of other field to solve an IR problem.

While selected papers are top-quality in most other criteria, some were selected despite their obvious limitation in methodology or performance, from which we can see the value of bringing in new ideas and approaches.

Result

Since IR is a field rich in performance metric—although few seem to know which one is the best, a work with improved performance is noteworthy. Based on my 5-month long observation in CIIR, there seem to be many cases in which a method with superior result comes out first—by some chance or mistake(!), followed by theoretical justification.

Of course, given that their performance improvement is consistent and significant, most of these ‘result’ papers are proven to be novel later on.

Methodology

If I say that ‘novelty’ papers are excellent in finding a problem worth-solving, some papers draw attention for how they solved the problems. Even without groundbreaking idea or superior result, these papers are read by many people as they teach valuable lessons—mostly in terms of experiment design and interpretation. These ‘methodology’ papers should be especially valuable for students just entering the field.

Survey

As a topic is established as a field of research and the result accumulates for some time, it becomes increasingly for individual researchers to follow-up the result of past research. That’s where the survey papers are needed, in which most of major discoveries are summarized in a single paper.

Which track should one pursue?

Given these conditions for good papers, researchers may ask themselves what their strategy should be here, since most papers seem to have strength in a particular criteria—although there a good number of papers qualified in every perspective.

Here’s my crude suggestion. (Although I know I don’t know well enough to make this kind of remark.)
  • Novelty Papers : If you are confident about your creative potential—you tend to pinpoint things most people may not come up with. To be successful this way, you should read a wide breadth of literature (even in related fields), which may give you a useful combination of ideas no one thought of before.
  • Result Papers : If you’re good at tweaking with a variety of retrieval parameters and settings, getting superior result may be easier for you. All you need to do is find the theory that ‘explains’ your result.
  • Methodology Papers : If you have considerable experience and rigor to investigate given issue better than most people, you may turn most problem you work on into quality research paper—even without good result(!).
  • Survey Papers : This kind of work would probably be left to guru-level research whose research career shows the advancement of the field itself.

Reference

Tags : Essay,IR Print Comments Trackback
29 Nov

Having a Right Measure for IR

This might be my first posting as a IR research. I just entered Information Retrieval Lab in UMass, having a busy time getting used to the life in USA while starting my career as a research.

While I have considered blogging as a good pastime activity, I decided that I may even need to do blogging for the purpose of my research. It may help me develop immature research ideas, learn how others think differently and see things from a more relaxed perspective—a research should not be considered a work to be fruitful.

As an novice blogger, I started to read what others wrote about research. The article about right measure to use drew my attention today. In most of AI-like problems, having the right measure is critical since you’d be optimizing for the wrong direction otherwise. Conversely, if only you got the right measure, you can improve your result and find how good it worked.

The author(Hal) says that F-measure(weighted harmonic mean of precision and recall) is desirable for classification problems where the problem space can be divided into answer vs. non-answer, in addition to well-known rarity reasons—accuracy is useless when one(usually non-answer) class takes up majority. This assertion seem to be semantically correct in a sense that precision and recall – components of F-measure – are defined assuming problem space division suggested by the author.

But in another posting by Chris Manning (a NLP textbook author!), the usefulness of F-measure is restricted to the cases where there are no partial-match problems—e.g. using F-measure for NER task might be problematic.

Back in IR field, I start to think about the problem of dominant measure of IR—MAP. It assumes binary relevance judgment, which is quite naive given the complex notion of relevance. As the new metrics such as NDCG are starting to be widely adopted by IR community, the limitation of MAP will become less significant.

Tags : IR,Essay Print Comments(1) Trackback