A Probabilistic Retrieval Model for Semistructured Data
My first paper, ‘A Probabilistic Retrieval Model for Semistructured Data’ (joint work with Xiaobing Xue and W. Bruce Croft), will be presented at ECIR’09 in Toulouse, France.
It started as a course project in Advanced Database. As a natural intersection of databases and IR, the semi-structured (XML) document retrieval problem drew my attention.
A brief literature review revealed that most past work focused on choosing the right granularity (XML element) for retrieval. Also, most of that work assumed a structured query (XPath) rather than a keyword query.
I wanted to see the problem differently. The first obvious thing was that formulating an XPath query is beyond the capability of average users; it’s hard even for me!
Thinking about typical querying behavior made me realize that we implicitly map each query term to some aspect of the item we are looking for.
Suppose a user is trying to find the movie ‘French Kiss’ with partial information about its cast (‘meg ryan’) and genre (‘romance’). He or she may type ‘meg ryan romance’, yet it is clear which aspect of the movie the user meant by each query word. We can infer this mapping between query terms and document fields by Bayesian estimation (more details are in the paper). Here is the XML document for the movie:
<?xml version="1.0" encoding="ISO-8859-1"?>
<movie>
<title>French Kiss</title>
<year>1995</year>
<releasedate>USA:5 May 1995</releasedate>
...
<language>English</language>
...
<genre>Comedy</genre>
<genre>Romance</genre>
...
<country>USA</country>
<actors>
...
<actress>Ryan, Meg I</actress>
</actors>
<team>
<director>Kasdan, Lawrence</director>
...
</team>
<plot>
An American woman Kate(Meg Ryan) goes on trip to France
in a desperate effort to find her romance back.
...
</plot>
</movie>
Given this observation, and taking into account that each aspect of information is encoded in a different XML element, it is natural that a ranking algorithm for this kind of document can benefit from the mapping between query words and document fields. The solution is to put a higher weight on the element that the user most likely intended. In the example above, the cast element (actors in the XML) should be weighted higher for ‘meg ryan’, and the genre element should be weighted higher for ‘romance’.
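To make the idea concrete, here is a minimal Python sketch of per-term field weighting. It is not the code or the exact formulation from the paper. The toy collection, the flat field names (title, cast, genre), the uniform prior over fields, and the smoothing weight `lam` are all assumptions made for illustration. It estimates how likely each field is for a query term using Bayes' rule, then scores a document by weighting each field's match with that probability.

```python
from collections import Counter

# Hypothetical toy collection; field names are simplified for illustration.
DOCS = {
    "french_kiss": {"title": "french kiss", "cast": "meg ryan",
                    "genre": "comedy romance"},
    "doc_a": {"title": "a quiet romance", "cast": "jane doe",
              "genre": "drama"},
    "doc_b": {"title": "space war", "cast": "john roe",
              "genre": "action"},
}
FIELDS = ["title", "cast", "genre"]

# Collection-level term counts per field, used to estimate P(term | field).
FIELD_COUNTS = {
    f: Counter(t for d in DOCS.values() for t in d[f].split())
    for f in FIELDS
}
FIELD_TOTALS = {f: sum(c.values()) for f, c in FIELD_COUNTS.items()}


def collection_prob(term, field):
    """Relative frequency of the term in one field across the collection."""
    return FIELD_COUNTS[field][term] / FIELD_TOTALS[field]


def mapping_prob(term):
    """P(field | term) via Bayes' rule.

    A uniform prior over fields is assumed, so it cancels out and only
    the normalized likelihoods remain.
    """
    likelihood = {f: collection_prob(term, f) for f in FIELDS}
    norm = sum(likelihood.values())
    if norm == 0:  # term never seen in the collection: fall back to uniform
        return {f: 1.0 / len(FIELDS) for f in FIELDS}
    return {f: p / norm for f, p in likelihood.items()}


def field_prob(term, doc, field, lam=0.1):
    """P(term | field of this doc), smoothed with the collection estimate.

    Linear (Jelinek-Mercer) smoothing and lam=0.1 are assumptions.
    """
    tokens = doc[field].split()
    tf = tokens.count(term) / len(tokens) if tokens else 0.0
    return (1 - lam) * tf + lam * collection_prob(term, field)


def score(query, doc):
    """Product over query terms of the mapping-weighted field probabilities."""
    total = 1.0
    for term in query.split():
        weights = mapping_prob(term)
        total *= sum(weights[f] * field_prob(term, doc, f) for f in FIELDS)
    return total


def rank(query):
    """Return (doc name, score) pairs, best match first."""
    scored = ((name, score(query, doc)) for name, doc in DOCS.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(mapping_prob("romance"))
    print(rank("meg ryan romance"))
```

In this toy collection ‘romance’ appears in both a title and a genre field, so the genre field gets the largest weight for that term without the title being ignored entirely. The query ‘meg ryan romance’ then ranks the ‘French Kiss’ document first.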
This simple idea later turned out to improve retrieval performance significantly. The performance gain was more noticeable for collections with clear semantics (e.g. movie descriptions), since it was easier for the system to map each query word to the correct document field.
I’m currently working on applying this retrieval model to the desktop search problem, where the XML data are replaced with documents that have metadata fields.