[Paper Note]. A Unified and Discriminative Model for Query Refinement

Paper Notes 1: Jiafeng Guo, Gu Xu, Hang Li, Xueqi Cheng. A Unified and Discriminative Model for Query Refinement. In Proceedings of the 31st Annual International ACM SIGIR Conference (SIGIR’08), 379-386, 2008.
[Pdf]
Categories and Subject Descriptors: [Information Storage and Retrieval]: Information Search and Retrieval — Query formulation(所属研究领域)

Abstract
Query refinement typically includes a number of tasks such as spelling error correction, word splitting, word merging, phrase segmentation, word stemming, and acronym expansion.(查询优化主要包括了这些任务) This paper proposes employing a unified and discriminative model for query refinement. Specifically, it proposes a Conditional Random Field (CRF) model suit-able for the problem, referred to as Conditional Random Field for Query Refinement (CRF-QR)refinement tasks can be performed simultaneously (此方法最大的优点)and thus the accuracy of refinement can be enhanced.

1.Introduction
In fact, many ill-formed queries can be found from the query logs of web search engines. The question then becomes whether we can offer a solution during search which automatically reformulates queries, in order to better represent users’ search needs and help users more easily find the relevant information. (研究此问题的背景)
A straightforward application of existing models, for example conventional CRF, would not work, because the output space would necessarily become extremely large. (传统模型的问题)
In learning, we employ Maximum Likelihood Estimation to estimate the parameters of the model. In prediction, we employ the Viterbi Algorithm to find the best sequence of refined query words. (学习阶段用最大似然估计来估计模型参数, 预测阶段用维特比算法来查找最好的查询优化词序)

2.Related Work
Li et al[2006]. proposed a method for spelling error correction by using a Maximum Entropy (ME) model as well as the Source Channel model. They utilized distributional similarities between the query word and its correction candidate as features in the ME model.
Cucerzan and Brill[2004] addressed a more generalized spelling correction task using a Source Channel model(源信道模型) and query log data. They made use of bigrams as the source model and Weighted Edit Distance as the channel model. However, it is less possible to extend their approach to handle other alteration types, e.g. phrase segmentation.
Peng et al.[2007] proposed a method for conducting stemming on head words of queries. They employed a Statistical Language model in stemming candidate selection.
Risvik et al.[2003] proposed a method for phrase segmentation using the so-called “connexity measures“. A “segmentation score” is computed from connexity values and used as a criterion for segmentation.

3.Query Refinement
Query refinement is a problem as follows. Given an ill-formed query from the user, we refine the query and help the user to better retrieve documents. A number of refinement tasks are involved: spelling error correction, word splitting, word merging, phrase segmentation, word stemming, and acronym expansion.
(多任务整合到一起的必要性). Refinement tasks are often mutually dependent. In the above example, stemming on “learn” needs help from spelling error correction on ”machin”, and vice versa. Furthermore, phrase segmentation on “machine learning” depends on the stemming and the spelling error correction, and vice versa. Therefore, it is better to employ a unified framework with which we perform all the refinement tasks simultaneously. In this way, we are able to significantly boost the accuracy of query refinement.

4.Our Approach
We propose employing a unified and discriminative model in query refinement, specifically the CRF-QR model. There are two types of models: a basic model and an extended model. For ease of explanation, we explain the former model first. In our experiments we used the latter model.
For the task of spelling error correction, for example, we can consider the following refinement operations: deletion, insertion, substitution, and transposition. For word stemming, we can introduce operations such as +s, +ed, and +ing. Similarly, we can define operations for other refinement tasks, as de-scribed in Table.
Table: Query Refinement Tasks and Corresponding Operations

CRF-QR is actually a graphical model, in which vertexes represent refined query words, edges dependencies, and conditional vertexes input query words.
If we assume that only one refinement task can be applied to a query word, then we may directly employ the basic CRF-QR model.
In the extended model, we use multiple sequences of operations as well as their corresponding sequences of intermediate results. In this way, we keep the number of parameters the same as that in the basic CRF-QR model.

5.Experimental results
In our experiments we made use of a real data set consisting of 10,000 queries. The queries were randomly selected from the query log of a commercial web search engine. The average length of queries is 2.8 words.
Among the 10,000 queries, 6,421 queries are refined. Note that the majority of the refinements are phrase segmentation. Furthermore, there are 649 queries with multiple refinements. The labeled data were further divided into a training set containing 7,000 queries and a test set containing 3,000 queries.
Precision (Pre.), recall (Rec.), F1 score(F1), and Accuracy (Acc.)
it could not correctly refine the query “nypark hitel” to “ny park hotel”‘. Specifically, it could not simultaneously change “hitel” to “hotel” and change “nypark” to “ny park”, because one operation needed to leverage the result of the other.

任务: 1. 条件随机场; 2. 维特比算法;
中文的query log. 搜索引擎查询日志库设计为包括约1个月(2008年6月)Sogou搜索引擎部分网页查询需求及用户点击情况的网页查询日志数据集合。为进行中文搜索引擎用户行为分析的研究者提供基准研究语料。

完整版有2G的数据。其中有一小部分样例:
20111230000005 57375476989eea12893c0c3811607bcf 奇艺高清 1 1 http://www.qiyi.com/
20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙传 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
…..
数据格式是:访问时间\t用户ID\t[查询词]\t该URL在返回结果中的排名\t用户点击的顺序号\t用户点击的URL

发表评论

电子邮件地址不会被公开。 必填项已用 * 标注

*

您可以使用这些 HTML 标签和属性: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>