- 一般社団法人 人工知能学会
- 人工知能学会論文誌 (ISSN:13460714)
- vol.30, no.1, pp.161-171, 2015-01-06 (Released:2015-01-06)
Query classification is an important technique for web search engines, allowing them to improve users' search experience. Specifically, query classification methods classify queries according to topical categories, such as celebrities and sports. Such category information is effective in improving web search results, online advertisements, and so on. Unlike previous studies, our research focuses on trend queries that have suddenly become popular and are extensively searched. Our aim is to classify such trend queries in a timely manner, i.e., classify the queries on the same day when they become popular, in order to provide a better search experience. To reduce the expensive manual annotation costs to train supervised learning methods, we focus on a label propagation method that belongs to the semi-supervised learning family. Specifically, the proposed method is based on our previous method that constructs a graph using a corpus, and propagates a small number of ground-truth categories of labeled queries in order to estimate the categories of unlabeled queries. We extend this method to cut ineffective edges to improve both classification accuracy and computational efficiency. Furthermore, we investigate in detail the effects of different corpora, i.e., web/blog/news search results, Tweets, and news pages, on the trend query classification task. Our experiments replicate the situation of an emerging trend query; the results show that web search results are the most effective for trend query classification, achieving a 50.1% F-score, which significantly outperforms the state-of-the-art method by 7.2 points. These results provide useful insights into selecting an appropriate dataset for query classification from the various types of data available.