Text segmentation and Chinese site search

Automatic segmentation and overlapping bigrams are the most common methods for overcoming the lack of explicit word boundaries in Chinese text. Past studies have compared their effectiveness, but findings have been equivocal and site search has been little studied. We compare representatives of the two approaches using a 465,000 page crawl and test queries applicable to the university context. 503 pairs of result sets were judged by 56 Chinese students. Although there are differences on certain queries, we find no overall advantage to either method. To understand the merits of each approach, we analyze cases where they performed differently. Our analysis enumerates situations which favour segmentation, and those which favour bigrams. We observe that further improvements in segmentation accuracy will not improve retrieval effectiveness.

Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—search process

Keywords Chinese IR, ...
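
To make the two indexing approaches named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say which segmenter was evaluated; the forward maximum matching routine here is a swapped-in, simplified stand-in, and the function names, the `lexicon` set, and the sample string are all hypothetical.

```python
def overlapping_bigrams(text):
    """Index every pair of adjacent characters; no lexicon is needed."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character (or nothing) cannot form a bigram.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def max_match_segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching against a word list.

    Assumption: a toy stand-in for an automatic segmenter, chosen only
    for illustration. Unknown characters fall back to single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    sample = "中文信息检索"  # hypothetical example string
    lexicon = {"中文", "信息", "检索"}  # hypothetical toy word list
    print(overlapping_bigrams(sample))
    print(max_match_segment(sample, lexicon))
```

With this toy input, the bigram tokenizer emits cross-boundary pairs such as 文信 alongside the real words, while the segmenter depends entirely on what its word list contains. Those are the kinds of differences the case analysis in the abstract refers to.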
Liyuan Zhou, David Hawking, Paul Thomas
Added 13 Apr 2016
Updated 13 Apr 2016
Type Journal
Year 2015
Where ADCS