Finding text reuse on the web

13 years 11 months ago

Download www.wsdm2009.org

With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of “facts”, statements, rumors, and opinions, and to track their development. Several techniques have previously been proposed for detecting such text reuse between different sources; however, these techniques have been tested against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate on the scale of the web, are significantly more accurate than standard web search ...

Michael Bendersky, W. Bruce Croft
