Duplicate URLs cause serious trouble across the whole pipeline of a search engine, from crawling and indexing to result serving. URL normalization is to transform duplicate URLs...
Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodon...
Interoperability is one of the main issues in creating a networked system of repositories. The approaches range from simply forcing one metadata standard on all participating repos...
Marek Hatala, Griff Richards, Timmy Eap, Jordan Wi...
The massive distribution of the crawling task can lead to inefficient exploration of the same portion of the Web. We propose a technique to guide crawlers' exploration based on the...
We initiate a novel study of clustering problems. Rather than specifying an explicit objective function to optimize, our framework allows the user of a clustering algorithm to speci...
List question answering (QA) offers a unique challenge in effectively and efficiently locating a complete set of distinct answers from huge corpora or the Web. In TREC-12, the med...