Identifying primary content from web pages and its application to web search ranking

14 years 7 months ago

Download www.www2011india.com

Web pages are usually highly structured documents. In some documents, content with diﬀerent functionality is laid out in blocks, some merely supporting the main discourse. In other documents, there may be several blocks of unrelated main content. Indexing a web page as if it were a linear document can cause problems because of the diverse nature of its content. If the retrieval function treats all blocks of the web page equally without attention to structure, it may lead to irrelevant query matches. In this paper, we describe how content quality of diﬀerent blocks of a web page can be utilized to improve a retrieval function. Our method is based on segmenting a web page into semantically coherent blocks and learning a predictor of segment content quality. We also describe how to use segment content quality estimates as weights in the BM25F formulation. Experimental results show our method improves relevance of retrieved results by as much as 4.5% compared to BM25F that treats the ...

Srinivas Vadrevu, Emre Velipasaoglu

Real-time Traffic