Automatically generated HTML, as produced by WYSIWYG programs, typically contains much repetitive and unnecessary markup. This paper identifies aspects of such HTML that may be al...
In this paper, we introduce the notion of ranking robustness, which refers to a property of a ranked list of documents that indicates how stable the ranking is in the presence of ...
With massive book digitization efforts underway, there is a need to develop effective book retrieval strategies. This paper explores the relative contribution of different par...
The technology of opinion extraction allows users to retrieve and analyze people’s opinions scattered over Web documents. We define an opinion unit as a quadruple consisting of...
This paper presents a novel prototype hierarchy based clustering (PHC) framework for the organization of web collections. It simultaneously solves the problem of categorizing web ...