A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal...
This paper describes a hidden Markov model (HMM) based approach to perform search interface segmentation. Automatic processing of an interface is a must to access the invisible co...
Increasingly, many data sources appear as online databases, hidden behind query forms, thus forming what is referred to as the deep web. It is desirable to have systems that can pr...
We consider a network of autonomous peers forming a logically global but physically distributed search engine, where every peer has its own local collection generated by independe...
Josiane Xavier Parreira, Sebastian Michel, Gerhard...
Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over multip...