We consider the problem of dust: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and...
Linked Data (LD) provides principles for publishing data that underpin the development of an emerging web of data. LD follows the web in providing low barriers to entry: publisher...
Norman W. Paton, Klitos Christodoulou, Alvaro A. A...
Ontology construction usually requires a domain-specific corpus for building corresponding concept hierarchy. The domain corpus must have a good coverage of domain knowledge. Wiki...
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with ...
Clustering hypertext document collection is an important task in Information Retrieval. Most clustering methods are based on document content and do not take into account the hype...
Konstantin Avrachenkov, Vladimir Dobrynin, Danil N...