The New Zealand Digital Library offers several collections of information over the World Wide Web. Although fulltext indexing is the primary access mechanism, musical collections ...
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manual...
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis ove...
Similarity search methods are widely used as kernels in various data mining and machine learning applications including those in computational biology, web search/clustering. Near...
The term "Web 2.0" is used to describe applications that distinguish themselves from previous generations of software by a number of principles. Existing work shows that...