We present an adaptive distributed query-sampling framework that is quality-conscious for extracting high-quality text database samples. The framework divides the query-based samp...
Co-occurrence data is quite common in many real applications. Latent Semantic Analysis (LSA) has been successfully used to identify semantic relations in such data. However, LSA c...
For a given set of search engines, a search engine is redundant if its searchable contents can be found from other search engines in this set. In this paper, we propose a method t...
Geoscience studies produce data from various observations, experiments, and simulations at an enormous rate. With proliferation of applications and data formats, the geoscience re...
We consider the problem of dust: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and...