Latent Semantic Indexing (LSI) has been validated to be effective on many small scale text collections. However, little evidence has shown its effectiveness on unsampled large sca...
We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...
Program analysis and automated test generation have primarily been used to find correctness bugs. We present complexity testing, a novel automated test generation technique to ...
One of the most important steps in web crawling is determining the starting points, or seed selection. This paper identifies and explores the problem of seed selection in webscal...
We propose a novel technique for semi-supervised image annotation which introduces a harmonic regularizer based on the graph Laplacian of the data into the probabilistic semantic ...
Yuanlong Shao, Yuan Zhou, Xiaofei He, Deng Cai, Hu...