Active learning may hold the key for solving the data scarcity problem in supervised learning, i.e., the lack of labeled data. Indeed, labeling data is a costly process, yet an ac...
Text classification has matured as a research discipline over the last decade. Independently, business intelligence over structured databases has long been a source of insights fo...
Traditionally, research in identifying structured entities in documents has proceeded independently of document categorization research. In this paper, we observe that these two t...
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated la...
Victor S. Sheng, Foster J. Provost, Panagiotis G. ...
A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal...