Abstract. Extracting information automatically from texts for database representation requires previously well-grouped phrases so that entities can be separated adequately. This pr...
This paper considers the problem of identifying on the Web compound documents (cDocs) ? groups of web pages that in aggregate constitute semantically coherent information entities...
We present a joint model for Chinese word segmentation and new word detection. We present high dimensional new features, including word-based features and enriched edge (label-tra...
The goal of this work is to study the feasibility of a Heterogeneous Data Classification and Search (HDCS) system and to provide a possible design for its implementing. In order t...
Dorin Carstoiu, Alexandra Cernian, Adriana Olteanu...
Improving data quality is a time-consuming, labor-intensive and often domain specific operation. Existing data repair approaches are either fully automated or not efficient in int...
Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Nevi...