In this paper, based on the study of the specificity of historical printed books, we first explain the main error sources in classical methods used for page layout analysis. We sho...
Jean-Yves Ramel, S. Leriche, M. L. Demonet, S. Bus...
A new dictionary-based text categorization approach is proposed to classify the chemical web pages efficiently. Using a chemistry dictionary, the approach can extract chemistry-re...
Chunyan Liang, Li Guo, Zhaojie Xia, Feng-Guang Nie...
This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items...
Wrapping is the process of navigating a data source, semiautomatically extracting data and transforming it into a form suitable for data processing applications. There are current...
Abstract. Automated language identification of written text is a wellestablished research domain that has received considerable attention in the past. By now, efficient and effecti...