PixED (from Pixel to Electronic Document) is aimed at converting document images into structured electronic documents which can be read by a machine for information retrieval. The...
Many documents on the Web are formated in a weakly structured format. Because of their weak semantic and because of the heterogeneity of their formats, the information conveyed by...
— We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a glo...
We review the literature on automatic document formatting with an emphasis on recent work in the field. One common way to frame document formatting is as a constrained optimizatio...
: Text classification, document clustering and similar document analysis tasks are currently the subject of significant global research, since such areas underpin web intelligence,...