Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of c...
We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amo...
In the frame of a Unified Messaging System, a crucial task of the system is to provide the user with key information on every message received, like keywords reflecting the object...
Abstract. In this paper, we present a method for the automatic extraction of numerical fields (zip codes, phone numbers, etc.) from incoming mail documents. The approach is based o...
Extracting titles from a PDFs full text is an important task in information retrieval to identify PDFs. Existing approaches apply complicated and expensive (in terms of calculating...