Parallel web pages are important source of training data for statistical machine translation. In this paper, we present a new approach to sentence alignment on parallel web pages....
The standard model of supervised learning assumes that training and test data are drawn from the same underlying distribution. This paper explores an application in which a second...
This paper investigates the new problem of automatic sense induction for instance names using automatically extracted attribute sets. Several clustering strategies and data source...
Ricardo Martin-Brualla, Enrique Alfonseca, Marius ...
Characterizing the relationship that exists between a person's social group and his/her personal behavior has been a long standing goal of social network analysts. In this pa...
We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...