Sciweavers

31 search results - page 1 / 7
» BlogBuster: A Tool for Extracting Corpora from the Blogosphe...
Sort
View
LREC
2010
216views Education» more  LREC 2010»
13 years 6 months ago
BlogBuster: A Tool for Extracting Corpora from the Blogosphere
This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cleaning arbitrary web pages with the goal of extracting a corpus from web data, ...
Georgios Petasis, Dimitrios Petasis
ACL
2012
11 years 7 months ago
ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora
The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible...
Marcis Pinnis, Radu Ion, Dan Stefanescu, Fangzhong...
CICLING
2009
Springer
14 years 5 months ago
Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation
We use existing tools to automatically build two parallel treebanks from existing parallel corpora. We then show that combining the data extracted from both the treebanks and the ...
John Tinsley, Mary Hearne, Andy Way
WWW
2003
ACM
14 years 5 months ago
Annotating Web pages for the needs of Web Information Extraction Applications
This paper outlines our approach to the creation of annotated corpora for the purposes of Web Information Extraction, and presents the Web Annotation tool. This tool enables the a...
Georgios Sigletos, Dimitra Farmakiotou, Konstantin...
COLING
2010
12 years 12 months ago
Mining Large-scale Comparable Corpora from Chinese-English News Collections
In this paper, we explore a CLIR-based approach to construct large-scale Chinese-English comparable corpora, which is valuable for translation knowledge mining. The initial source...
Degen Huang, Lian Zhao, Lishuang Li, Haitao Yu