EACL
2006
ACL Anthology

Large Linguistically-Processed Web Corpora for Multiple Languages

The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmatise and part-of-speech tag the corpus, and load the data into a corpus query tool which supports sophisticated linguistic queries. We have now done this for German and Italian, with corpus sizes of over 1 billion words in each case. We provide Web access to the corpora in our query tool, the Sketch Engine.

Marco Baroni, Adam Kilgarriff
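
The abstract mentions removing duplicates and near-duplicates from crawled pages but does not say how. As an illustration only, the sketch below shows one common way such a filter can work: compare pages by the Jaccard overlap of their word n-gram "shingles". The function names, the regex tokenizer, the shingle size and the 0.5 threshold are all assumptions made for this example, not details taken from the paper.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, n=3):
    """Return the set of overlapping n-token windows in a token list."""
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.5, n=3):
    """Keep a page only if it is not too similar to any page kept so far.

    threshold and n are hypothetical defaults chosen for this example.
    """
    kept, kept_shingles = [], []
    for page in pages:
        s = shingles(tokenize(page), n)
        if not s:
            continue  # skip pages with no word tokens at all
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue  # exact or near-duplicate of an earlier page
        kept.append(page)
        kept_shingles.append(s)
    return kept


pages = [
    "The Web contains vast amounts of linguistic data.",
    "The Web contains vast amounts of linguistic data.",
    "The Web contains vast amounts of linguistic data today.",
    "Commercial search engines give highly compromised access.",
]
unique_pages = deduplicate(pages)
```

This pairwise comparison is quadratic in the number of pages, so it only suits small collections; at the billion-word scale the abstract describes, a real pipeline would need a more scalable fingerprinting scheme.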

Corpus Query Tool | EACL 2006 | Linguistic Data | Natural Language Processing | Sophisticated Linguistic Queries

Post Info

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2006
Where	EACL
Authors	Marco Baroni, Adam Kilgarriff