Building a Web Corpus of Czech


Large corpora are essential to modern methods of computational linguistics and natural language processing. In this paper, we describe an ongoing project whose aim is to build the largest corpus of Czech texts. We are building the corpus from Czech web pages, using (and, where needed, developing) advanced tools for downloading, cleaning, and automatic linguistic processing. Our concern is to keep the whole process language-independent and thus also applicable to building web corpora of other languages. In the paper, we briefly describe the crawling, cleaning, and part-of-speech tagging procedures. Using a prototype corpus, we provide a comparison with current corpora (in particular, SYN2005, part of the Czech National Corpus). We analyse the part-of-speech tag distribution, the out-of-vocabulary (OOV) word ratio, the average sentence length, and the Spearman rank correlation coefficient of the distance of ranks of the 500 most frequent words. Our results show that our prototype corpus is now quite homogeneous. The challeng...
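As an illustration of two of the measures named in the abstract, here is a minimal Python sketch of a Spearman rank correlation over the most frequent words of one corpus and of an OOV word ratio. It is not the authors' procedure, which this page does not describe. The function names (`average_ranks`, `spearman_top_n`, `oov_ratio`) are hypothetical, and the handling of ties (shared mean rank) and of words missing from the second corpus (frequency 0, so they rank last) are assumptions.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values in descending order (1 = largest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / (sd_x * sd_y)


def spearman_top_n(tokens_a, tokens_b, n=500):
    """Spearman correlation between the ranks that the n most frequent
    words of corpus A have in corpus A and in corpus B.

    Assumption: a word absent from corpus B gets frequency 0 there.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    top_words = [word for word, _ in freq_a.most_common(n)]
    ranks_a = average_ranks([freq_a[w] for w in top_words])
    ranks_b = average_ranks([freq_b[w] for w in top_words])
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return pearson(ranks_a, ranks_b)


def oov_ratio(tokens, vocabulary):
    """Share of tokens that are not in the given vocabulary."""
    if not tokens:
        raise ValueError("token list is empty")
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)
```

With tokenised corpora held as lists of word forms, `spearman_top_n(web_tokens, reference_tokens, n=500)` would give one number between -1 and 1; values near 1 mean the most frequent words are ordered similarly in both corpora.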


Czech National Corpora | Education | LREC 2010 | Prototype Corpus | Spearman Rank Correlation Coefficient |


» Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9

» Ways of Evaluation of the Annotators in Building the Prague Czech-English Dependency Treeba...

» Dialogue, Speech and Images: The Companions Project Data Set

» Building a Bilingual ValLex Using Treebank Token Alignment: First Observations

» Building an Italian FrameNet through Semiautomatic Corpus Analysis

» Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification System...

» Rapid Bootstrapping of Five Eastern European Languages Using the Rapid Language Adaptation...

» A Figure of Merit for the Evaluation of Web-Corpus Randomness

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2010
Where	LREC
Authors	Drahomíra "johanka" Spoustová, Miroslav Spousta, Pavel Pecina
