Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese

Compilation of a 100-million-word balanced corpus called the Balanced Corpus of Contemporary Written Japanese (BCCWJ) is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres, including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, internet text (bulletin boards and blogs), and so forth; where possible, samples are drawn from rigidly defined statistical populations by random sampling. All texts are dually POS-analyzed based upon two different, but mutually related, definitions of 'word.' Currently, more than 90 million words have been sampled and XML-annotated with respect to text structure and lexical and character information. A preliminary linear discriminant analysis of text genres using POS frequencies and sentence length revealed that it was possible to classify the text genres with a correct identification rate of 88% as far as the samp...
Kikuo Maekawa, Makoto Yamazaki, Takehiko Maruyama,
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where LREC
Authors Kikuo Maekawa, Makoto Yamazaki, Takehiko Maruyama, Masaya Yamaguchi, Hideki Ogura, Wakako Kashino, Toshinobu Ogiso, Hanae Koiso, Yasuharu Den