Towards a Balanced Named Entity Corpus for Dutch

Download www.lrec-conf.org

This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition (NER) systems need a substantial annotated corpus for training. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments, conducted to corroborate the reliability of the guidelines, yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the ...
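
For readers unfamiliar with the Kappa statistic mentioned in the abstract, the following is a minimal sketch of how agreement between two annotators could be computed. It assumes Cohen's kappa for two annotators; the abstract does not say which Kappa variant the authors used. The function name `cohen_kappa`, the label set, and the two label sequences are hypothetical and are not taken from the paper or from SoNaR.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: two annotators, one categorical label per item.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two
    # annotators' marginal proportions, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level annotations (illustrative labels only).
annotator_a = ["PER", "LOC", "ORG", "O", "O", "PER", "LOC", "O", "ORG", "O"]
annotator_b = ["PER", "LOC", "ORG", "O", "O", "PER", "ORG", "O", "ORG", "O"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so a score above 0.90 indicates that annotators following the guidelines agree far more often than their label distributions alone would predict.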

Bart Desmet, Véronique Hoste

Education | Entity Annotation Guidelines | LREC 2010 | Named Entity Recognition | Substantial Annotated Corpus |

» A Named Entity Recognition System for Dutch

» Towards the Annotation of Named Entities in the National Corpus of Polish

» Interacting Semantic Layers of Annotation in SoNaR a Reference Corpus of Contemporary Writ...

» Entity Mention Detection using a Combination of Redundancy-Driven Classifiers

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2010
Where	LREC
Authors	Bart Desmet, Véronique Hoste
