NAACL
2010

Language Identification: The Long and the Short of the Matter

Download www.aclweb.org

Language identification is the task of identifying the language a given document is written in. This paper describes a detailed examination of what models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents. We also show that it is possible to perform language identification without having to perform explicit character encoding detection.
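As a rough illustration of the last point in the abstract, the following sketch classifies documents directly from byte n-grams, so no character encoding has to be detected or decoded first. It is a minimal example under stated assumptions and not the paper's implementation: the cosine-similarity prototype classifier, the function names (`byte_ngrams`, `train`, `identify`), the bigram order, and the toy training sentences are all hypothetical choices made here for illustration.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams; the input is never decoded."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples: dict[str, list[bytes]], n: int = 2) -> dict[str, Counter]:
    """Build one aggregate byte n-gram profile per language."""
    profiles = {}
    for lang, docs in samples.items():
        profile = Counter()
        for doc in docs:
            profile.update(byte_ngrams(doc, n))
        profiles[lang] = profile
    return profiles


def identify(data: bytes, profiles: dict[str, Counter], n: int = 2) -> str:
    """Return the language whose profile is most similar to the document."""
    query = byte_ngrams(data, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (hypothetical). The German documents deliberately use
# two different encodings to show that the classifier only ever sees bytes.
samples = {
    "en": [
        "the quick brown fox jumps over the lazy dog".encode("utf-8"),
        "this is the house that the man built".encode("utf-8"),
    ],
    "de": [
        "der schnelle braune fuchs springt über den faulen hund".encode("latin-1"),
        "das ist das haus, das der mann gebaut hat".encode("utf-8"),
    ],
}
profiles = train(samples)

english_guess = identify("the cat is on the mat".encode("utf-8"), profiles)
german_guess = identify("das ist der hund".encode("utf-8"), profiles)
```

With so little training data the result is only indicative. The abstract's findings suggest accuracy would drop further as documents get shorter or as more languages are added.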

Timothy Baldwin, Marco Lui

Added	14 Feb 2011
Updated	14 Feb 2011
Type	Journal
Year	2010
Where	NAACL
Authors	Timothy Baldwin, Marco Lui