Unknown Word Extraction for Chinese Documents

13 years 4 months ago

Download acl.ldc.upenn.edu

There is no blank to mark word boundaries in Chinese text. As a result, identifying words is difficult, because of segmentation ambiguities and occurrences of unknown words. Conventionally unknown words were extracted by statistical methods because statistical methods are simple and efficient. However the statistical methods without using linguistic knowledge suffer the drawbacks of low precision and low recall, since character strings with statistical significance might be phrases or partial phrases instead of words and low frequency new words are hardly identifiable by statistical methods. In addition to statistical information, we try to use as much information as possible, such as morphology, syntax, semantics, and world knowledge. The identification system fully utilizes the context and content information of unknown words in the steps of detection process, extraction process, and verification process. A practical unknown word extraction system was implemented which online identi...

Keh-Jiann Chen, Wei-Yun Ma

Real-time Traffic