This paper presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) t...
In this paper we propose a domainindependent text segmentation method, which consists of three components. Latent Dirichlet allocation (LDA) is employed to compute words semantic ...
In this paper, the authors address the significance and complexityof tokenization, the beginning step of NLP. Notions of word and token are discussed and defined from the viewpoin...
This paper proposes a new approach for the automatic extraction of bilingual terms from a domain-specific bilingual parallel corpus. We combine existing monolingual term extractor...
With the rapidly growing use of the audio and multimedia information over the Internet, the technology for retrieving speech information using voice queries is becoming more and mo...