Sciweavers

WWW
2011
ACM

Web scale NLP: a case study on url word breaking

This paper uses the URL word breaking task as an example to elaborate on what we identify as crucial in designing statistical natural language processing (NLP) algorithms for Web scale applications: (1) rudimentary multilingual capabilities to cope with the global nature of the Web, (2) multi-style modeling to handle the diverse language styles seen in Web content, (3) fast adaptation to keep pace with the dynamic changes of the Web, (4) minimal heuristic assumptions for generalizability and robustness, and (5) possibilities of efficient implementations and minimal manual effort for processing massive amounts of data at a reasonable cost. We first show that the state-of-the-art word breaking techniques can be unified and generalized under the Bayesian minimum risk (BMR) framework that, using a Web scale N-gram, can meet the first three requirements. We discuss how the existing techniques can be viewed as introducing additional assumptions to the basic BMR framework, and describe a generic ye...
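To make the word breaking task concrete, here is a minimal sketch of splitting an unspaced URL token into words with a language model and dynamic programming. It is not the paper's BMR formulation: it assumes a toy unigram model in place of a Web scale N-gram, and the names `TOY_COUNTS`, `log_prob`, and `break_words`, the counts themselves, and the unseen-word penalty are all hypothetical choices made for illustration.

```python
import math

# Hypothetical toy unigram counts, invented purely for illustration;
# the paper relies on a Web scale N-gram model instead.
TOY_COUNTS = {
    "web": 50, "scale": 20, "word": 40, "breaking": 10,
    "break": 15, "king": 12, "in": 80, "g": 1,
}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of a single word


def log_prob(word):
    """Log-probability of a word under the toy unigram model."""
    count = TOY_COUNTS.get(word)
    if count is None:
        # Crude penalty for unseen strings that grows with length (an assumption).
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def break_words(text):
    """Return the highest-scoring segmentation of text under the toy model."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + log_prob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(break_words("webscalewordbreaking"))
```

With these toy counts, the call above yields `['web', 'scale', 'word', 'breaking']`. A unigram model cannot use context to choose between competing splits, which is one reason a higher-order N-gram, as the abstract describes, is attractive.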
Added 15 May 2011
Updated 15 May 2011
Type Conference
Year 2011
Where WWW
Authors Kuansan Wang, Christopher Thrasher, Bo-June Paul Hsu