Mining Bilingual Data from the Web with Adaptively Learnt Patterns


Download www.aclweb.org

Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross-language information retrieval. In this paper, based on the observation that bilingual data in many web pages appear collectively following similar patterns, an adaptive pattern-based bilingual data mining method is proposed. Specifically, given a web page, the method contains four steps: 1) preprocessing: parse the web page into a DOM tree and segment the inner text of each node into snippets; 2) seed mining: identify potential translation pairs (seeds) using a word-based alignment model which takes both translation and transliteration into consideration; 3) pattern learning: learn generalized patterns with the identified seeds; 4) pattern-based mining: extract all bilingual data in the page using the learned patterns. Our experiments on Chinese web pages produced more than 7.5 million pairs of bilingual sentences and more than 5 mill...
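The four steps in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical two-entry lexicon (`TOY_LEXICON`) in place of the word-based alignment model, treats each text node as a single snippet, and reduces a "pattern" to the most frequent separator found between a Chinese seed and its English translation. All function names here are made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stand-in for the alignment model: a tiny translation lexicon.
TOY_LEXICON = {"你好": "hello", "谢谢": "thank you"}

CJK = r"[\u4e00-\u9fff]+"                       # a run of Chinese characters
LATIN = r"[A-Za-z][A-Za-z ]*[A-Za-z]|[A-Za-z]"  # an English word or phrase


class TextNodes(HTMLParser):
    """Step 1 (simplified): collect the inner text of each node."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


def find_seeds(text):
    """Step 2: find known pairs and record the separator between them."""
    seeds = []
    for zh, en in TOY_LEXICON.items():
        m = re.search(
            re.escape(zh) + r"([^A-Za-z\u4e00-\u9fff]+)" + re.escape(en), text
        )
        if m:
            seeds.append((zh, en, m.group(1)))
    return seeds


def learn_pattern(seeds):
    """Step 3: generalize the most common seed separator into a regex."""
    sep = Counter(s[2] for s in seeds).most_common(1)[0][0]
    return re.compile(f"({CJK}){re.escape(sep)}({LATIN})")


def mine(html):
    """Step 4: apply the learned pattern to every text node that has seeds."""
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    pairs = []
    for text in parser.texts:
        seeds = find_seeds(text)
        if seeds:
            pairs.extend(learn_pattern(seeds).findall(text))
    return pairs


PAGE = "<ul><li>你好 (hello); 谢谢 (thank you); 再见 (goodbye); 早安 (good morning)</li></ul>"

if __name__ == "__main__":
    for zh, en in mine(PAGE):
        print(zh, "->", en)
```

The point of the sketch is that the two seed pairs are enough to learn the page's layout convention, so pairs missing from the lexicon (such as the last two in `PAGE`) are still extracted by the learned pattern.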

Long Jiang, Shiquan Yang, Ming Zhou, Xiaohua Liu, Qingsheng Zhu


ACL 2009 | Bilingual | Bilingual Data | Computational Linguistics | Web Pages


» An Intelligent Web Agent to Mine Bilingual Parallel Pages via Automatic Discovery of URL P...

» Mining Translations of Web Queries from Web Clickthrough Data

» The Reconstruction of User Sessions from a Server Log Using Improved Time-Oriented Heuristi...

» Data Mining for Web Personalization

» Business Intelligence from Web Usage Mining

» Dynamic and Scalable Evolutionary Data Mining: An Approach Based on a Self-Adaptive Multiple...

» Warehousing and Mining Web Logs

» An adaptive website system to improve efficiency with web mining techniques

Post Info

Added	16 Feb 2011
Updated	16 Feb 2011
Type	Journal
Year	2009
Where	ACL
Authors	Long Jiang, Shiquan Yang, Ming Zhou, Xiaohua Liu, Qingsheng Zhu


Sciweavers
