Sciweavers

WWW
2009
ACM

Incorporating site-level knowledge to extract structured data from web forums

14 years 5 months ago
Incorporating site-level knowledge to extract structured data from web forums
Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages is still a challenging task due to both complex page layout designs and unrestricted user created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our target is to find a solution as general as possible to extract structured data, such as post title, post author, post time, and post content from any forum site. In contrast to most existing information extraction methods, which only leverage the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidence by learning their importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. T...
Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei
Added 21 Nov 2009
Updated 21 Nov 2009
Type Conference
Year 2009
Where WWW
Authors Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, Wei-Ying Ma
Comments (0)