
WWW
2005
ACM

Extracting semantic structure of web documents using content and visual information

This work aims to provide a page segmentation algorithm that uses both visual and content information to extract the semantic structure of a web page. The visual information is obtained with the VIPS algorithm, and the content information with a pre-trained Naive Bayes classifier. The output of the algorithm is a semantic structure tree whose leaves represent segments that each have a unique topic. However, the contents of a leaf segment may be physically distributed across the web page. This structure can be useful in many web applications, such as information retrieval, information extraction, and automatic web page adaptation. The algorithm is expected to outperform existing page segmentation algorithms because it utilizes both content and visual information.
Categories and Subject Descriptors: H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia
General Terms: Algorithms, Design.
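The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the visual step (VIPS) is not reproduced, so the `blocks` list stands in for blocks that a visual segmenter has already produced; the `NaiveBayes` class, the training sentences, the topic labels, and the `build_semantic_tree` helper are all hypothetical; and the tree is flattened to one level, since the abstract does not describe the actual merging procedure.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def build_semantic_tree(blocks, classifier):
    """Group visual blocks by predicted topic.

    Each leaf holds the ids of all blocks sharing one topic, so a leaf may
    collect blocks that are not adjacent on the page.
    """
    leaves = defaultdict(list)
    for block_id, text in blocks:
        leaves[classifier.predict(text)].append(block_id)
    return {"page": dict(leaves)}


# Hypothetical training data standing in for the pre-trained classifier.
classifier = NaiveBayes().fit(
    ["football match goal team score",
     "election vote parliament minister",
     "stock market shares price profit"],
    ["sports", "politics", "business"],
)

# Hypothetical blocks, assumed to come from a visual segmenter, in page order.
blocks = [
    ("b1", "team wins match with late goal"),
    ("b2", "minister calls early election"),
    ("b3", "final score after the match"),
]

tree = build_semantic_tree(blocks, classifier)
print(tree)
```

In this toy input, blocks `b1` and `b3` are separated on the page by `b2` yet land in the same leaf, which mirrors the abstract's remark that the contents of a leaf segment may be physically distributed.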
Rupesh R. Mehta, Pabitra Mitra, Harish Karnick
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2005
Where WWW
Authors Rupesh R. Mehta, Pabitra Mitra, Harish Karnick