Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification

Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a “bag of words” and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using visual information, one is able to define heuristics for recognition of common page areas such as the header, left and right menus, footer, and center of a page. We show in initial experiments that, using our heuristics, the defined objects are recognized properly in 73% of cases. Finally, we show that a Naive Bayes classifier, taking into account the proposed representation, clearly outperforms the same classifier using only information about the content of documents.
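
As a rough illustration of the idea in the abstract, the sketch below stores screen coordinates on each node of a page tree and labels a node by where it sits on the page. It is not the authors' method: the class `VisualNode`, the function `guess_area`, and the threshold parameters `band` and `side` are all hypothetical, chosen only to show how a coordinate-based heuristic might look.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualNode:
    """Hypothetical HTML object annotated with browser screen coordinates."""
    tag: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int
    children: List["VisualNode"] = field(default_factory=list)


def guess_area(node: VisualNode, page_width: int, page_height: int,
               band: float = 0.2, side: float = 0.3) -> str:
    """Label a node by position. The thresholds are assumptions, not values from the paper."""
    top_limit = band * page_height
    bottom_limit = (1 - band) * page_height
    left_limit = side * page_width
    right_limit = (1 - side) * page_width

    if node.y + node.height <= top_limit:
        return "header"       # lies entirely in the top band
    if node.y >= bottom_limit:
        return "footer"       # starts inside the bottom band
    if node.x + node.width <= left_limit:
        return "left menu"    # lies entirely in the left strip
    if node.x >= right_limit:
        return "right menu"   # starts inside the right strip
    return "center"


# Example page of 1000 x 2000 pixels with four hypothetical objects.
banner = VisualNode("table", x=0, y=0, width=1000, height=150)
nav = VisualNode("td", x=0, y=400, width=200, height=800)
article = VisualNode("td", x=250, y=450, width=500, height=1000)
legal = VisualNode("p", x=0, y=1850, width=1000, height=100)
page = VisualNode("body", 0, 0, 1000, 2000, [banner, nav, article, legal])

labels = [guess_area(child, 1000, 2000) for child in page.children]
```

A classifier could then take the area label of each object as an extra feature alongside its words, which is the general direction the abstract describes; how the paper actually combines the two is not stated here.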

Milos Kovacevic, Michelangelo Diligenti, Marco Gori, Veljko M. Milutinovic

Data Mining | Heuristics Defined Objects | ICDM 2002 | Naive Bayes Classifier | Page |

» Searching the Web with Mobile Images for Location Recognition

» A comparison of implicit and explicit links for web page classification

» Knowing a web page by the company it keeps

» Extracting semantic structure of web documents using content and visual information

» Can Chinese web pages be classified with English data source

» The volume and evolution of web page templates

» Document Visualization on Small Displays

» Exploring Scalable Vector Graphics for Visually Impaired Users

Added	14 Jul 2010
Updated	14 Jul 2010
Type	Conference
Year	2002
Where	ICDM
Authors	Milos Kovacevic, Michelangelo Diligenti, Marco Gori, Veljko M. Milutinovic
