Sciweavers


ISMIS
2005
Springer


Identifying Content Blocks from Web Documents



Intelligent information processing systems, such as digital libraries and search engines, index web-pages according to their informative content. However, web-pages also contain non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is therefore important to separate the informative “primary content blocks” from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the “primary content blocks” based on their features. Neither algorithm requires supervised learning, yet both identify the “primary content blocks” with high precision and recall. Operating on several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the Entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.
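The abstract reports results in terms of precision and recall. As a minimal sketch of how those two measures could be computed for block identification, the Python below compares a set of predicted primary content blocks against a hand-labelled set. The function name `precision_recall`, the block identifiers, and the example data are all hypothetical and are not taken from the paper; this is not the FeatureExtractor or K-FeatureExtractor algorithm itself.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of block identifiers.

    predicted: blocks an algorithm marked as primary content (assumed input)
    actual:    blocks a human labelled as primary content (assumed input)
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page with blocks b1..b5, of which b2 and b3 are primary content.
actual_primary = {"b2", "b3"}
predicted_primary = {"b2", "b3", "b5"}  # b5 is a sidebar selected by mistake

p, r = precision_recall(predicted_primary, actual_primary)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=1.00
```

In this made-up case every true content block is found (recall 1.00), but one of the three selected blocks is non-informative, so precision drops to 0.67.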

Sandip Debnath, Prasenjit Mitra, C. Lee Giles


Engines Index Web-pages | ISMIS 2005 | Non-informative Blocks | Primary Content Blocks


Related Content

» Identifying primary content from web pages and its application to web search ranking

» Discovering informative content blocks from Web documents

» Expected Utility of Content Blocks in Web Content Extraction

» As we may perceive inferring logical documents from hypertext

» Revealing Hidden Community Structures and Identifying Bridges in Complex Networks An Appli...

» Automatic Extraction of Data Points and Text Blocks from 2Dimensional Plots in Digital Doc...

» Identifying Story and Preview Images in News Web Pages

» Cleaning Web Pages for Effective Web Content Mining

» Postal Address Detection from Web Documents

Post Info

Added	27 Jun 2010
Updated	27 Jun 2010
Type	Conference
Year	2005
Where	ISMIS
Authors	Sandip Debnath, Prasenjit Mitra, C. Lee Giles
