HT 2003 (ACM)

Untangling compound documents on the web

Most text analysis is designed to deal with the concept of a “document”, namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of “document” and “web node” are not synonymous, and that authors often deploy documents as collections of URLs, which we call “compound documents”. In this paper we present new techniques for identifying and working with such compound documents, and the results of some large-scale studies on such web documents. The primary motivation for this work is that information retrieval techniques are better suited to working on documents than on individual hypertext nodes.

Nadav Eiron, Kevin S. McCurley

Added	05 Jul 2010
Updated	05 Jul 2010
Type	Conference
Year	2003
Where	HT
Authors	Nadav Eiron, Kevin S. McCurley
