Sciweavers

DEXAW
2008
IEEE

Text Extraction from the Web via Text-to-Tag Ratio

We describe a method for extracting content text from diverse Web pages that uses the HTML document’s Text-to-Tag Ratio rather than specific HTML cues, which may not be consistent across Web pages. We describe how to compute the Text-to-Tag Ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we show surprisingly high levels of recall at all levels of precision, along with large space savings.
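The line-by-line computation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula or the clustering method, so everything specific here is an assumption: the ratio is taken as the number of non-tag characters divided by the number of tags on the line (with a divisor of 1 when the line has no tags), and a simple standard-deviation threshold stands in for the clustering step. The names `text_to_tag_ratio` and `extract_content` are hypothetical.

```python
import re
from statistics import pstdev

# Naive tag matcher; an assumption that is good enough for a sketch,
# not a full HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line):
    """Return non-tag characters divided by tag count for one line.

    Assumption: a line without tags uses a divisor of 1, so its ratio
    is simply its text length.
    """
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    return text_length / max(tag_count, 1)


def extract_content(html, threshold=None):
    """Keep the text of lines whose ratio exceeds a threshold.

    Assumption: the population standard deviation of the ratios is used
    as the cut-off, as a stand-in for the clustering step.
    """
    lines = [ln for ln in html.splitlines() if ln.strip()]
    if not lines:
        return []
    ratios = [text_to_tag_ratio(ln) for ln in lines]
    if threshold is None:
        threshold = pstdev(ratios)
    return [
        TAG_RE.sub("", ln).strip()
        for ln, ratio in zip(lines, ratios)
        if ratio > threshold
    ]
```

Lines dominated by markup, such as navigation links, get a ratio near zero, while a paragraph wrapped in a single pair of tags gets a high ratio, which is what makes a per-line cut-off plausible in this sketch.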
Tim Weninger, William H. Hsu
Added 29 May 2010
Updated 29 May 2010
Type Conference
Year 2008
Where DEXAW