Sciweavers

DEXAW
2008
IEEE

Text Extraction from the Web via Text-to-Tag Ratio

We describe a method to extract content text from diverse Web pages by using the HTML document’s Text-to-Tag Ratio rather than specific HTML cues that may not be constant across various Web pages. We describe how to compute the Text-to-Tag Ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we then show surprisingly high levels of recall for all levels of precision, and a large space savings.
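
The line-by-line idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula or the clustering procedure, so everything here is an assumption made for illustration: the ratio is taken as non-tag characters divided by the number of tags on the line (with the divisor floored at 1 for tag-free lines), tags are matched with a simple regular expression rather than a real HTML parser, and the "clustering" is replaced by a plain threshold at the mean ratio. The names `text_to_tag_ratio` and `extract_content` are hypothetical.

```python
import re
from statistics import mean

# Simplistic tag matcher (assumption): anything between '<' and '>'.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Non-tag characters on the line divided by the number of tags.

    Assumption: a line with no tags uses a divisor of 1, so its ratio
    is simply its text length.
    """
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    return text_length / max(tag_count, 1)


def extract_content(html: str) -> list[str]:
    """Keep the text of lines whose ratio exceeds the mean ratio.

    The mean threshold is a stand-in for the clustering step, not the
    procedure used in the paper.
    """
    lines = [ln for ln in html.splitlines() if ln.strip()]
    if not lines:
        return []
    ratios = [text_to_tag_ratio(ln) for ln in lines]
    threshold = mean(ratios)
    return [
        TAG_RE.sub("", ln).strip()
        for ln, ratio in zip(lines, ratios)
        if ratio > threshold
    ]


if __name__ == "__main__":
    sample = "\n".join([
        "<html><head><title>T</title></head>",
        '<body><div><a href="/">Home</a><a href="/x">X</a></div>',
        "<p>This is a long paragraph of article text that carries the actual content.</p>",
        "<div><span>ad</span></div>",
        "</body></html>",
    ])
    for content_line in extract_content(sample):
        print(content_line)
```

In this sketch, navigation and markup-heavy lines get low ratios because they pack many tags around little text, while a paragraph of prose wrapped in a single tag pair gets a high ratio, which is the intuition the abstract relies on.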
Tim Weninger, William H. Hsu
Added 29 May 2010
Updated 29 May 2010
Type Conference
Year 2008
Where DEXAW
Authors Tim Weninger, William H. Hsu