Clustering Template Based Web Documents

A growing share of documents on the World Wide Web is based on templates. On a technical level, this gives those documents very similar source code and DOM tree structure. Grouping documents that are based on the same template is an important task for applications that analyse the template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. By combining those distance measures with different clustering approaches, we show which combination of methods leads to the desired result.

As more and more documents on the World Wide Web are generated automatically by Content Management Systems (CMS), more and more of them are based on templates. Templates can be seen as framework documents which are filled with different contents to compile the final documents. They are a standard (if not essential) CMS technology. Templates provide the managed web sites wit...
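
To make the idea of template clustering concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the paper: the distance measure (one minus the difflib similarity ratio of two tag sequences), the single-linkage grouping, the 0.3 threshold, and the names tag_sequence, tag_distance and cluster are all assumptions chosen for the example. The paper's own measures and clustering methods may differ.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tag_distance(html_a, html_b):
    """Illustrative distance in [0, 1]; 0 means identical tag sequences."""
    matcher = SequenceMatcher(
        None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False
    )
    return 1.0 - matcher.ratio()


def cluster(documents, threshold=0.3):
    """Single-linkage grouping: documents end up in the same cluster when
    they are linked by a chain of pairwise distances <= threshold.
    The threshold value is an assumption, not a recommended setting."""
    parent = list(range(len(documents)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            if tag_distance(documents[i], documents[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(documents)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling cluster on a list of HTML strings returns lists of document indices, one list per cluster. Comparing only tag names ignores the text content, which is the point here: two pages filled from the same template differ in content but share most of their markup.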
Thomas Gottron
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where ECIR
Authors Thomas Gottron