This paper presents a comparison among several well-known machine learning techniques when they are used to carry out a one-session ahead prediction of page categories. We use reco...
We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amo...
This paper presents Carnegie Mellon University’s experiments on the mixed named-page and homepage finding task of the TREC 12 Web Track. Our results were strong; we achieved the...
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrele...
Abstract. Malicious Web content poses a serious threat to the Internet, organizations and users. Current approaches to detecting malicious Web content employ high-powered honey cli...