
PAKDD
2009
ACM

Scalable Web Mining with Newistic

Abstract. Newistic is a web mining platform that collects and analyses documents crawled from the Internet. Although it currently processes news articles, it can easily be adapted to other forms of text. The system performs three data mining functions: categorization, clustering, and named entity extraction. The main design concern is scalability, which is achieved through a modular architecture that allows multiple instances of the same component to run in parallel. This paper presents a novel algorithm for analysing web pages that tries to determine the title and text of a news item directly from the HTML code, discarding noise such as menus, ads, and copyright notices. Another contribution is the application of the Quality Threshold clustering algorithm to document clustering. Additionally, the algorithm has been optimized to increase its speed.
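For readers unfamiliar with Quality Threshold (QT) clustering, the following is a minimal sketch of the textbook form of the algorithm, not the optimized variant the abstract refers to. The function name `qt_cluster`, its parameters, and the use of Euclidean distance on 2-D points are assumptions made for illustration; a document clustering system would more plausibly use a distance over term vectors.

```python
import math


def qt_cluster(points, max_diameter, dist):
    """Textbook Quality Threshold clustering (illustrative sketch).

    points:       list of items to cluster
    max_diameter: largest allowed distance between any two members of a cluster
    dist:         function returning the distance between two items
    Returns a list of clusters, each a sorted list of indices into `points`.
    """
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = []
        # Grow one candidate cluster from every remaining point.
        for seed in remaining:
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                # Distance from p to its farthest current member. Adding the
                # point with the smallest value gives the smallest new diameter.
                def reach(p):
                    return max(dist(points[p], points[m]) for m in candidate)

                nxt = min(others, key=reach)
                if reach(nxt) > max_diameter:
                    break  # any further point would violate the threshold
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        # Keep the largest candidate, remove its points, and repeat.
        clusters.append(sorted(best))
        remaining = [i for i in remaining if i not in best]
    return clusters


# Hypothetical toy data: two tight groups and one outlier.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(qt_cluster(pts, 2.0, math.dist))  # [[0, 1, 2], [3, 4], [5]]
```

Unlike k-means, this approach needs no preset number of clusters, only a diameter threshold. The cost is that every round regrows a candidate from each remaining point, which is presumably why a speed optimization is worth reporting.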
Ovidiu Dan, Horatiu Mocian
Added: 20 May 2010
Updated: 20 May 2010
Type: Conference
Year: 2009
Where: PAKDD
Authors: Ovidiu Dan, Horatiu Mocian