
HICSS 1999 (IEEE)

Collaborative Web Crawling: Information Gathering/Processing over Internet

The main objective of the IBM Grand Central Station (GCS) is to gather information in virtually any format (text, data, image, graphics, audio, video) from cyberspace, to process, index, and summarize that information, and to push the right information to the right people. Because of the very large scale of cyberspace, parallel processing in both crawling/gathering and information processing is indispensable. In this paper, we present a scalable method for collaborative web crawling and information processing. The method includes an automatic cyberspace partitioner designed to dynamically balance and re-balance the load among processors. It can be used when all web crawlers are located on a tightly coupled high-performance system as well as when they are scattered across a distributed environment. We have implemented our algorithms in Java.
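The abstract does not spell out the partitioning algorithm, but as a rough illustration of the idea, the following is a minimal Java sketch of one way a URL-space partitioner with dynamic re-balancing might be built, using consistent hashing of host names onto crawler nodes. All class and method names here are hypothetical and are not taken from the paper.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Minimal sketch of a collaborative-crawl partitioner (hypothetical;
 * not the paper's actual algorithm). URLs are assigned to crawler
 * nodes by consistent-hashing their host names, so adding or removing
 * a node re-balances only a fraction of the URL space.
 */
public class UrlSpacePartitioner {
    // Hash ring: ring position -> crawler node id.
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes; // replicas per node to smooth the load

    public UrlSpacePartitioner(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Add a crawler node; only hosts hashing near its positions move. */
    public void addNode(String nodeId) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(nodeId + "#" + i), nodeId);
        }
    }

    /** Remove a node; its share of the URL space falls to its successors. */
    public void removeNode(String nodeId) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(nodeId + "#" + i));
        }
    }

    /** Route a URL to the node owning the first ring position at or after its hash. */
    public String nodeFor(String url) {
        if (ring.isEmpty()) throw new IllegalStateException("no crawler nodes");
        SortedMap<Integer, String> tail = ring.tailMap(hash(hostOf(url)));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static String hostOf(String url) {
        // Partition by host so a single node handles all pages of one site,
        // keeping politeness and DNS state local to a single crawler.
        int start = url.indexOf("://");
        String rest = start >= 0 ? url.substring(start + 3) : url;
        int slash = rest.indexOf('/');
        return slash >= 0 ? rest.substring(0, slash) : rest;
    }

    private static int hash(String s) {
        // Spread String.hashCode() bits; a production system would use
        // a stronger hash such as MD5 or SHA-1.
        int h = s.hashCode();
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        UrlSpacePartitioner p = new UrlSpacePartitioner(64);
        p.addNode("crawler-1");
        p.addNode("crawler-2");
        p.addNode("crawler-3");
        System.out.println(p.nodeFor("http://www.ibm.com/research"));
        p.addNode("crawler-4"); // dynamic re-balance: only ~1/4 of hosts move
        System.out.println(p.nodeFor("http://www.ibm.com/research"));
    }
}
```

With virtual nodes, each crawler owns many small arcs of the ring, so load stays roughly even and a node joining or leaving moves only its own arcs rather than re-partitioning the whole URL space.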
Type: Conference
Year: 1999
Where: HICSS
Authors: Shang-Hua Teng, Qi Lu, Matthias Eichstaedt, Daniel Alexander Ford, Tobin J. Lehman