Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and ...
It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the t...
We describe Thresher, a system that lets non-technical users teach their browsers how to extract semantic web content from HTML documents on the World Wide Web. Users specify exam...
Web pages contain a combination of unique content and template material, which is present across multiple pages and used primarily for formatting, navigation, and branding. We stu...
We describe an evaluation of result set filtering techniques for providing ultra-high precision in the task of presenting related news for general web queries. In this task, the n...
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury...