We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...
Recent advances in storage technology make it possible to store a series of large Web archives. It is now an exciting challenge for us to observe the evolution of the Web. In this pap...
Social media websites promote diverse user interaction on media objects as well as user actions with respect to other users. The goal of this work is to discover community structu...
Yu-Ru Lin, Jimeng Sun, Paul Castro, Ravi B. Konuru...
In microblogging services such as Twitter, users may become overwhelmed by the raw data. One solution to this problem is the classification of short text messages. As short te...
Bharath Sriram, Dave Fuhry, Engin Demir, Hakan Fer...
Access to on-line information via the Web is exploding. Index and retrieval engines have already started to integrate a huge variety of heterogeneous repositories. However, the heterogen...
Boris Chidlovskii, Uwe M. Borghoff, Pierre-Yves Ch...