Sciweavers

311 search results - page 10 / 63
» Cleaning Web Pages for Effective Web Content Mining
ACSW
2004
14 years 10 months ago
Discovering Parallel Text from the World Wide Web
A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics and mul...
Jisong Chen, Rowena Chau, Chung-Hsing Yeh
DOCENG
2009
ACM
15 years 4 months ago
Web document text and images extraction using DOM analysis and natural language processing
Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing. Parag Mulendra Joshi, Sam Liu. HP Laboratories, HPL-2009-187. Web page text extraction,...
Parag Mulendra Joshi, Sam Liu
COLCOM
2008
IEEE
14 years 11 months ago
Web Canary: A Virtualized Web Browser to Support Large-Scale Silent Collaboration in Detecting Malicious Web Sites
Malicious Web content poses a serious threat to the Internet, organizations and users. Current approaches to detecting malicious Web content employ high-powered honey cli...
Jiang Wang, Anup K. Ghosh, Yih Huang
ICDAR
2003
IEEE
15 years 2 months ago
Identifying Story and Preview Images in News Web Pages
The World Wide Web provides an increasingly powerful and popular publication mechanism. Web documents often contain a large number of images serving various different purposes. Th...
Jianying Hu, Amit Bagga
KDD
2006
ACM
15 years 9 months ago
Understanding Content Reuse on the Web: Static and Dynamic Analyses
In this paper we present static and dynamic studies of duplicate and near-duplicate documents in the Web. The static and dynamic studies involve the analysis of similar c...
Ricardo A. Baeza-Yates, Álvaro R. Pereira J...