Learning Deep Web Crawling with Diverse Features

13 years 11 months ago


Abstract: The key to Deep Web crawling is to submit promising keywords to a query form and retrieve Deep Web content efficiently. To select keywords, existing methods rely on statistical information derived from term frequency (TF) and document frequency (DF) in locally acquired records. They therefore work well only on textual databases that provide full-text search interfaces, and poorly on structured databases with multi-attribute or field-restricted search interfaces. This paper proposes a novel Deep Web crawling method. Each keyword is encoded as a tuple of its linguistic, statistical, and HTML features, so that a harvest rate evaluation model can be learned from the keywords already issued and applied to those not yet issued. The method removes the assumption of plain-text search made by existing methods. Experimental results show that the method outperforms state-of-the-art methods.

Keywords: Hidden Web; Deep Web surfacing; machine learning
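The idea of learning a harvest rate model from issued keywords can be illustrated with a minimal sketch. The abstract does not specify the paper's feature set or learning algorithm, so everything below is an assumption made for illustration: the three features (`is_noun`, `doc_freq`, `in_option_tag`), the plain least-squares linear model, and the toy harvest rates are all hypothetical and are not the authors' method or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordFeatures:
    """Hypothetical feature tuple for one keyword (not the paper's actual features)."""
    is_noun: float        # linguistic feature: 1.0 if the keyword is a noun
    doc_freq: float       # statistical feature: share of local records containing it
    in_option_tag: float  # HTML feature: 1.0 if seen inside a form <option> element

    def as_vector(self):
        # Leading 1.0 is the bias term of the linear model.
        return [1.0, self.is_noun, self.doc_freq, self.in_option_tag]


def predict(weights, vector):
    """Estimated harvest rate: dot product of weights and feature vector."""
    return sum(w * x for w, x in zip(weights, vector))


def fit_linear(samples, targets, lr=0.1, epochs=2000):
    """Fit least-squares weights with batch gradient descent."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(samples, targets):
            err = predict(weights, x) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


def rank_candidates(weights, candidates):
    """Order un-issued keywords by estimated harvest rate, best first."""
    return sorted(
        candidates,
        key=lambda kw: predict(weights, candidates[kw].as_vector()),
        reverse=True,
    )


# Toy data: keywords already issued, with made-up observed harvest rates.
issued = {
    "java": (KeywordFeatures(1.0, 0.4, 0.0), 0.5),
    "the": (KeywordFeatures(0.0, 0.9, 0.0), 0.1),
    "fiction": (KeywordFeatures(1.0, 0.2, 1.0), 0.8),
    "of": (KeywordFeatures(0.0, 0.8, 0.0), 0.1),
}
# Keywords not yet issued; only their features are known.
unissued = {
    "and": KeywordFeatures(0.0, 0.85, 0.0),
    "history": KeywordFeatures(1.0, 0.3, 1.0),
}

weights = fit_linear(
    [feats.as_vector() for feats, _ in issued.values()],
    [rate for _, rate in issued.values()],
)
ranking = rank_candidates(weights, unissued)
print(ranking)
```

In this toy setup the crawler would issue the top-ranked keyword next, record the harvest rate it actually observes, and refit the model with the enlarged set of issued keywords.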

Lu Jiang, Zhaohui Wu, Qinghua Zheng, Jun Liu
