In this paper, we study the problem of effective keyword search over XML documents. We begin by introducing the notion of Valuable Lowest Common Ancestor (VLCA) to accurately and ...
Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm ASI1, which uses explicitly modeling of the s...
Abstract. The structural heterogeneity and complexity of XML repositories makes query formulation challenging for users who have little knowledge of XML. To assist its users, an XM...
For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information...
This paper offers a novel look at using a dimensionalityreduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algo...