Abstract. Searching specialized collections, such as biomedical literature, typically requires intimate knowledge of a specialized terminology. Hence, it can be a disappointing exp...
In a corpus of jokes, a human might judge two documents to be the "same joke" even if characters, locations, and other details are varied. A given joke could be retold w...
This paper describes how use the HTMLEditorKit to perform web data mining on EDGAR (Electronic Data-Gathering, Analysis, and Retrieval system). EDGAR is the SEC's (U.S. Secur...
In a data warehousing process, the data preparation phase is crucial. Mastering this phase allows substantial gains in terms of time and performance when performing a multidimensio...
Traditional information retrieval techniques based on keyword search help to identify a ranked set of relevant documents, which often contains many documents in the top ranks that...