Next steps in near-duplicate detection for eRulemaking

Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents. This increases processing and storage costs but is rarely a serious problem. A more serious concern is that form letter customizations can include substantive issues that agencies are likely to overlook. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper c...
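To make the bag-of-words idea in the abstract concrete, here is a minimal Python sketch. It is not the DURIAN algorithm: it ignores metadata and content structure, and the function names, the cosine-similarity measure, the 0.8 threshold, and the sample texts are all assumptions chosen for illustration. It flags a comment as a near-duplicate of a known form letter and then lists the word runs the commenter added.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words(text):
    """Return a lowercased term-frequency vector for the text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(form_letter, comment, threshold=0.8):
    """Hypothetical decision rule: similarity at or above an assumed threshold."""
    score = cosine_similarity(bag_of_words(form_letter), bag_of_words(comment))
    return score >= threshold


def unique_text(form_letter, comment):
    """Return runs of words in the comment that do not align with the form letter."""
    form_words = form_letter.split()
    comment_words = comment.split()
    matcher = SequenceMatcher(None, form_words, comment_words, autojunk=False)
    added = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.append(" ".join(comment_words[j1:j2]))
    return added


# Illustrative (made-up) texts, not data from the paper.
form = "Please protect our wetlands. I urge the agency to adopt the proposed rule."
edited = ("Please protect our wetlands. My family farms near the marsh. "
          "I urge the agency to adopt the proposed rule.")

if is_near_duplicate(form, edited):
    for passage in unique_text(form, edited):
        print("Added by commenter:", passage)
```

The second step matters for the concern the abstract raises: once a comment is grouped with its form letter, only the added passages need human review, so a substantive customization is less likely to be overlooked.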

Hui Yang, Jamie Callan, Stuart W. Shulman

DGO 2006 | DGO 2007 | Form Letters | Near-duplicate Detection | Public Comments

Post Info

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2006
Where	DGO
Authors	Hui Yang, Jamie Callan, Stuart W. Shulman
