Text-Based Document Similarity Matching Using Sdtext

Abstract: Forensic examiners frequently try to identify duplicate files during an investigation, either to find known files of interest or to speed up review of documents that appear similar. Current forensic tools for detecting duplicate files operate on the raw bits of a file, typically by hashing. While this is fast and effective in many cases, it can fail when the same content is stored in different file formats. We introduce sdtext, a tool developed to identify similar files based on their textual contents, which is robust to changes in format. We show that sdtext is far more accurate than existing tools in matching files that contain the same text in different formats.
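To illustrate why bit-level hashing can miss a match across formats, here is a minimal Python sketch. It is not sdtext's algorithm, which the abstract does not describe. It assumes a generic technique (Jaccard similarity over word 3-grams) and uses hypothetical helper names (`byte_hash`, `word_shingles`, `jaccard`, `strip_tags`) and made-up sample inputs.

```python
import hashlib
import re


def byte_hash(data: bytes) -> str:
    """Hash the raw bytes, as a bit-level duplicate detector would."""
    return hashlib.sha256(data).hexdigest()


def word_shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles from lowercased alphanumeric words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def strip_tags(html: str) -> str:
    """Crude tag removal, standing in for real per-format text extraction."""
    return re.sub(r"<[^>]+>", " ", html)


# The same sentence stored as plain text and as HTML (illustrative data only).
plain = "Forensic examiners frequently try to identify duplicate files."
html = ("<html><body><p>Forensic examiners <b>frequently</b> try to "
        "identify duplicate files.</p></body></html>")
unrelated = "The quarterly budget report is due at the end of next month."

# The byte-level hashes differ because the markup changes the bytes.
same_hash = byte_hash(plain.encode()) == byte_hash(html.encode())

# The text-level comparison ignores the markup.
similarity = jaccard(word_shingles(plain), word_shingles(strip_tags(html)))
unrelated_similarity = jaccard(word_shingles(plain), word_shingles(unrelated))

print(same_hash, similarity, unrelated_similarity)
```

In this toy case the two encodings of the sentence hash differently yet share every word shingle, while the unrelated sentence shares none. A real tool would need proper text extraction for each file format, which the regular expression above only approximates.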
Clay Shields
Added 03 Apr 2016
Updated 03 Apr 2016
Type Journal
Year 2016
Authors Clay Shields