Business Insight from Collection of Unstructured Formatted Documents with IBM Content Harvester

In this paper, we report on the development of, and experiments with, IBM Content Harvester (CH), a tool for analyzing and recovering templates and content from text documents created with word processors. CH is part of a larger effort to collect and reuse material generated in business service engagements. Specifically, it operates on unstructured formatted documents: it extracts content, cleanses it of sensitive information, tags it with user-defined or domain-defined labels, and makes it available for flexible querying and for publishing in any open format. As a result, one can search for specific information by tag, aggregate information regardless of document source or formatting peculiarities, and publish the content in any format or template. CH has been applied to a broad variety of document collections containing hundreds of documents, including live engagements, to promising effect.
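
The extract, cleanse, tag, and query stages described in the abstract can be illustrated with a small sketch. This is not CH's implementation, which the abstract does not detail; the names `Fragment`, `harvest`, and `query`, the regular expressions for sensitive data, and the keyword-based tagging rules are all hypothetical and chosen only to make the flow concrete.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for sensitive data; a real system would need a much richer set.
SENSITIVE = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
]

# Hypothetical user-defined labels, each mapped to trigger keywords.
TAG_KEYWORDS = {
    "risk": ("risk", "mitigation"),
    "schedule": ("milestone", "deadline"),
}


@dataclass
class Fragment:
    source: str                      # document the fragment came from
    text: str                        # cleansed paragraph text
    tags: set = field(default_factory=set)


def extract(source, raw):
    """Split a document's raw text into paragraph-level fragments."""
    return [Fragment(source, p.strip()) for p in raw.split("\n\n") if p.strip()]


def cleanse(fragment):
    """Replace anything matching a sensitive pattern with a placeholder."""
    for pattern, placeholder in SENSITIVE:
        fragment.text = pattern.sub(placeholder, fragment.text)
    return fragment


def tag(fragment):
    """Attach every label whose keywords occur in the fragment text."""
    lowered = fragment.text.lower()
    for label, keywords in TAG_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            fragment.tags.add(label)
    return fragment


def harvest(docs):
    """Run extract -> cleanse -> tag over a mapping of source name to raw text."""
    return [tag(cleanse(f)) for source, raw in docs.items() for f in extract(source, raw)]


def query(fragments, label):
    """Return fragments carrying the given tag, regardless of source document."""
    return [f for f in fragments if label in f.tags]


if __name__ == "__main__":
    docs = {
        "engagement_a.txt": "Key risk: vendor delay.\n\nContact jane@example.com for details.",
        "engagement_b.txt": "Next milestone is due in May.\n\nMitigation plan is under review.",
    }
    for f in query(harvest(docs), "risk"):
        print(f.source, "->", f.text)
```

Because every fragment keeps its source name and tags, a query by tag aggregates content across documents independently of how each one was formatted, which is the property the abstract emphasizes.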

Biplav Srivastava, Yuan-Chi Chang

COMAD 2008 | COMAD 2009 | IBM Content Harvester | Processor Created Text | Unstructured Formatted Documents

Post Info

Added	09 Nov 2010
Updated	09 Nov 2010
Type	Conference
Year	2009
Where	COMAD
Authors	Biplav Srivastava, Yuan-Chi Chang
