Document Length Normalization by Statistical Regression


The document-length normalization problem has been widely studied in the field of Information Retrieval. Cosine Normalization [2], Maximum tf Normalization [1] and Byte Length Normalization [12] are the most commonly used normalization techniques. In [14], the authors studied the retrieval probability of documents with respect to their size, using different similarity measures, and showed that none of the existing measures retrieves documents of different lengths with the same probability. We first show here that document and query sizes do indeed strongly influence the expected similarity score. We therefore propose to perform a statistical regression of the distribution of similarity scores with respect to document and query sizes in order to normalize the scores. Experimental results suggest that our approach yields fairer similarity judgments, both in classical Information Retrieval and when applied to a document clustering process.
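
The regression idea in the abstract can be illustrated with a minimal sketch. The abstract does not state which regression model or normalization formula the paper uses, so everything specific here is an assumption made for illustration: the log-linear features, the ordinary least-squares fit, the residual-based normalization, and all function names (`fit_expected_score`, `expected_score`, `normalized_score`) are hypothetical and not taken from the paper.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: sizes do not vary enough")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def features(doc_len, query_len):
    # Assumed model (not from the paper): the expected score is linear in
    # the logarithms of the document and query sizes, plus an intercept.
    return [1.0, math.log(doc_len), math.log(query_len)]


def fit_expected_score(samples):
    """Least-squares fit on a list of (doc_len, query_len, raw_score) tuples.

    Returns the coefficients [intercept, doc_coef, query_coef].
    """
    rows = [features(d, q) for d, q, _ in samples]
    ys = [s for _, _, s in samples]
    k = len(rows[0])
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def expected_score(coef, doc_len, query_len):
    """Score that the fitted model predicts from the sizes alone."""
    return sum(c * f for c, f in zip(coef, features(doc_len, query_len)))


def normalized_score(coef, doc_len, query_len, raw_score):
    """Residual: how far the raw score lies above the size-based expectation."""
    return raw_score - expected_score(coef, doc_len, query_len)
```

In this sketch, a document is ranked by how much its raw similarity exceeds what its size and the query size would predict, so that long and short documents are compared against their own expected score rather than against each other directly.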

Sylvain Lamprier, Tassadit Amghar, Bernard Levrat, Frédéric Saubion


Artificial Intelligence | Byte Length Normalization | Document-length Normalization Problem | ICTAI 2007 | Maximum Tf Normalization |



Post Info

Added	03 Jun 2010
Updated	03 Jun 2010
Type	Conference
Year	2007
Where	ICTAI
Authors	Sylvain Lamprier, Tassadit Amghar, Bernard Levrat, Frédéric Saubion
