Measuring ineffectiveness

15 years 10 months ago


An evaluation methodology that targets ineffective topics is needed to support research on obtaining more consistent retrieval across topics. Using average values of traditional evaluation measures is not an appropriate methodology because it emphasizes effective topics: poorly performing topics’ scores are by definition small, and they are therefore difficult to distinguish from the noise inherent in retrieval evaluation. We examine two new measures that emphasize a system’s worst topics. While these measures focus on different aspects of retrieval behavior than traditional measures, the measures are less stable than traditional measures and the margin of error associated with the new measures is large relative to the observed differences in scores.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation

General Terms: Measurement, Experimentation

Keywords: evaluation, worst-case behavior
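
The abstract's point about averages can be illustrated with a small sketch. The per-topic scores below are invented for illustration, and `worst_k_mean` is a hypothetical helper, not one of the two measures examined in the paper (the abstract does not define them). It only shows how two systems with the same mean score can behave very differently on their poorest topics.

```python
from statistics import mean


def worst_k_mean(scores, k):
    """Mean of the k lowest per-topic scores.

    Hypothetical illustrative measure, not taken from the paper.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of topics")
    return mean(sorted(scores)[:k])


# Invented per-topic scores (e.g. average precision) for two systems
# evaluated on the same eight topics.
system_a = [0.90, 0.80, 0.70, 0.60, 0.02, 0.02, 0.01, 0.01]
system_b = [0.80, 0.70, 0.60, 0.50, 0.12, 0.12, 0.11, 0.11]

# The overall mean is driven by the high-scoring topics.
overall_a = mean(system_a)
overall_b = mean(system_b)

# Restricting attention to the four worst topics separates the systems.
worst_a = worst_k_mean(system_a, 4)
worst_b = worst_k_mean(system_b, 4)

print(f"mean over all topics: A={overall_a:.4f}  B={overall_b:.4f}")
print(f"mean over 4 worst:    A={worst_a:.4f}  B={worst_b:.4f}")
```

In this toy data the two overall means coincide, while the worst-topic means differ by 0.1 in absolute terms. A difference of that size is small next to the variation among well-performing topics, which is consistent with the abstract's caution that worst-topic measures are hard to separate from evaluation noise.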

Ellen M. Voorhees
