Do TREC web collections look like the web?

We measure the WT10g test collection, used in the TREC-9 and TREC 2001 Web Tracks, and the .GOV test collection, used in the TREC 2002 Web and Interactive Tracks, with common measures used in the web topology community, in order to see if these collections "look like" the web. This is not an idle question; characteristics of the web, such as power law relationships, diameter, and connected components, have all been observed within the scope of general web crawls, constructed by blindly following links. The .GOV collection is a fairly straightforward 18 GB crawl of sites in the .gov domain. In contrast, WT10g was carved out from a much larger crawl specifically to be a web search test collection within the reach of university researchers. Do such collections retain the properties of the larger web? In the case of WT10g and .GOV, yes.
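As a rough illustration of two of the measures named in the abstract, the sketch below computes an in-degree histogram (the raw data behind a power-law plot) and the weakly connected components of a link graph. It is not the paper's method: the function names and the six-page toy edge list are hypothetical, chosen only to show the kind of computation involved.

```python
from collections import Counter, deque


def in_degree_histogram(edges):
    """Map each in-degree value to the number of pages that have it."""
    pages = {page for edge in edges for page in edge}
    in_deg = Counter(dst for _, dst in edges)
    # Counter returns 0 for pages that nobody links to.
    return Counter(in_deg[page] for page in pages)


def weakly_connected_components(edges):
    """Group pages into components, ignoring link direction."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited page.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical toy link graph: (source page, destination page).
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("e", "f")]

histogram = in_degree_histogram(toy_edges)
component_sizes = sorted(len(c) for c in weakly_connected_components(toy_edges))
print(dict(histogram))
print(component_sizes)
```

On a real collection, the histogram would typically be plotted on log-log axes to check for an approximately straight line, which is the usual informal sign of a power law.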
Ian Soboroff
Added 23 Dec 2010
Updated 23 Dec 2010
Type Journal
Year 2002
Authors Ian Soboroff