The power of aggregation
Asking a representative sample of some 5,000 people about their voting intentions will produce a good estimate of the outcome of an election in the United States, a country of 300m people. What would it tell you if you had the power to aggregate the behaviour of 2m people?
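A rough sketch of the arithmetic behind that claim, assuming a simple random sample and the standard normal-approximation formula for the margin of error of a proportion. The function name `margin_of_error` and its parameters are illustrative, and a real poll or internet panel is never a perfectly random sample, so treat the numbers as a best case.

```python
import math

def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated
    from a simple random sample of size n (assumed, idealised case)."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction: barely matters when n << population.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

US_POPULATION = 300_000_000  # figure used in the post

moe_poll = margin_of_error(5_000, population=US_POPULATION)
moe_panel = margin_of_error(2_000_000, population=US_POPULATION)

print(f"5,000 people:     +/- {moe_poll * 100:.2f} percentage points")
print(f"2,000,000 people: +/- {moe_panel * 100:.3f} percentage points")
```

Under these assumptions, 5,000 respondents give roughly plus or minus 1.4 percentage points, and the size of the country hardly enters into it. A sample of 2m shrinks the sampling error to well under a tenth of a point, at which stage how representative the sample is matters far more than how big it is.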
Fred Wilson reports at A VC that ComScore is using the power of the internet to compile estimates of e-commerce activity in the US, and that the numbers are very close to the official government numbers. They do so by monitoring a sample of 2m people. (But who says the government's figures are the most accurate? Maybe ComScore's are.)
Those with the power to aggregate and estimate have made fortunes in their day; Gallup and Nielsen Media Research are examples. With the internet, new models are becoming possible, an opportunity being explored by ComScore, YouGov and others.
It is not just numbers that can be aggregated but also knowledge, as shown by the awesome Wikipedia project started by Jimmy Wales. Google aggregates link and search behaviour to come up with relevant search results and advertising for queries. And then of course there is my own crazy idea to piece together a detailed map of the planet by geotagging amateur aerial photographs, under a Creative Commons license of course.
The truth is out there; you just have to figure out the best mechanism to aggregate the data. The internet is part of that mechanism.
Having been buried in data analysis for the last couple of weeks, I can't help but respond to your comment:
"Asking a representative sample of some 5,000 people about their voting intentions will produce a good estimate of the outcome of an election in the United States, a country of 300m people."
Actually, I think the idea is to ask a NON-representative sample. You want to find the swing voters and not survey the people who are not going to change their minds.
The difference in votes between the main candidates at the last election was about 3 million.
(Source: http://en.wikipedia.org/wiki/U.S._presidential_election,_2004 )
That's the number you are really trying to predict. (Alternatively: the number of swing voters is probably of the same order of magnitude, say 6M or 5% of the voting population.)
As long as you can identify this group (and that's where the pollsters sometimes go spectacularly wrong), then 5,000 is a very reasonable sample size (sqrt(6M) ≈ 2,450).
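A quick check of the numbers in this comment, as a sketch only. The 6M swing voters and the 5% share are the guesses made above, the square-root rule is a rule of thumb rather than a standard sampling formula, and the variable names are illustrative.

```python
import math

swing_voters = 6_000_000   # guess from the comment
share_percent = 5          # swing voters as a share of the voting population

# Voting population implied by "6M or 5%"
implied_electorate = swing_voters * 100 / share_percent

# Square-root rule of thumb quoted in the comment
rule_of_thumb_sample = math.sqrt(swing_voters)

# For comparison: the usual 95% margin of error for a proportion,
# assuming 5,000 people drawn at random from the swing group
sample_size = 5_000
moe = 1.96 * math.sqrt(0.5 * 0.5 / sample_size)

print(f"implied electorate:  {implied_electorate:,.0f}")
print(f"sqrt(swing voters):  {rule_of_thumb_sample:,.0f}")
print(f"margin of error:     +/- {moe * 100:.2f} percentage points")
```

So 5,000 is about twice the square-root figure, and if the swing group really can be sampled at random, the estimate for that group comes out at roughly plus or minus 1.4 percentage points.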
You only need to frequently sample the people who are changing.
That's kind of why Technorati is a better guide to new ideas than the default Google search.
Focus on the change, the boundary, the interface. For sure you need to understand the whole data set in order to predict where that interface is, but once you know, don't sweat it. Focus your energy on sampling the interface.
And then just be careful that somebody else doesn't spot and exploit a change in the boundaries before you.
Posted by: Allan | 27 February 2006 at 21:57
Lars,
You may be interested in comparing the different aggregation models of edgeio, iNods and Reevoo. Links to all via this post on my blog:
http://if.futurescape.co.uk/2006/02/aggregating_the.html
Posted by: Colin Donald | 28 February 2006 at 10:11