
The power of aggregation

Asking a representative sample of some 5,000 people about their voting intentions will produce a good estimate of the outcome of an election in the United States, a country of 300m people. What would it tell you if you had the power to aggregate the behaviour of 2m people?
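To put rough numbers on that, here is a minimal sketch of the textbook margin-of-error formula for a polled proportion. It assumes a simple random sample, a 95% confidence level (z = 1.96) and the worst case of an evenly split electorate (p = 0.5); the function name and its parameters are illustrative, and real polls add weighting and turnout modelling that this ignores.

```python
import math


def margin_of_error(n, p=0.5, z=1.96, population=None):
    """Half-width of the normal-approximation confidence interval
    for a proportion p estimated from a simple random sample of size n."""
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction: shrinks the error slightly
        # when the sample is a noticeable share of the population.
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error


if __name__ == "__main__":
    # Assumed population of 300m, as in the text above.
    for sample_size in (5_000, 2_000_000):
        moe = margin_of_error(sample_size, population=300_000_000)
        print(f"n = {sample_size:>9,}: +/- {100 * moe:.2f} percentage points")
```

Under these assumptions a sample of 5,000 lands within roughly 1.4 percentage points, and a sample of 2m within well under a tenth of a point. The size of the population barely matters; what matters is the size of the sample and whether it really is representative.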
Fred Wilson reports at A VC that ComScore is using the power of the internet to compile estimates of e-commerce activity in the US, and that the numbers are very close to the official government numbers. They do so by monitoring a sample of 2m people. (But who says the government's figures are the most accurate? Maybe ComScore's are.)
Those with the power to aggregate and estimate have made fortunes in their day; Gallup and Nielsen Media Research are examples. With the internet, new models are becoming possible, an opportunity explored by ComScore, YouGov and others.
It is not just numbers that can be aggregated but also knowledge, as shown by the awesome Wikipedia project started by Jimmy Wales. Google is aggregating link and search behaviour to come up with relevant search results and advertising for queries. And then of course there is my own crazy idea to piece together a detailed map of the planet by geotagging amateur aerial photographs, under a Creative Commons license of course.
The truth is out there; you just have to figure out the best mechanism to aggregate the data. The internet is part of that mechanism.

Comments

Allan

Having been buried in data analysis for the last couple of weeks, I can't help but respond to your comment:

"Asking a representative sample of some 5,000 people about their voting intentions will produce a good estimate of the outcome of an election in the United States, a country of 300m people."

Actually, I think the idea is to ask a NON-representative sample. You want to find the swing voters and not survey the people who are not going to change their minds.

The difference in votes between the main candidates at the last election was about 3 million.

(Source: http://en.wikipedia.org/wiki/U.S._presidential_election,_2004 )

That's the number you are really trying to predict. (Alternatively: the number of swing voters is probably of the same order of magnitude, say 6M or 5% of the voting population.)

As long as you can identify this group (and that's where the pollsters sometimes go spectacularly wrong), then 5,000 is a very reasonable sample size (sqrt(6M) ≈ 2,450).

You only need to frequently sample the people who are changing.

That's kind of why Technorati is a better guide to new ideas than the default Google search.

Focus on the change, the boundary, the interface. For sure you need to understand the whole data set in order to predict where that interface is, but once you know, don't sweat it. Focus your energy on sampling the interface.

And then just be careful that somebody else doesn't spot and exploit a change in the boundaries before you.

Colin Donald

Lars,

You may be interested in comparing the different aggregation models of edgeio, iNods and Reevoo. Links to all via this post on my blog:

http://if.futurescape.co.uk/2006/02/aggregating_the.html
