To paraphrase the context of this quote: all of the information created between the beginning of time and 2003 amounted to about 5 EB, give or take (an exabyte is 10^18 bytes, or 1 billion GB). But today, according to him, “we” create 5 EB every 2 days. It is not clear whether this is the majestic plural “we”, as in Google, or “we” as in the entire world. Either way, 5 EB is an awful lot of data (those drunken cat videos on YouTube are really starting to add up…).
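To get a feel for the scale, here is a quick back-of-the-envelope sketch. It assumes decimal (SI) units and takes the quoted "5 EB every 2 days" at face value; the variable names are mine, not anything from the quote.

```python
# Back-of-the-envelope scale check, assuming decimal (SI) units.
EB = 10**18  # bytes in one exabyte
GB = 10**9   # bytes in one gigabyte
TB = 10**12  # bytes in one terabyte

data_bytes = 5 * EB   # the quoted figure
period_days = 2       # "every 2 days"

gb_total = data_bytes // GB                 # 5 EB expressed in GB
bytes_per_day = data_bytes / period_days    # average daily rate
bytes_per_second = bytes_per_day / 86_400   # 86,400 seconds in a day
eb_per_year = 5 / period_days * 365         # ignoring leap years

print(f"5 EB = {gb_total:,} GB")
print(f"Rate: {bytes_per_day / EB:.1f} EB/day, about {bytes_per_second / TB:.1f} TB/s")
print(f"Per year: {eb_per_year:.1f} EB")
```

If the quote holds, that works out to roughly 29 TB of new data every second, or more than 900 EB a year.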
Which brings me to my point: the signal-to-noise ratio of that 5 EB is probably low and getting lower every day. And I think we’re starting to get some hard data to prove it. For example, at Chirp, Twitter co-founder Biz Stone mentioned that Twitter has 105M+ registered users, with 300k new users each day, and 3B+ requests to their API each day. That is a lot of users and a lot of traffic. But how many of those users are active? 20%? 80%? According to the following article, 5% of those Twitter users are responsible for 75% of all traffic, with a rapid drop-off in activity beyond that 5%:
If the premise of this article is true, then only about 20% of Twitter users are active in a measurable way. And the most active 5% are clogging the pipes with an enormous volume of Tweets, from which we might infer that the signal-to-noise ratio of those Tweets is not very high. Ask yourself this question: what percentage of the Tweets that you see are relevant to you in any way? These numbers will only get worse in terms of information density as more people sign up for Twitter but otherwise don’t use it in a measurable (or relevant) way.
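To put that concentration in perspective, here is a rough sketch that uses only the figures quoted above (105M registered users, 5% of users, 75% of traffic). It assumes "traffic" can be treated as one uniform quantity split between the two groups, which is a simplification on my part.

```python
# Rough concentration estimate from the quoted figures.
# Assumption: "traffic" is a single quantity shared between two groups.
registered_users = 105_000_000  # "105M+" registered users
top_user_share = 0.05           # the most active 5% of users
top_traffic_share = 0.75        # ...generate 75% of all traffic

top_users = registered_users * top_user_share

# Average traffic per user in each group, relative to the overall average.
per_user_top = top_traffic_share / top_user_share
per_user_rest = (1 - top_traffic_share) / (1 - top_user_share)
ratio = per_user_top / per_user_rest

print(f"Most active group: about {top_users / 1e6:.2f}M users")
print(f"Each generates roughly {ratio:.0f}x the traffic of an average user outside that group")
```

Under those assumptions, about 5.25M accounts each produce on the order of 57 times the traffic of a typical account in the other 95%.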
But back to the original quote. If Google is processing 5 EB of new data every two days, then they definitely need to keep building new data centers in the Columbia River Gorge and wherever else power is inexpensive. But perhaps what they really need is better filtering technology to drop the 80% of that 5 EB that they don’t really need to keep. Maybe Google needs to keep 5 EB every 2 days to ensure that the 1 EB of good data buried within is not lost, but if they could separate the wheat from the chaff, then they could (1) build fewer data centers, which their investors would really like, and (2) deliver information to us with a much higher signal-to-noise ratio, which we would really, really like.
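Here is what that hypothetical filter would be worth over a year. This is only a sketch: the 80% figure is my guess from the paragraph above, not a measured number, and it assumes storage needs scale linearly with the data kept.

```python
# What dropping 80% of the intake would mean over a year.
# Assumptions: the 80% drop rate is a guess, and storage scales
# linearly with the amount of data kept.
intake_eb = 5        # EB arriving per period
period_days = 2      # length of one period
drop_percent = 80    # share assumed to be noise

kept_eb = intake_eb * (100 - drop_percent) / 100   # EB kept per period
periods_per_year = 365 / period_days

kept_per_year = kept_eb * periods_per_year
dropped_per_year = (intake_eb - kept_eb) * periods_per_year

print(f"Kept per period: {kept_eb:.1f} EB")
print(f"Kept per year: {kept_per_year:.1f} EB")
print(f"Dropped per year: {dropped_per_year:.1f} EB")
```

On those numbers, a working filter would keep about 180 EB a year and avoid storing about 730 EB.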
Google made a name for itself by making search results more relevant than those of any company that came before it; we can only hope that they (and Twitter, and Facebook, et al.) will continue to pursue information relevance with similar zeal as they process those 5 EB every two days. Otherwise, we should all plan for a whole lot more drunken cat videos.