
How to transfer 120 terabytes of data?

by Jonathan Bruce Posted on March 20, 2007

Wired Magazine continues a fascinating thread, which I first picked up on Jonathan Schwartz's blog, on what may quickly be becoming the next frontier for enterprise computing. The problem can be framed quite succinctly: how can you transfer vast quantities of data from one place to another? Schwartz poses the problem: if you have a petabyte of data (that's a million gigabytes), what would be the most efficient way of transferring it from, say, San Francisco to Hong Kong? He goes on to paint a rather bleak picture: "So if you had a half megabit per second internet connection, which is relatively high in the US (relatively low compared to residential bandwidth available in, say, Korea), it'd take you 16 billion seconds, or 266 million minutes, or 507 years to transmit the data." In fact, by his calculation you could record this amount of information on a set of hard disks with equivalent storage capacity, sail leisurely across the Pacific Ocean, and still deliver the information faster.

Ridiculous as it sounds, this is a reasonable solution until you get to the problem facing the Hubble telescope. Google's Chris DiBona is reported to have met with NASA to determine an effective way of solving this problem, and I wonder whether it was with the knowledge that Schwartz's solution is not a practical approach. The solution: FedExNet. It works something like this: Google packages dedicated machines, which are then shipped to teams of scientists across the globe. Each team then copies its portion of the Hubble telescope data onto the machine and returns it to Google. There the consolidation process takes place and the archive grows. Should a team want the data back, the process can simply be reversed. You can read more on what Google intends to do with all this data in the Wired article, but it's interesting that resorting to physical media remains the optimal solution.
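The figures in Schwartz's quote are easy to check. The sketch below is only a back-of-the-envelope calculation, and it rests on assumptions that are mine rather than his: decimal units (1 petabyte = 10^15 bytes), a link that sustains exactly half a megabit per second with no protocol overhead, and a year of 365.25 days. The 30-day ocean crossing used for the comparison is a purely hypothetical number chosen for illustration.

```python
# Back-of-the-envelope check of the quoted transfer times.
# Assumptions (not from the original posts): decimal units,
# a perfectly sustained link, and no protocol overhead.

PETABYTE_BYTES = 10**15             # 1 PB = a million gigabytes (decimal)
LINK_BITS_PER_SECOND = 0.5 * 10**6  # half a megabit per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def transfer_seconds(size_bytes, bits_per_second):
    """Time to push size_bytes through a link of the given speed."""
    return size_bytes * 8 / bits_per_second


def effective_bits_per_second(size_bytes, transit_seconds):
    """Bandwidth achieved by physically moving the data in transit_seconds."""
    return size_bytes * 8 / transit_seconds


seconds = transfer_seconds(PETABYTE_BYTES, LINK_BITS_PER_SECOND)
minutes = seconds / 60
years = seconds / SECONDS_PER_YEAR

# Hypothetical 30-day voyage carrying the same petabyte on disks.
VOYAGE_SECONDS = 30 * 24 * 3600
boat_bps = effective_bits_per_second(PETABYTE_BYTES, VOYAGE_SECONDS)

print(f"network: {seconds:.3g} s, {minutes:.3g} min, {years:.0f} years")
print(f"boat:    {boat_bps / 1e9:.2f} Gbit/s effective")
```

Under these assumptions the network numbers land on the 16 billion seconds and roughly 507 years in the quote, while even a slow boat works out to an effective rate in the gigabits per second, which is why shipping disks wins so comfortably.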
Given the infrastructure projections that I've read, it seems this approach will remain the same for some time, but data consolidation on such a massive scale will likely lend itself to a new set of data access patterns. Seismic shifts in query strategies, techniques and technologies are likely to follow, to ensure that applications can extract discrete but sufficiently useful amounts of information from these mega-databases. Perhaps a community-driven query engine will emerge that leverages Web 2.0 tagging to splice queries together into more efficient ones? Given the volume of data, that may not turn out to be that crazy of an idea...


