Friday, July 25, 2008

Saving Worlds

"Always back up your data" was the rule of thumb, soon after the personal computer showed up in homes, and people started losing work. A single hard drive might've had all this redundancy, multiple writes of the same stuff, but that wouldn't have been a realistic response to the likelihoods. When a hard drive fails, it's often a mechanical failure, a part wearing out. Multiple writes of the same data blocks doesn't help, as the platens themselves are now inaccessible.

Fast forward to RAID, redundant arrays of independent disks. Now it really does make more sense to mirror data across drives, such that a mechanical failure means a hot swap (insert a new drive without powering down) and a mathematically low risk of data loss. Mirroring across independently breakable parts is sound engineering; it solves a real problem.

Fast forward again, and now we're looking at huge rack spaces of commodity hardware, with attendant problems of dissipating heat, and at interconnecting these clusters to mirror data across data centers. Why? Because you want your web services in closer physical proximity to users, for performance reasons, and because you want robust mirroring, as before. Frameworks like Hadoop and GFS solve these kinds of problems.

So what's the analogous trajectory in terms of archival or peripheral media? On a personal computer, once you back off to DVD or flash drive, you have a mounting issue. Not all media remain mounted; much goes offline. Sometimes the archived copies are the only copies, i.e. we're not mirroring disk content but saving a snapshot of what a disk used to contain and contains no longer.

The DVD jukebox is another good example, hearkening back to those signature tape drives of the early mainframe business, with personnel taking them from passive storage and mounting them on high-speed spoolers (serial access to magnetic tape -- one of the earliest storage media).

What we come back to, using Fuller's namespace some, is a question of lag times. If one assumes accessibility in principle, i.e. the data is saved somewhere, then it's more a question of "when" one gets access. One might imagine a science fiction story in which some time capsule storage gets mounted only once every hundred years. Trying to mirror the whole shebang is way too energy intensive, so the smart searchers just home in on the specific puzzle pieces they need, like it's some game with glass beads.

Safe to say in many cases: the answers don't arrive in the same order the questions were posed. Put another way, the "jobs queue" and "results stack" are in need of collation, of sorting.
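
A minimal sketch of that collation step in Python, assuming each question gets tagged with a job number before being dispatched (the questions and answers here are made up for illustration):

    # Questions go out in order; answers come back in whatever order
    # the archives happen to yield them. Tagging each job with a number
    # lets us collate the results stack back against the jobs queue.

    jobs = list(enumerate(["where is tape 42?",
                           "what's on DVD 7?",
                           "who has the 1959 census?"]))

    # results arrive out of order, each carrying its job number
    results = [(2, "vault B, shelf 3"),
               (0, "mounted in jukebox 9"),
               (1, "a snapshot of /home, circa 2001")]

    answers = dict(results)            # job number -> answer
    for job_id, question in jobs:      # walk the queue in original order
        print(job_id, question, "->", answers.get(job_id, "still pending"))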

MapReduce might be the appropriate design pattern here: the initial key is a job number, parallel processes do the information harvesting, and the reduce step carries that original job number through, so the original submitter has at least a fighting chance of knowing when prayers have been answered (ding!... wish 08-7564 is now granted).
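
A toy rendering of that pattern in Python, nothing like production Hadoop, just the bookkeeping idea, with the job numbers and "harvesting" step invented for the example:

    from itertools import groupby
    from operator import itemgetter

    def map_phase(records):
        # emit (job_number, fragment) pairs -- the harvesting step,
        # which in real life would fan out across many machines
        for job_number, raw in records:
            yield (job_number, raw.upper())    # stand-in for real work

    def reduce_phase(pairs):
        # shuffle/sort by job number, then combine each job's fragments,
        # so results come back under the key the submitter used
        keyed = sorted(pairs, key=itemgetter(0))
        for job_number, group in groupby(keyed, key=itemgetter(0)):
            fragments = [frag for _, frag in group]
            yield (job_number, " / ".join(fragments))

    records = [("08-7564", "first piece"),
               ("08-0001", "lone piece"),
               ("08-7564", "second piece")]

    for job_number, answer in reduce_phase(map_phase(records)):
        print("ding!... wish %s is now granted: %s" % (job_number, answer))

The point of the sketch is only that the job number survives the round trip: however scattered the parallel work, reduce hands each answer back under its original key.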