The plan for splitting up data in Storage@home

The plan:
1. Introduction
2. The parameters for data division
3. How the data will be divided up
4. Conclusion

1. Introduction

The Storage@home project requires 24/7 data availability and fault-tolerant storage. The most obvious answer is redundancy (multiple copies of the data), and according to Adam Beberg's 2007 article this redundancy was to be set at 4 copies. However, redundancy alone is not enough for the project...
Beberg cites as an example the earthquake of December 26th, 2006, which cut off internet access for 6 Asian countries at once. If by chance the 4 copies of a data set were all on machines in the affected countries, then all 4 copies would be offline and possibly lost! Users of a project like Storage@home (SAH) can be modelled using data gathered during the first phase of the project: if many users share the same online/offline habits and live in the same sort of area, then the data is in danger if all of its copies sit on computers owned by those users. Beberg is analysing the parameters that emerge from this data, and here are some that he has isolated.

2. The parameters for data division

The most obvious parameter is geographic location, but how do we geotag an IP address? It's actually quite simple. IP address ranges are allocated by IANA (which operates under ICANN) to the Regional Internet Registries, whose public WHOIS databases record which organisation each range, and where appropriate each individual IP, was assigned to. This data can be used to build a simple chart of the distance between internet users: the greater the distance, the smaller the risk of a climatic, geological or geo-political event affecting both users. This logic can be pushed further. Each IP is necessarily a point of internet access, assigned to an ISP. SAH will ensure that multiple copies of one data set do not go to people with the same ISP, in case that ISP starts blocking packets, goes bust, or similar. (editor's note: what about virtual ISPs using another ISP's network? a tricky one for Beberg to answer)
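To make this concrete, here is a minimal sketch of how a pair of geotagged hosts could be screened before they are allowed to hold copies of the same block. The field names, the 500 km threshold and the great-circle (haversine) distance check are illustrative assumptions on my part, not the actual Storage@home code:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def acceptable_replica_pair(host_a, host_b, min_km=500.0):
    """Reject pairs that share an ISP or sit too close together."""
    if host_a["isp"] == host_b["isp"]:
        return False  # same ISP: one outage could take both copies down
    return haversine_km(host_a["lat"], host_a["lon"],
                        host_b["lat"], host_b["lon"]) >= min_km

# Two hypothetical geotagged hosts (coordinates: Tokyo and Paris).
tokyo = {"isp": "ISP-A", "lat": 35.68, "lon": 139.69}
paris = {"isp": "ISP-B", "lat": 48.86, "lon": 2.35}
print(acceptable_replica_pair(tokyo, paris))  # True: far apart, different ISPs
```

A real placement algorithm would weigh distance rather than apply a hard cutoff, but the idea is the same: every extra kilometre and every distinct ISP lowers the chance of correlated failure.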

Operating systems are another important parameter. Microsoft Windows, in its various guises, runs on over 90% of the world's PCs, with roughly 7% running Mac OS X and under 1% running Linux or other Unix variants. This dominance represents a potential risk for SAH. Imagine a zero-day flaw in Windows being exploited by a destructive virus; the Blaster worm gives a taste of what is possible, and it exploited a flaw for which a patch already existed. Such a virus could knock a large number of machines offline, or make them inaccessible, in a very short period of time. It is therefore important that every piece of data is also available on at least one Mac OS, Unix or Linux machine, to maximise availability in such an event. By the same token, the fragmentation of Windows into many coexisting versions, not all vulnerable to the same flaws, could actually work in the project's favour.
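The "at least one non-Windows replica" rule can be expressed as a trivial placement check. This is an illustrative sketch under my own assumptions about field names; it is not the actual Storage@home protocol:

```python
# OS labels a host might report; the set itself is an assumption.
NON_WINDOWS = {"macos", "linux", "bsd", "solaris"}

def os_diverse(replica_hosts):
    """True if at least one of the block's replicas runs a non-Windows OS."""
    return any(h["os"] in NON_WINDOWS for h in replica_hosts)

# A block with 4 copies: three Windows hosts and one Linux host passes.
replicas = [{"os": "windows"}, {"os": "windows"},
            {"os": "linux"}, {"os": "windows"}]
print(os_diverse(replicas))  # True: one Linux replica survives a Windows worm
```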

A third important point: given the global nature of the internet, not all machines are in the same time zone. Windows Update, by default, installs updates at 3am local time and reboots if necessary. If many hosts holding copies of the same data share a time zone, they may all go offline simultaneously, and even a short window of collective downtime puts the project at risk.
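A quick sketch of why spreading replicas across time zones helps: if Windows reboots at 03:00 local time, replicas with different UTC offsets reboot at different UTC hours. The function and offsets below are my own illustration, not project code:

```python
def reboot_windows_spread(utc_offsets):
    """Number of distinct UTC hours at which the replicas hit 03:00 local."""
    return len({(3 - off) % 24 for off in utc_offsets})

# Four replicas at UTC+9, UTC+1, UTC-5 and UTC-8: every 03:00-local reboot
# falls in a different UTC hour, so no two copies restart together.
print(reboot_windows_spread([+9, +1, -5, -8]))  # 4

# Four replicas all at UTC+0 would share a single reboot window.
print(reboot_windows_spread([0, 0, 0, 0]))  # 1
```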

There may be further parameters for the division algorithms that have not yet been identified by Stanford. The team will be constantly searching for new parameters to improve the algorithms used for assigning data, to improve reliability.

3. How the data will be divided up

The following stats were used by Stanford for the 2007 article:
  • Upload speed of user: 384Kbps (about 30KB/s)
  • Number of users expected: 100,000
  • Storage space donated per user: 10GB
  • Number of users in a user grid: 10
  • Number of copies of each data block: 4
  • Size of a data block: 100MB
  • Number of data blocks per user: 100
  • Total number of user crashes per day: <1000
  • Average time between two failures: 86 seconds
  • Repair operations per day: 1 million
  • Repair operations per second: 12
  • Network time used per user per day: 10 * 5.5 minutes
  • Upload used for repairs per day per user: 100MB
  • Average number of on-the-fly repair operations: ~4000
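Most of the derived figures in the list can be cross-checked from the base assumptions. The arithmetic below uses the article's ~30 KB/s usable upload figure (a raw 384 kbps would be 48 KB/s; the lower value presumably allows for overhead and shared use of the line):

```python
UPLOAD_KBPS = 30            # usable upload per user, KB/s (article's figure)
CRASHES_PER_DAY = 1000      # upper bound on user crashes per day
REPAIRS_PER_DAY = 1_000_000 # repair operations per day
REPAIR_MB_PER_USER = 100    # upload spent on repairs per user per day

SECONDS_PER_DAY = 86_400

mean_time_between_failures = SECONDS_PER_DAY / CRASHES_PER_DAY
repairs_per_second = REPAIRS_PER_DAY / SECONDS_PER_DAY
repair_minutes = (REPAIR_MB_PER_USER * 1000) / UPLOAD_KBPS / 60

print(mean_time_between_failures)  # 86.4 s -> the "86 seconds" figure
print(round(repairs_per_second))   # 12
print(round(repair_minutes, 1))    # 55.6 min, i.e. roughly 10 * 5.5 minutes
```

So the "10 * 5.5 minutes" of network time per day is simply the time needed to upload 100 MB of repair data at ~30 KB/s, spread over ten sessions.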

Consider the simplified example of 10 users in a grid, each of which can store 2 blocks of data.

The grid is fairly evenly distributed around the world, with a healthy proportion of non-Windows machines. Imagine the Japanese contributor is the unfortunate victim of an earthquake which cuts their internet connection.

The servers quickly detect the disappearance of the Japanese contributor and add a Brazilian user to the grid. The first block is then duplicated, on the server's orders, from the American, Chinese and French contributors who hold copies of it.

Once this is complete, the second block is duplicated from the copies held by the Korean, Moroccan and Canadian users.

The Japanese user is no longer a member of the grid; when he returns after his internet line is repaired, he will be assigned to a new grid. The Brazilian user keeps the blocks he has just acquired.
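The repair walk-through above can be sketched as a toy simulation. The country names, data layout and helper function are illustrative assumptions, not the actual Storage@home server logic:

```python
def repair_grid(placements, failed, replacement):
    """Replace a failed holder of each block with the replacement user.

    In the real system the surviving holders would upload their copies to
    the newcomer; here we just update the placement table.
    """
    for block, holders in placements.items():
        if failed in holders:
            holders.remove(failed)
            holders.append(replacement)
    return placements

# The two blocks from the example, each with 4 copies.
placements = {
    "block-1": ["Japan", "USA", "China", "France"],
    "block-2": ["Japan", "Korea", "Morocco", "Canada"],
}
repair_grid(placements, failed="Japan", replacement="Brazil")
print(placements["block-1"])  # ['USA', 'China', 'France', 'Brazil']
print(placements["block-2"])  # ['Korea', 'Morocco', 'Canada', 'Brazil']
```

After the repair both blocks are back to 4 copies, and the departed user no longer appears anywhere in the grid.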

4. Conclusion

Despite the numerous and obvious possible pitfalls, Stanford appears to have designed a fairly robust network. Splitting data across many ISPs, countries and continents should keep data loss to the minimum possible. The effects of Murphy's Law on the project remain to be seen, but it is clear that Stanford takes the diversity of machines "in the wild", and the fact that they are not all equal, very seriously!