Why fast WU return >>> volume of WU returns

It has been a long-standing debate in the folding community as to which is best: fewer units folded at greater speed, or more units folded but taking more time. The argument for the latter is normally based on PPD (points per day) as the "best" indicator of progress, but 7im on the Folding Forum has developed a couple of analogies to explain why this is not the case, which are reproduced here. The analogies are based on several project series totalling 1 million work units (WUs).

These 1 million work units are not all in one gigantic gumball machine, where the machine spits out a random gumball each time we ask for one. There are about 20-30 gumball machines, of which 2-5 machines may be giving out gumballs to your type of client. And in each of those gumball machines for your client type, we have one or more colors of gumballs, which I'll call WU Project numbers. Also, each gumball is numbered, and can only be sent out in a certain sequential order.

So, of the 20 or so gumball machines (work servers), your SMP client may only be able to get gumballs from a few. And from those few SMP machines, Pande Group may be sending out a mix of colors (projects). However, at other times, they might be sending out only one color of gumball from one specific machine.

The colors (projects) have different priorities, and those priorities change over time. We may get a mix of colors today, but next week we may only get RED. If one project needs more WUs completed, so the data can be compiled and analyzed prior to a grant renewal hearing, we might get GREEN for a whole month, and then we go back to a mix of colors for a while. Then Pande Group might notice a pattern developing in the BLUEs: a pattern that, once completed, might leapfrog the science forward. And in that case, all the remaining GREENs get discarded because they will never fit the new pattern.

Also note the gumball machines are never full to the top with 1 million balls just hanging around. A few thousand at most, because WU #1235 isn't even created until #1234 is returned. And the results from #1234 can affect how #1235 is created. Minor course corrections or changes can be made along the way.

Processing WUs is never a straight line; it's more of a matrix (see also Markov matrix). And because we have to go in a specific order, when you hold BLUE WU #1234 for an extra 0.8 days, then BLUE WU #1235 is delayed by almost a day, and so on. If you keep repeating the process, each work unit adds almost another day of delay. After 30 WUs, the end has been delayed by as much as a month. This is because you can't guarantee the "2nd" WU that you process in that same 1.8 days is also a BLUE work unit. It is more than likely a GREEN, because all the BLUEs are tied up an extra day.
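To put rough numbers on that delay, here is a minimal Python sketch. It assumes a strictly sequential chain in which each WU is only created once the previous one has been returned, a baseline turnaround of 1 day (implied by the "1.8 days" figure above), and an extra hold of 0.8 days per WU. The function name and its parameters are made up for illustration and are not part of any Folding@home software.

```python
def chain_completion_days(num_wus, base_days, extra_days):
    """Days until the last WU of a strictly sequential chain is returned.

    Each WU can only start once the previous one is back, so the
    per-WU turnaround times simply add up.
    """
    return num_wus * (base_days + extra_days)


# Assumed figures taken from the example in the text.
NUM_WUS = 30
BASE_DAYS = 1.0    # turnaround of a fast return (assumption)
EXTRA_DAYS = 0.8   # extra time each WU is held

fast = chain_completion_days(NUM_WUS, BASE_DAYS, 0.0)
slow = chain_completion_days(NUM_WUS, BASE_DAYS, EXTRA_DAYS)
delay = slow - fast

print(f"fast chain finishes after {fast:.1f} days")
print(f"slow chain finishes after {slow:.1f} days")
print(f"accumulated delay: {delay:.1f} days")
```

With these assumed figures the 30-WU chain finishes 24 days later, which is in line with the "as much as a month" estimate. Folding a second WU in parallel does not shorten this chain, because WU #1235 still cannot exist until #1234 comes back.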

For a better understanding of the complexity of queuing up work units that have to be processed sequentially, read about queueing theory or about the bin packing problem. Also read about "opportunity cost", which is the problem with holding a WU for an extra day.

Now some may say the example is oversimplified, and I would have to agree. And they might bring up the fact that there are actually thousands of available GREEN or RED WUs at any one time, which is also true. There is a large amount of parallelism, with the many Runs and Clones in each Generation, as one would expect from a project this size. But we also need to remember how the statistical analysis of the data is done. While the data points are modeled in 3 dimensions, I like to think of it in 2D, like the scatter pattern of a shotgun fired at a paper target. We can see how the pellets tend to concentrate in one area of the target, or another. Concentrated data points indicate the correctly folded protein configuration, with regard to temps, solution, etc., which are all variables in the Runs and Clones.

Now some will claim that more WUs are better, even if turned in slower. Like a wide river moving slowly towards the ocean. Kind of like that shotgun blast, only in slow motion. All the data arrives at one time, at a much later time, but the answer still gets there. We see the concentration of pellets, or data points.

However, if we fold fewer work units much faster, we get data points (pellets) hitting the target much sooner. And it is often possible to determine where the data points are concentrating before all of the work units arrive (before they all hit the paper target). We can see results sooner. And we can adjust the configuration of future work units to more accurately zero in on that concentration. Or we add certain Runs and Clones mid-project to better define the edges of the concentration. Or we can eliminate GREENs that seem to be missing the target completely, end that project # earlier, and send out helpful BLUEs instead. We didn't need to wait for the whole slow fleet of boats to arrive. We scuttled the less helpful boat halfway to the ocean. ;)
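The idea of spotting the concentration before every pellet has landed can be illustrated with a toy simulation. This is only a sketch of the shotgun analogy, not of how the project actually analyzes its data: it assumes the pellets scatter as a 2D Gaussian around a made-up centre, and all names and numbers (TRUE_CENTRE, the spread, the pellet counts) are hypothetical.

```python
import math
import random


def simulate_pellets(n, centre, spread, seed=0):
    """Return n (x, y) pellet hits scattered around centre (toy model)."""
    rng = random.Random(seed)
    return [(rng.gauss(centre[0], spread), rng.gauss(centre[1], spread))
            for _ in range(n)]


def centre_estimate(points):
    """Estimate where the hits concentrate as the mean of the points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def distance(a, b):
    """Straight-line distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


TRUE_CENTRE = (3.0, -2.0)  # hypothetical target centre
pellets = simulate_pellets(1000, TRUE_CENTRE, spread=1.0)

early = centre_estimate(pellets[:100])  # only the first 10% have arrived
full = centre_estimate(pellets)         # every pellet has arrived

print("error after 100 pellets: ", distance(early, TRUE_CENTRE))
print("error after 1000 pellets:", distance(full, TRUE_CENTRE))
```

In this toy model the estimate from the first 100 pellets already lands close to the centre, well inside the spread of a single pellet. That is the point of the analogy: results that come back quickly let the direction be judged, and corrected, long before the last result arrives.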

We don't have to fold the total sum of the WUs to get results. But to map the results, we need many generations of work units building on each other, i.e. many pellets hitting the target. And we can get answers without waiting for 1 million gumballs to be dispensed.