Optimization for a2 core v2.10


Introduction



The performance problems of the a2 core on dual-core machines are in the past! "tear", a user of the official folding forums, has discovered a variable linked to the MPI layer that appears to solve the a2 core's performance problems on both native and virtual dual-core machines.

Technically, this variable changes the core's memory-management behaviour for MPI processes. By default, the Folding@home core uses shared memory to exchange information between processes: a memory region that is not reserved to any single process, so every process can read from and write to it.

Once the variable is set, MPI falls back to a more conventional arrangement: shared memory is no longer used, each process is allocated its own reserved memory space, and the exchanges are performed over TCP/IP (via localhost in this scenario). MPI is then used the way it was originally intended: to exchange data between processes distributed over a network (i.e. a cluster).

Application of the variable



The application of this variable is very simple. Before you launch the client, you must type the following command, then run the client as normal:

Code :
export MPICH_NO_LOCAL=1


The command must be run in the same shell (terminal window) that will then launch the client, so that the client inherits the variable. If you use an automated start-up script, place the export line before the command that launches the client.
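As a concrete illustration, the export and the launch can be combined in one small start-up script; the client command shown in the comment (`./fah6 -smp`) is an assumption about your install and should be adjusted:

```shell
#!/bin/sh
# Export the variable first, so any process started afterwards
# (including the folding client and its MPI processes) inherits it.
export MPICH_NO_LOCAL=1

# ./fah6 -smp    # assumed client command -- adjust path and flags to your setup

# Quick check that a child process really sees the variable:
sh -c 'echo "child sees MPICH_NO_LOCAL=$MPICH_NO_LOCAL"'
```

Running the script prints `child sees MPICH_NO_LOCAL=1`, confirming that the variable would reach a client launched from the same script.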

Real-world results



The tests were conducted on the following setup:
  • VMware Workstation 6.5.2, 640 MB RAM, 2 cores
  • Windows XP SP3 host OS
  • Ubuntu Server 9.04 64-bit guest OS, based on kernel 2.6.28-15-server
  • Linux core a2 2.10, Project: 2662 (Run 1, Clone 98, Gen 56)
  • Q6600 processor at 3 GHz (333*9), with 800 MHz DDR2 RAM (CAS 4-4-4-10)


Results before applying the variable



Quotation:
[17:14:20] Completed 7500 out of 250000 steps (3%)
[17:27:42] Completed 10000 out of 250000 steps (4%)
[17:41:06] Completed 12500 out of 250000 steps (5%)
[17:55:31] Completed 15000 out of 250000 steps (6%)


Time per frame: 13:22, 13:24 and 14:25.




In this configuration, one process appears to have a core to itself (the P column in top shows the core a process last ran on), while the other three processes share the second core.
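The placement can also be checked without top: the PSR column of ps reports the core each process last ran on. A minimal sketch, assuming the core's processes match the name pattern `fahcore` (adjust the pattern to your setup):

```shell
# PSR = processor (core) each process last ran on.
# The 'fahcore' name pattern is an assumption -- adjust as needed.
ps -eo pid,psr,comm | awk 'NR==1 || tolower($0) ~ /fahcore/'
```

Before the fix you would expect three of the four MPI processes to share one PSR value; afterwards, two per core.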

Results after applying the variable



Quotation:
[19:41:23] Completed 35000 out of 250000 steps (14%)
[19:53:42] Completed 37500 out of 250000 steps (15%)
[20:05:59] Completed 40000 out of 250000 steps (16%)
[20:17:53] Completed 42500 out of 250000 steps (17%)


Time per frame: 12:19, 12:17 and 11:54.




Here, the processes are properly distributed: two per core.

Conclusion



In the end, the gain varies between 1 minute 3 seconds (7.9%) and 2 minutes 31 seconds (17.5%) per frame. The gain is significant, and the first user tests suggest that performance is back to the level of core v2.08. Given how simple it is to apply, why not take advantage?
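One caveat: the exported variable only lives as long as the shell it was set in. A minimal way to make it persistent, assuming a Bourne-compatible login shell whose ~/.profile is read at login:

```shell
# Append the export to ~/.profile (once) so every future login
# session -- and any client launched from one -- sets the variable.
grep -q 'MPICH_NO_LOCAL=1' ~/.profile 2>/dev/null || \
    echo 'export MPICH_NO_LOCAL=1' >> ~/.profile
```

The change takes effect in new login sessions; existing shells still need the manual export.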

Warning! This optimization has a downside: it makes the core sensitive to network events (a disconnected cable or Wi-Fi link, a DHCP lease renewal, etc.), which can crash the core and lose the work unit in progress. To avoid this problem on a machine connected via Wi-Fi, it is recommended to give the machine a fixed IP address.