Merge pull request #2430 from RosettaCommons/vmullig/multithreaded_peppredict
Add multithreading support to the simple_cycpep_predict application
The simple_cycpep_predict application is used to predict peptide structures. We currently run it on the Baker lab cluster ("the Digs"), the University of Washington "Hyak" cluster, the Argonne "Mira" Blue Gene/Q system, and the Amazon Web Services (AWS) cloud computing platform, as well as through the Berkeley Open Infrastructure for Network Computing (BOINC). In many of these cases, the number of parallel prediction jobs that we can run is limited by node memory rather than by CPUs. With MPI-based parallelism, each process must load its own copy of the Rosetta database, so gigabytes of duplicated information take up space in node memory.
This pull request aims to add multithreading support to the existing hierarchical MPI-based job distribution scheme used by simple_cycpep_predict. Note that this app does not use JD2 or JD3. Its job distributor, although much less general than JD2 or JD3, was intended to test some of the features that we hoped to implement in JD3. The immediate practical benefit is that this will allow us to use our available computing resources much more efficiently -- particularly the Digs, Mira, and AWS. This means that we can attempt more jobs, or larger jobs. (It's also a chance for me to discover mistakes to be avoided in the JD3 multithreaded/MPI hierarchical job distributor that will one day exist.)
Tasks:
- [x] Add suitable blocks bracketed by `#ifdef MULTI_THREADED`.
- [x] Add input controls for multithreading (option for number of threads per slave process). Note: for now, the emperor and master-layer processes will not launch worker threads.
- [x] Add final layer of thread-based job distribution. (Ensure that only one thread per process makes MPI calls.)
- [x] Ensure that each thread has its own random number generator, initialized with a unique seed that's properly incremented based on thread index, MPI rank, and job index.
- [x] Fix problem with stopping after nstruct.
- [x] Fix duplicated pose output bug. (Was caused by improperly setting index of jobs on slave nodes; not a thread-safety bug after all.)
- [x] Fix issue #2442 (multithreaded efficiency of rotamer library access), since benchmarks on the Digs indicate that this really is hindering efficiency here.
- [x] Make initialization of HbondTypeManager threadsafe.
- [x] Beauty.
- [x] Ensure that standard MPI-based, non-threaded job distribution is not broken in non-threaded builds.
- [x] Documentation.
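
To make the thread layer and the seeding task above concrete, here is a minimal sketch of the idea rather than the code in this pull request. All names (`derive_seed`, `run_jobs_on_process`, `JobResult`) and the exact seed formula are hypothetical; the only assumptions are that each MPI rank and thread index is known and that seeds are offset from a shared base seed. The calling thread stands in for the single thread per process that would talk to MPI, and helper threads are only launched inside the `#ifdef MULTI_THREADED` block.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Hypothetical seed scheme: every (job, rank, thread) triple maps to a distinct
// slot, so no two generators in a run share a seed.
inline std::uint64_t derive_seed(
	std::uint64_t base_seed, std::uint64_t job_index,
	int mpi_rank, int n_ranks, int thread_index, int threads_per_process
) {
	assert( mpi_rank >= 0 && mpi_rank < n_ranks );
	assert( thread_index >= 0 && thread_index < threads_per_process );
	std::uint64_t const slot =
		( job_index * static_cast< std::uint64_t >( n_ranks )
		+ static_cast< std::uint64_t >( mpi_rank ) )
		* static_cast< std::uint64_t >( threads_per_process )
		+ static_cast< std::uint64_t >( thread_index );
	return base_seed + slot;
}

// Record of which thread ran which job, and with what seed.
struct JobResult {
	std::uint64_t job_index;
	int thread_index;
	std::uint64_t seed;
};

// Run njobs jobs (numbered from first_job) on one process. In the real
// application, the calling thread would receive the job batch over MPI before
// this call and send results back after it; no helper thread touches MPI.
std::vector< JobResult > run_jobs_on_process(
	std::uint64_t first_job, std::uint64_t njobs, int threads_per_process,
	int mpi_rank, int n_ranks, std::uint64_t base_seed
) {
	assert( threads_per_process > 0 );
	std::atomic< std::uint64_t > next_job( 0 ); // shared counter: threads claim jobs from it
	std::vector< JobResult > results;
	std::mutex results_mutex; // guards the shared results vector

	auto worker = [&]( int thread_index ) {
		for ( ;; ) {
			std::uint64_t const local = next_job.fetch_add( 1 );
			if ( local >= njobs ) break; // nothing left to claim
			std::uint64_t const job_index = first_job + local;
			std::uint64_t const seed = derive_seed(
				base_seed, job_index, mpi_rank, n_ranks, thread_index, threads_per_process );
			std::mt19937_64 rng( seed ); // generator private to this job and thread
			(void) rng(); // stand-in for the actual structure prediction
			std::lock_guard< std::mutex > lock( results_mutex );
			results.push_back( { job_index, thread_index, seed } );
		}
	};

	std::vector< std::thread > helpers;
#ifdef MULTI_THREADED
	// Helper threads exist only in threaded builds.
	for ( int t = 1; t < threads_per_process; ++t ) helpers.emplace_back( worker, t );
#endif
	worker( 0 ); // the calling thread also does work
	for ( auto & h : helpers ) h.join();
	return results;
}
```

Built without `-DMULTI_THREADED`, the same function runs every job on the calling thread, which mirrors the requirement that non-threaded builds keep working. Because the seed depends on which thread happens to claim a job, a threaded run of this sketch is not bitwise reproducible from the job index alone.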
TODO:
- Test c++11thread compilation on Blue Gene/Q.
- Test and benchmark on the Digs.
- Test and benchmark on Cetus. Can we get up to 64 threads per node? (See also pull request #2469).
- Test and benchmark on Mira. (See also pull request #2469).
- Think about how to cover this with an integration test...
- Find other issues that are still hampering efficiency.
Note: this pull request is branched off of vmullig/threadsafe_tracers. Pull requests #2420 and #2416 must be merged before this one. (DONE.)