
Revision №57912

Branch: master (№57912)
Committed by: Vikram K. Mulligan
GitHub commit: de0d8deeed6e315b (pull request #463)
Commit date: 2015-06-10 10:36:53

Merge pull request #463 from RosettaCommons/vmullig/jd2_joblist_memory_issue

Trying to get the JobDistributor working on Blue Gene with ridiculously large nstruct values.

On Argonne's Blue Gene/Q system, RosettaScripts now works, but the app fails before any jobs are distributed if nstruct is set too high (~10,000,000 or higher). It seems that the GenericJobInputter creates a list of jobs at the start of its run, and this list fills up memory and causes Rosetta to crash. I'm going to try to fix this (though I might need some guidance).

Update: In the interests of getting this out to the largest number of users, I'm merging at this stage (the current features are stable and usable) and will complete the unchecked tasks later.

Tasks:
- [x] Implement a new LargeNstructJobInputter (based on the GenericJobInputter) that generates an initial list of jobs, then clears it and generates new lists as needed.
- [x] Implement a new JobsContainer class to manage the jobs_ list in the JobDistributor intelligently as jobs are added or removed (a sketch of the idea follows this list).
- [x] Let the JobsContainer work with the JobInputter to add or remove jobs.
- [x] Add an option setting the nstruct value above which the LargeNstructJobInputter is used.
- [x] Fix the cxx11thread build.
- [x] Ensure that all integration test changes are expected, or fix them if they're actual problems.
- [x] Address the const-access problem in all get_new_job_id() functions (the JobsContainer object needs non-const access, since it has to update the jobs list).
- [x] Check that the MPIWorkPoolJobDistributor is also properly marking jobs as deletable and works properly with the LargeNstructJobInputter.
- [x] Get the MPIWorkPoolJobDistributor to do the following:
  - [x] Mark jobs as deletable on the master node.
  - [x] Synchronize the marking of jobs as deletable across slave nodes.
  - [x] Only synchronize when strictly necessary (since the synchronization information takes significant bandwidth).
- [ ] Check that the MPIFileBufJobDistributor is also properly marking jobs as deletable and works properly with the LargeNstructJobInputter.
- [ ] Get the MPIFileBufJobDistributor to do the following:
  - [ ] Mark jobs as deletable on the master node.
  - [ ] Synchronize the marking of jobs as deletable across slave nodes.
- [x] Test this on Blue Gene before merging with master:
  - [x] Works with FASTA input, PDB output.
  - [ ] Works with FASTA input, silent file output.
  - [ ] Works with PDB input, PDB output.
  - [ ] Works with PDB input, silent file output.
  - [ ] Works with silent file input, PDB output.
  - [ ] Works with silent file input, silent file output.
- [ ] Add an integration test for the non-MPI job distributor with the LargeNstructJobInputter.
- [x] Figure out how the heck to cover the MPI JobDistributor changes with tests.

Also:
- [x] Disable the MPI timeout by default. As far as I can tell, this is only used by the MPIFileBufJobDistributor, and it's poorly thought out: it brings down the run if any job takes X times the average length of all jobs completed so far (where the default value of X is 3). That is, if job 2 takes more than 3 times as long as job 1 (e.g. if job 1 fails filters but job 2 passes), the whole thing comes tumbling down. This creates unexpected failures for users, and I think it should only be on if a user explicitly turns it on. (A sketch of this heuristic appears at the end of this entry.)
- [x] Updated one of Lei's pilot apps so that Lei, Una, and I can use it on Blue Gene.

Tasks above that are left unchecked will be completed in a future merge.
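To make the memory-bounding idea above concrete, here is a minimal, hypothetical C++ sketch. It is not the actual jd2 code; the class and method names (JobsContainerSketch, LargeNstructJobInputterSketch, fill_more(), get_job()) are invented for illustration. It shows a container that holds only a window of Job objects, asks an inputter for the next batch on demand, and frees completed jobs once they are marked deletable:

```cpp
// Hypothetical, simplified sketch -- NOT the actual Rosetta jd2 classes.
// Instead of materializing all nstruct jobs up front, the container keeps
// a sliding window of jobs, refills it from the inputter as needed, and
// erases completed jobs so memory stays bounded regardless of nstruct.

#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>

// Stand-in for a jd2 Job; the real class carries much more state.
struct Job {
    std::size_t nstruct_index;
    explicit Job(std::size_t i) : nstruct_index(i) {}
};
using JobOP = std::shared_ptr<Job>;

// Stand-in for a JobInputter that can append a batch of jobs on demand.
class LargeNstructJobInputterSketch {
public:
    LargeNstructJobInputterSketch(std::size_t total, std::size_t batch)
        : total_(total), batch_(batch), next_(1) {}

    // Append up to batch_ more jobs; return false once all jobs are made.
    bool fill_more(std::map<std::size_t, JobOP>& jobs) {
        if (next_ > total_) return false;
        std::size_t const end = std::min(total_, next_ + batch_ - 1);
        for (std::size_t i = next_; i <= end; ++i) {
            jobs[i] = std::make_shared<Job>(i);
        }
        next_ = end + 1;
        return true;
    }

private:
    std::size_t total_, batch_, next_;
};

// Stand-in for a JobsContainer. Assumes each job index is requested
// before it is marked deletable, as a job distributor would do.
class JobsContainerSketch {
public:
    explicit JobsContainerSketch(LargeNstructJobInputterSketch& inputter)
        : inputter_(inputter) { inputter_.fill_more(jobs_); }

    // Fetch a job, refilling the window from the inputter if needed.
    JobOP get_job(std::size_t index) {
        while (jobs_.find(index) == jobs_.end()) {
            if (!inputter_.fill_more(jobs_)) return nullptr;  // past the end
        }
        return jobs_[index];
    }

    // Completed jobs are erased, releasing their memory.
    void mark_job_as_deletable(std::size_t index) { jobs_.erase(index); }

private:
    LargeNstructJobInputterSketch& inputter_;
    std::map<std::size_t, JobOP> jobs_;  // only the current window
};
```

Note how the const-access problem mentioned in the task list shows up even in this toy version: get_job() must be non-const, because fetching a job can mutate the container by triggering a refill.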
This merge does not leave anything broken that wasn't broken before; it does fix job distribution with ridiculously high nstruct values in the special case of FASTA input and PDB output.
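For reference, the timeout heuristic that this merge disables by default is simple to state: a job is declared hung once its elapsed time exceeds X times the mean runtime of all jobs completed so far, with X defaulting to 3. A hypothetical sketch of that check (not the actual MPIFileBufJobDistributor code):

```cpp
// Hypothetical sketch of the timeout heuristic described above -- NOT the
// actual MPIFileBufJobDistributor code. With the default factor X = 3, a
// running job that exceeds 3x the mean runtime of completed jobs is
// treated as hung and takes the run down with it.

#include <cstddef>

class TimeoutCheckSketch {
public:
    explicit TimeoutCheckSketch(double factor = 3.0, bool enabled = false)
        : factor_(factor), enabled_(enabled) {}  // off unless requested

    // Record the wall time of a job that finished.
    void record_completed(double seconds) {
        total_seconds_ += seconds;
        ++n_completed_;
    }

    // True if a still-running job has exceeded factor_ * mean runtime.
    bool timed_out(double elapsed_seconds) const {
        if (!enabled_ || n_completed_ == 0) return false;
        double const mean = total_seconds_ / static_cast<double>(n_completed_);
        return elapsed_seconds > factor_ * mean;
    }

private:
    double factor_;
    bool enabled_;
    double total_seconds_ = 0.0;
    std::size_t n_completed_ = 0;
};
```

With heterogeneous job lengths (a fast filter failure followed by a full-length trajectory), the mean is dragged down and the check fires spuriously, which is exactly the failure mode described above and the reason for flipping the default to off.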

...