Revisions №60422

branch: master 「№60422」
Committed by: Andrew Leaver-Fay
GitHub commit link: 「c5e5ee8f7054a89b」
Difference from previous tested commit:  code diff
Commit date: 2018-09-25 09:39:24
linux.clang linux.gcc linux.srlz mac.clang
linux.PyRosetta.unit linux.gcc.python36.PyRosetta4.unit mac.PyRosetta.unit build.clean.debug cppcheck mysql postgres linux.zeromq.debug mpi mpi.serialization linux.icc.build.debug OpenCL build.header build.levels ninja graphics static linux.ui mac.ui beautification serialization integration.mpi integration.release_debug integration performance profile release.source linux.clang.score linux.gcc.score mac.clang.score linux.scripts.pyrosetta scripts.rosetta.parse scripts.rosetta.validate scripts.rosetta.verify linux.clang.unit.release linux.gcc.unit.release

Merge PR #3492 (aleaverfay/jd3_fix_mpi_hang_in_distributed_output2)

Fixing a race condition in the way JD0 sent spin-down signals to remote nodes at the very end of a trajectory. Previously, JD0 would tell any node, including the archive nodes, to spin down as soon as that node had finished outputting the last result it needed to (i.e., once there were no more results for it to output). However, if an archive node spins down, it can no longer serve up the results that the worker/output nodes might still be requesting. If a worker/output node requested a job from the archive after it had spun down, it would simply hang waiting for a reply. This is a particularly poor moment in a job for results to get lost: they've already been generated, and all the computational cost of creating them has been paid!

The solution is to delay spinning down the archive nodes until after all of the worker/output nodes have reported that they have completed their work. Super simple.

There is a new unit test that a) passes with the new code, and b) fails with the old code. In particular, the unit test ensures that the spin-down signal to the archive is sent after the spin-down signal to the worker/outputter. To verify for yourself that the unit test fails in the old code, you can cherry-pick 674b9a66e on top of dc1c89dea8. (To run the JD3WorkPoolJobDistributor unit tests, you need to use the debug + serialization build.)

I should definitely have looked into this more closely months ago when @thieker brought it to my attention!
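
The spin-down ordering described in the commit message above can be illustrated with a small toy model. This is only a sketch of the idea, not the JD3 code: the class `SpinDownCoordinator`, the method `node_finished_output`, and the integer ranks are all hypothetical names invented for illustration, and no MPI is involved. The coordinator stands in for JD0 and records the order in which it would send spin-down signals.

```python
class SpinDownCoordinator:
    """Toy stand-in for the master node (JD0); hypothetical, not Rosetta's API."""

    def __init__(self, workers, archives):
        self.archives = sorted(archives)      # ranks of archive nodes
        self.pending_workers = set(workers)   # worker/output nodes still busy
        self.sent = []                        # spin-down signals, in send order

    def node_finished_output(self, rank):
        """Record that `rank` has no more results to output."""
        if rank not in self.pending_workers:
            # Archive node (or a repeat report): do NOT spin it down yet,
            # because workers may still request results from it.
            return
        self.pending_workers.remove(rank)
        self.sent.append(rank)                # workers can spin down right away
        if not self.pending_workers:
            # Every worker/output node has reported completion, so no one can
            # still be waiting on an archive; now it is safe to spin them down.
            self.sent.extend(self.archives)


# Rank 1 is an archive; ranks 2 and 3 are worker/output nodes (assumed layout).
coordinator = SpinDownCoordinator(workers=[2, 3], archives=[1])
for rank in (1, 3, 2):                        # the archive reports "done" first
    coordinator.node_finished_output(rank)
print(coordinator.sent)
```

Even though the archive reports first in this run, its spin-down signal is recorded last, which is the ordering property the new unit test is described as checking.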

Test: mac.clang.integration

Failed sub-tests:
glycan_tree_relax hotspot_hashing
Test: linux.clang.performance

Failed sub-tests:
Test: mac.clang.unit

Failed sub-tests: