Revisions №60422

branch: master 「№60422」
Committed by: Andrew Leaver-Fay
GitHub commit link: 「c5e5ee8f7054a89b」
Difference from previous tested commit: code diff
Commit date: 2018-09-25 09:39:24

Merge PR #3492 (aleaverfay/jd3_fix_mpi_hang_in_distributed_output2)

Fixing a race condition in the way JD0 sent spin-down signals to remote nodes at the very end of a trajectory. Previously, JD0 would tell any node, including the archive nodes, to spin down as soon as that node had finished outputting the last result it needed to (i.e., once there were no more results for it to output). However, if an archive node spins down, it can no longer serve up the results that the worker/output nodes might still request. If a worker/output node requested a job from an archive node after that node had spun down, it would simply hang waiting for a reply. This is a particularly poor moment in a job for results to get lost: they've already been generated, and all the computational cost of creating them has been paid!

The solution is to delay spinning down the archive nodes until after all of the worker/output nodes have reported that they have completed their work. Super simple.

There is a new unit test that a) passes with the new code, and b) fails with the old code. In particular, the unit test ensures that the spin-down signal to the archive is sent after the spin-down signal to the worker/outputter. To verify for yourself that the unit test fails with the old code, you can cherry-pick 674b9a66e on top of dc1c89dea8. (To run the JD3WorkPoolJobDistributor unit tests, you need to use the debug + serialization build.)

I should definitely have looked into this more closely months ago when @thieker brought it to my attention!
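The ordering rule described in the commit message can be illustrated with a small, self-contained simulation. This is only a sketch of the idea, not the JD3 implementation: the function name `spin_down_order`, its parameters, and the use of integer node ranks are all assumptions made for illustration, and no MPI calls are involved.

```python
from typing import Iterable, List, Set


def spin_down_order(completion_order: Iterable[int],
                    archive_nodes: Set[int]) -> List[int]:
    """Return the order in which spin-down signals would be sent.

    Hypothetical model (not the Rosetta API):
      completion_order -- node ranks, in the order each node finishes
                          outputting its last result
      archive_nodes    -- ranks that also serve archived results to others
    """
    completion_order = list(completion_order)
    # Worker/output nodes that have not yet reported completion.
    workers_remaining = {n for n in completion_order if n not in archive_nodes}
    deferred_archives: List[int] = []  # archives waiting for their signal
    signals: List[int] = []            # spin-down signals, in send order

    for node in completion_order:
        if node in archive_nodes:
            # An archive may still be asked for results, so hold its signal.
            deferred_archives.append(node)
        else:
            # A worker/output node can spin down as soon as it is done.
            signals.append(node)
            workers_remaining.discard(node)

        if not workers_remaining:
            # Every worker has reported in: archives can now spin down safely.
            signals.extend(deferred_archives)
            deferred_archives.clear()

    return signals


# Archive node 1 finishes its own output first, but is signalled last.
print(spin_down_order([1, 2, 3], archive_nodes={1}))
```

In this model the old behavior would correspond to signalling each node in plain completion order, which lets an archive spin down while a worker may still need it.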

Summary

...