
Revisions №59889

branch: master 「№59889」
Committed by: Andrew Leaver-Fay
GitHub commit link: 「c8bbc8d1fdb544f2」 「№2709」
Commit date: 2017-11-29 15:03:19

Merge pull request #2709 from RosettaCommons/aleaverfay/jd3_fix_mpi_hang_in_distributed_output

Fix hang in JD3+MPI distributed output

I was treating MPI like an asynchronous communication system, but in fact, MPI_Send will block until the buffer that it's using can be reused. What happened was, node0 would send out messages to all the archives saying "Hey, here's an output that you should write to disk" and then after they were all sent, then it would return to its listening loop. The archives would take these messages one at a time, write an output, and then send a "hey, boss, just finished that output you told me to do" message back to node0 before going back and seeing if node0 had any more output messages that it had sent. What would happen is that the MPI_Send calls would buffer the messages to a certain point, and then it would block until the messages from the remote node were processed -- which they couldn't be, because the remote nodes would also be blocking waiting for node0 to read their "hey boss" messages. Thus, deadlock.

New plan.
* Node0 sends output work to the archives one at a time
* Node0 does not output itself
* Worker nodes can be given output tasks (in addition to their regular output)
* Worker nodes can relieve the archive nodes of some of their output tasks (so the archive nodes don't block computation, because the worker nodes often need to talk to the archive nodes!)

Users can now say "I want X% of my nodes to perform output" so that by the time you're up at 100%, then all nodes except the head node are outputting and output is basically the same in JD3 as it was in JD2. Except in JD2, there was no option to have only 10% of the nodes perform output.

@JackMaguire @jadolfbr
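The hang described in the commit message can be reproduced in miniature without MPI. The sketch below is not the Rosetta code: it stands in for MPI_Send with a blocking `put` on a bounded `queue.Queue`, and every name in it (`BUFFER_SLOTS`, `STUCK_AFTER`, `archive`, `run`) is hypothetical. It also assumes that "one at a time" means node0 waits for an archive's "finished" message before sending it the next job, which is one reading of the new plan rather than something the commit message spells out.

```python
import queue
import threading

BUFFER_SLOTS = 2   # assumption: stands in for MPI's finite internal buffering
STUCK_AFTER = 0.5  # seconds a send/receive may block before we call it a deadlock


def archive(inbox, outbox, stop):
    """Archive loop: take one job, 'write' it, then report back to node0."""
    while not stop.is_set():
        try:
            job = inbox.get(timeout=0.05)
        except queue.Empty:
            continue
        if job is None:  # shutdown sentinel from node0
            return
        # Blocking "hey boss, finished" send, retried in short slices only
        # so that this demo thread can exit once stop is set.
        while not stop.is_set():
            try:
                outbox.put(job, timeout=0.05)
                break
            except queue.Full:
                pass


def run(n_jobs, one_at_a_time):
    """Play node0 against a single archive; return (status, acks_received)."""
    to_archive = queue.Queue(maxsize=BUFFER_SLOTS)
    to_node0 = queue.Queue(maxsize=BUFFER_SLOTS)
    stop = threading.Event()
    worker = threading.Thread(target=archive, args=(to_archive, to_node0, stop))
    worker.start()
    finished = 0
    try:
        for job in range(n_jobs):
            to_archive.put(job, timeout=STUCK_AFTER)  # blocking send
            if one_at_a_time:
                # New plan (as read here): wait for the ack before sending more.
                to_node0.get(timeout=STUCK_AFTER)
                finished += 1
        if not one_at_a_time:
            # Old plan: node0 returns to its listening loop only after
            # every output message has been sent.
            for _ in range(n_jobs):
                to_node0.get(timeout=STUCK_AFTER)
                finished += 1
        to_archive.put(None, timeout=STUCK_AFTER)
        return "completed", finished
    except (queue.Full, queue.Empty):
        return "deadlock", finished
    finally:
        stop.set()
        worker.join()


if __name__ == "__main__":
    print(run(10, one_at_a_time=False))
    print(run(10, one_at_a_time=True))
```

With two buffer slots in each direction, the send-everything-first variant can have at most five jobs in flight (two queued for the archive, one in the archive's hands, two acks queued for node0), so ten jobs leave both sides blocked on a send, which is the situation the commit message describes. Sending one job per acknowledgement never fills either buffer.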

Vikram K. Mulligan 6 years
Andrew -- YOU failed to beautify? :D
Andrew Leaver-Fay 6 years
I'm perplexed -- I beautified this branch more than a few times.
Andrew Leaver-Fay 6 years
I am pretty sure we cannot beautify on the testing server with the new code-reviewed+PR-only commits to master restriction we just enabled. I'll open a PR.
...