
Revision №60733

branch: master (№60733)
Committed by: Andrew Leaver-Fay
GitHub commit: d73d4f6c6c7f5f09
Difference from previous tested commit: code diff
Commit date: 2019-05-08 08:44:05

JD3 Checkpointing (#3939)

Checkpoint progress in JD3 when using MPI. Currently works for multistage_rosetta_scripts.

Checkpointing is a technique where the state of the system at a particular point in time is saved in a stable way (e.g. on disk), so that if the job dies or is killed, the work done up until the point of the checkpoint is not lost.

Checkpointing in JD3 is managed by the JobDistributor. The other classes involved in checkpointing (including the JobQueen) do not need to think about how checkpointing will work, but merely about how to serialize and deserialize their data; a minimal sketch of what that looks like is given below. (There is a notable exception here, discussed further down.)

The user tells the job distributor to checkpoint every so many minutes (flag `-jd3::checkpoint_period <minutes>`), e.g. every 30 minutes. The job distributor on node 0 looks at the (wall) clock each time it can* and, once 30 minutes have passed since the last checkpoint was made, it asks the job distributors on the archive nodes (if any) to begin checkpointing and serializes its own data to disk. If the job is killed before it completes, the user can restart it (taking care to use the same command-line flags as before) with the additional flag `-jd3::restore_from_checkpoint`. The JobDistributor on node 0 then deserializes the data in the checkpoint file and resumes execution of the jobs from the point where the checkpoint was taken. Some work will have been lost: the work that took place between the last checkpoint and the time the job died.

(* The JobDistributor on node 0 spends most of its time in a while loop where it waits to hear from the other MPI processes and then responds to their requests. At the top of this loop is a "receive MPI integer message from anyone" call, which blocks until some node sends node 0 an integer. The JobDistributor on node 0 might therefore wait inside this blocking receive call past the moment when the wall clock says a new checkpoint is due. It has to wait until someone sends it a message and then process that message; only after the message has been processed, but before it re-invokes the blocking receive, does it look at the clock and checkpoint if necessary. For this reason, the JobDistributor will not checkpoint at the exact moment checkpointing becomes possible. If you have a job that will be killed at exactly one hour, for example, you should not set the checkpoint interval to 59 minutes: the JobDistributor might never checkpoint before the job is killed. A sketch of this loop is also given below.)

Not all MPI nodes serialize their data: only the master node and the archive nodes do. The worker nodes do not need to store their data; they are presumed to have no significant state. One advantage of this system is that you can restore from a checkpoint with a different number of worker nodes. (You do need the same number of archive nodes as the original job.)

The only JobQueen to be checkpointed is the JobQueen on the master node (node 0; thus we call this JobQueen JQ0). The JobDistributor makes this pledge: if the JobQueen delivered messages to the JobDistributor, then the JobDistributor is responsible for ensuring those messages are acted on. If the JobQueen delivers LarvalJobs to the JobDistributor, the JobDistributor ensures those LarvalJobs get run. If the JobQueen delivers JobOutputSpecifications to the JobDistributor, the JobDistributor ensures that those outputs get written.
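Here is a minimal sketch of the "merely serialize and deserialize" part, assuming a cereal-style archive interface (the convention Rosetta's serialization code follows). The class and member names are invented for illustration and are not actual JD3 types; the point is only that each participating class describes how to write and read its own members, while the JobDistributor decides when that happens.

```c++
// Illustrative only: invented class and members, cereal-style save/load.
#include <string>
#include <vector>
#include <cereal/types/string.hpp>
#include <cereal/types/vector.hpp>

class ExampleCheckpointableState {
public:
	template < class Archive >
	void save( Archive & arc ) const {
		// Write out everything needed to resume: counters, bookkeeping, etc.
		arc( n_jobs_completed_, finished_job_tags_ );
	}

	template < class Archive >
	void load( Archive & arc ) {
		// Read the members back in the same order they were written.
		arc( n_jobs_completed_, finished_job_tags_ );
	}

private:
	int n_jobs_completed_ = 0;
	std::vector< std::string > finished_job_tags_;
};
```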
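And here is a rough sketch of the node-0 loop described in the footnote, showing why a checkpoint is written only after a message has been processed rather than at the exact moment the period elapses. Every function here is a hypothetical placeholder, not the real JobDistributor API.

```c++
// Rough sketch of the node-0 event loop.  The receive/process/checkpoint
// functions are placeholders standing in for the real MPI calls and
// JobDistributor logic, included only so the example is self-contained.
#include <chrono>

class Node0LoopSketch {
public:
	explicit Node0LoopSketch( std::chrono::minutes checkpoint_period ) :
		checkpoint_period_( checkpoint_period ),
		last_checkpoint_( std::chrono::steady_clock::now() )
	{}

	void go() {
		while ( ! all_jobs_complete_ ) {
			// Blocks until some node sends node 0 an integer; the wall clock
			// may pass the checkpoint deadline while we sit here waiting.
			int const message = receive_integer_from_anyone();
			process_message( message );

			// The clock is consulted only after the message has been handled,
			// so checkpoints happen at or after the requested period, never
			// at the exact moment the period first elapses.
			auto const now = std::chrono::steady_clock::now();
			if ( now - last_checkpoint_ >= checkpoint_period_ ) {
				checkpoint(); // serialize node-0 state; tell the archive nodes to do the same
				last_checkpoint_ = now;
			}
		}
	}

private:
	// Placeholders for the blocking MPI receive and message handling.
	int  receive_integer_from_anyone() { return 0; }
	void process_message( int /*message*/ ) { all_jobs_complete_ = true; }
	void checkpoint() {}

	bool all_jobs_complete_ = false;
	std::chrono::minutes checkpoint_period_;
	std::chrono::steady_clock::time_point last_checkpoint_;
};
```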
The JobDistributor does not guarantee, however, that the JobQueen's discard messages are delivered. The idea is this: the discard messages exist to remove lazily-loaded data from memory after that data is no longer needed. If the original process has died and the process is re-launched, then the lazily-loaded data will not be in memory when the job starts again. Remember, the JQs on the worker nodes are not checkpointed.

The exception to the idea that the JobQueen does not need to think about how checkpointing works is this: if she has any data that cannot or should not be serialized, then the JobQueen should gracefully handle the case where that data is unexpectedly absent. E.g., let's say the JobQueen has a pointer to a big blob of data, BBOD_OP. If the JQ doesn't serialize that data, then during the restore-from-checkpoint process, that BBOD_OP will not get set. In that case, the JobQueen should make sure to load that data before trying to use it. In this way, the JQ should be minimally aware of how checkpointing might work.

What are examples of this kind of data? If the JobQueen holds a RosettaScriptsParserOP, for example: that class is currently not serializable. (Let's assume that it could not be made serializable, even if that might not be true of this particular class.) In this case, the RosettaScriptsParser serves the purpose of storing the libxml2 objects defining the schema so that the schema does not need to be regenerated repeatedly (since that step can take ~10 seconds). One option for the `save`/`load` methods of the JobQueen would be to 1. not archive the RosettaScriptsParser in its `save` method, and 2. create a new (empty) RosettaScriptsParser in its `load` method. This would guarantee that the RSPOP is never null. Alternatively, step 1 could remain the same, but for step 2 the JQ could leave its RSPOP null; then the code that intends to use the RSPOP would have to surround its usage with `if ( rspop_ == nullptr ) { rspop_ = make_shared< RSP >(); }`. Probably the first option is better! (Both options are sketched after the list below.)

Some points about restoring from a checkpoint:

* The number of archive nodes must be the same, but the total number of nodes can be different.
* It is possible to enable `-jd3::archive_on_disk` in the restored job even if that flag was not present on the command line for the first job (which might be useful if your job died the first time because it ran out of memory on the archive nodes!).
* If you are writing PDBs to output silent files, then jobs that were output after the checkpoint was created might be written a second time, to the same or a different silent file, when restoring from that checkpoint.
* You cannot add a new option to Rosetta, recompile, and then try to restore a job from a previously generated checkpoint. This is due to the way the OptionCollection is created: each OptionKey is assigned an integer from a static counter at program load. If the integer assigned to a particular option key is different when trying to restore from a checkpoint, the OptionCollection will misbehave. (I can imagine a scenario in which an OptionCollection serializes itself as a string resembling the command line that would generate the state of that OptionCollection, and then deserializes itself by re-interpreting that string: this would fix this limitation. Studying the OptionCollection more closely, however, makes that idea seem unrealistic: the OptionCollection and OptionKey system is configured in a way that makes an option-key-name-string-based serialization strategy impossible.) A toy illustration of the static-counter problem follows this list.
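To make the two options above concrete, here is a minimal sketch. A local stand-in type replaces the real RosettaScriptsParser so the example is self-contained; only the save/load pattern is meant to carry over, and the class and member names are invented.

```c++
#include <memory>

struct RosettaScriptsParser {};  // stand-in for the real, non-serializable class
using RosettaScriptsParserOP = std::shared_ptr< RosettaScriptsParser >;

class ExampleJobQueen {
public:
	// Option 1: omit the parser from save() and recreate an empty one in
	// load(), so rspop_ is never null after restoring from a checkpoint.
	template < class Archive >
	void save( Archive & arc ) const {
		arc( other_serializable_state_ );  // rspop_ deliberately not archived
	}

	template < class Archive >
	void load( Archive & arc ) {
		arc( other_serializable_state_ );
		rspop_ = std::make_shared< RosettaScriptsParser >();
	}

	// Option 2: leave rspop_ null after load() and guard every use site.
	void use_parser() {
		if ( rspop_ == nullptr ) {
			rspop_ = std::make_shared< RosettaScriptsParser >();
		}
		// ... use *rspop_ ...
	}

private:
	RosettaScriptsParserOP rspop_;
	int other_serializable_state_ = 0;
};
```

With option 1, no use site ever needs the null check, which is why it is probably the better choice.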
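Finally, a toy cartoon of the static-counter behavior described in the last bullet. This is not Rosetta's actual OptionKey implementation, and the second option name in the comments is invented purely for the illustration.

```c++
// Toy illustration of why adding a new option and recompiling invalidates old
// checkpoints: each key draws its integer id from a static counter at load
// time, so inserting a new key shifts the ids of every key registered after
// it, and data serialized by id no longer lines up with the same-named key.
#include <iostream>
#include <string>

struct ToyOptionKey {
	explicit ToyOptionKey( std::string name ) :
		name_( std::move( name ) ), id_( next_id()++ ) {}
	std::string name_;
	int id_;
	static int & next_id() { static int counter = 0; return counter; }
};

// Build A registers:
//   jd3::checkpoint_period  -> id 0
//   jd3::archive_on_disk    -> id 1
// Build B adds a (hypothetical) new key in between:
//   jd3::checkpoint_period  -> id 0
//   jd3::brand_new_option   -> id 1
//   jd3::archive_on_disk    -> id 2
// A checkpoint written by build A that stores values by integer id would,
// when read by build B, attribute archive_on_disk's value to the new key.
int main() {
	ToyOptionKey a( "jd3::checkpoint_period" );
	ToyOptionKey b( "jd3::archive_on_disk" );
	std::cout << a.name_ << " -> " << a.id_ << "\n"
	          << b.name_ << " -> " << b.id_ << "\n";
}
```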

...