Improving resilience of PyRosettaCluster upon Dask worker preemption (#630)
This PR adds two new features to `PyRosettaCluster` to improve the
resilience of a simulation when using preemptible compute resources
(e.g., cloud spot instances or cluster backfill queues):
**1. Task replication on other Dask workers**: Task replication allows
scattered task input arguments to be recovered if a Dask worker
executing a task is preempted midway through a protocol. The number of
task retries is controlled by the Dask configuration parameter
`distributed.scheduler.allowed-failures`, which may be manually
configured prior to the simulation. Task replication also requires that
Dask's Active Memory Manager first be disabled, since task replicas
consume additional memory on each Dask worker. This PR adds support for
the
[Client.replicate](https://distributed.dask.org/en/stable/api.html#distributed.Client.replicate)
method (controlled via the `max_task_replicas` keyword argument); when
enabled, the Dask scheduler can retry a task if its input arguments are
still recoverable from a replica.
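For example, both Dask settings can be applied before launching the simulation via a user-level configuration file (the key paths below follow the Dask distributed configuration schema; the `allowed-failures` value is only illustrative):

```yaml
# ~/.config/dask/distributed.yaml (or pass the same keys to dask.config.set)
distributed:
  scheduler:
    allowed-failures: 6        # retries permitted per task before it is marked failed
    active-memory-manager:
      start: false             # keep the AMM from dropping the extra task replicas
```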
**2. Client-side task registry**: As a fallback, if scattered task
input arguments cannot be recovered after Dask worker preemption (e.g.,
because the task replication factor was set too low), this PR adds
lightweight infrastructure for a durable task registry on the
client-side head node (controlled via the `task_registry` keyword
argument). If the registry is enabled and the Dask scheduler cannot
resubmit the task (detected by catching the raised
`concurrent.futures.CancelledError` exception), `PyRosettaCluster`
automatically resubmits the task using the input arguments cached in the
task registry. This PR supports constructing the task registry in memory
(`task_registry="memory"`) for relatively few tasks, or on disk
(`task_registry="disk"`) for production simulations.
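The fallback pattern can be sketched in a few lines of plain Python. This is a hypothetical illustration using `concurrent.futures`, not the actual `PyRosettaCluster` implementation; `run_with_registry` and its signature are invented for this example:

```python
# Minimal sketch of a client-side task registry fallback: cache each task's
# input arguments before submission, and resubmit from the cache if the
# scheduler cancels the future (analogous to catching CancelledError after
# worker preemption).
from concurrent.futures import CancelledError, Executor


def run_with_registry(executor: Executor, func, tasks: dict) -> dict:
    registry = {}  # task_id -> input kwargs; the in-memory "task registry"
    futures = {}
    for task_id, kwargs in tasks.items():
        registry[task_id] = kwargs  # cache inputs before submission
        futures[task_id] = executor.submit(func, **kwargs)
    results = {}
    for task_id, future in futures.items():
        try:
            results[task_id] = future.result()
        except CancelledError:
            # The scheduler could not recover the task; fall back to the
            # client-side cache and resubmit once.
            retry = executor.submit(func, **registry[task_id])
            results[task_id] = retry.result()
    return results
```

A disk-backed variant would persist `registry` (e.g., to a key-value store on the head node) so the cache survives a client restart as well.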
Both of these features require that user-defined PyRosetta protocols be
free of side effects upon preemption, so that tasks can be restarted
without producing inconsistent external state; therefore, both features
are disabled by default and can be configured based on Dask worker
memory limits and the expected preemption frequency of the compute
resources. Additionally, this PR clarifies the documentation of the
`retries` keyword argument of the `PyRosettaCluster.distribute` method
and makes other minor documentation improvements.
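As one way to keep a protocol restart-safe (a hypothetical illustration, not part of this PR), any external output can be written to a temporary file and atomically renamed on success, so a preempted run never leaves a partially written file behind:

```python
import os
import tempfile


def write_result_atomically(path: str, data: str) -> None:
    """Write data to path so that readers never observe a partial file."""
    # Create the temp file in the destination directory so os.replace stays
    # on one filesystem (rename is atomic on POSIX in that case).
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
        os.replace(tmp_path, path)  # atomic: either old state or full new file
    except BaseException:
        os.unlink(tmp_path)  # on error or preemption, discard the partial file
        raise
```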