Updating documentation and typing for PyRosettaCluster (#646)
This PR adds several improvements to the _PyRosettaCluster_ framework.
### Major changes (maintaining runtime functionality)
1. Improve formatting and clarity of docstrings for Sphinx-based
PyRosetta documentation.
2. Consolidate typing aliases into a single
`pyrosetta.distributed.cluster.type_defs` module for easier
maintainability.
3. Add typing to attributes in `attrs` classes.
4. Clean up imports and typing (preserving Python-3.8 compatibility).
5. Add `pickle` and `cloudpickle` warnings to relevant docstrings.
6. Clarify and format logging and error messages.
7. Add `PackedPoseHasher` and `secure_read_pickle` to the top-level
`pyrosetta.distributed.cluster` namespace for easier usability.
### Minor changes (updating runtime functionality)
8. Update usage of `dataclasses` module to use the `attrs` package
instead. Also package Dask task arguments in a new `ExtraArgs` `attrs`
class (with slots and typing) instead of a simple dictionary.
9. Slightly loosen a validation: instead of testing whether two
Base64-encoded pickled `Pose` objects are identical, test whether the
scientific state of the two `Pose` objects is identical.
10. Handle edge case of single dictionary outputs from PyRosetta
protocols decorated with the `reserve_scores` decorator.
11. Handle edge case of unordered iterables (i.e., `set` objects) and
raise exceptions: input PyRosetta protocols, and outputs produced by
PyRosetta protocols, must be ordered for reproducibility purposes.
12. Fix exception handling to catch `Exception` rather than `BaseException`.
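As an illustration of item 11, the ordering check could look roughly like the sketch below. The helper name `as_ordered_list` and the exact error message are hypothetical and are not taken from the PyRosettaCluster source; the sketch only shows the idea of rejecting `set`-like inputs while accepting lists and tuples.

```python
from collections.abc import Set
from typing import Any, List


def as_ordered_list(objs: Any) -> List[Any]:
    """Return `objs` as a list, rejecting unordered collections (hypothetical helper)."""
    # `collections.abc.Set` matches `set` and `frozenset`, whose iteration order
    # can differ between interpreter runs (e.g., under string hash randomization).
    if isinstance(objs, Set):
        raise TypeError(
            f"Unordered iterable of type {type(objs).__name__!r} is not allowed; "
            "use a list or tuple so that results are reproducible."
        )
    # Lists and tuples already have a well-defined order.
    if isinstance(objs, (list, tuple)):
        return list(objs)
    # Any other single object is wrapped in a one-element list.
    return [objs]
```

Raising early, rather than silently sorting the set, keeps the caller in control of the order that ends up in the reproducibility record.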
Update SecureUnpickler disallowed packages (#611)
This PR updates the `pyrosetta.secure_unpickle.SecureUnpickler` class to
block some additional callable targets reachable via the `pickle` module,
including the `numpy.load` and `pandas.read_pickle` functions. Unit tests
added herein demonstrate that safe `numpy`/`pandas` callables like
`numpy.array` and `pandas.DataFrame` are still deserializable.
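The general technique of blocking specific globals during unpickling can be sketched with the standard library by overriding `pickle.Unpickler.find_class`. This is not the `SecureUnpickler` implementation: the class name `BlockingUnpickler`, the helper `restricted_loads`, and the contents of `BLOCKED_GLOBALS` are assumptions chosen for illustration, and the example uses only standard-library targets rather than `numpy` or `pandas`.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical block list for illustration only; not the actual
# SecureUnpickler configuration.
BLOCKED_GLOBALS = {
    ("os", "system"),
    ("builtins", "eval"),
    ("builtins", "exec"),
}


class BlockingUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve globals on the block list."""

    def find_class(self, module, name):
        # `find_class` is called whenever the pickle stream references a global.
        if (module, name) in BLOCKED_GLOBALS:
            raise pickle.UnpicklingError(f"Blocked global: {module}.{name}")
        return super().find_class(module, name)


def restricted_loads(data: bytes):
    """Deserialize `data` using the blocking unpickler."""
    return BlockingUnpickler(io.BytesIO(data)).load()


# A harmless global that is not on the block list still round-trips.
safe = restricted_loads(pickle.dumps(OrderedDict(a=1)))
```

A block list only stops the targets it names, so it should be read as a mitigation rather than a guarantee that arbitrary pickles are safe to load.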
Updated PARCS applications and IMMS_CCS score function (#609)
This application builds on the existing PARCS (parcs_ccs_calc.cc)
application and the IMMS_CCS energy term originally developed by
smturzo.
I extended PARCS to support multimeric protein complexes, enabling
simulation of PARCS CCS data for input structures containing multiple
chains. In addition, the existing IMMS_CCS energy term, which was
previously limited to monomers, was generalized for complexes through
the introduction of a new IMMS_ComplexCCS_Energy term. I also
implemented a new CCS_IMMS_with_CryoEMEnergy score term that integrates
experimental CCS data with cryo-EM information.
Method: For PARCS multimer support, I introduced a boolean flag
(-multimer) to the existing PARCS application. When enabled, the
algorithm predicts CCS values for multimeric assemblies by
reparameterizing the original CCS calculation.
For the IMMS-based energy terms, I built on the existing
CCS_IMMSEnergy implementation
(source/src/core/energy_methods/CCS_IMMSEnergy.cc/.hh) by adding new
energy classes:
* CCS_IMMSComplexEnergy, which enables CCS-based scoring for protein complexes
* CCS_IMMS_with_CryoEMEnergy, which incorporates cryo-EM restraints alongside experimental CCS data
Integration test: I added an integration test following the existing
monomer test, with the additional flag for multimers. The test passed.
Fix dropped settings issues in HighResDocker (#520)
The copy constructor of HighResDocker was not copying over the resfile_
member, which means it was ignoring that setting. Since the copy
constructor is effectively a straight member-by-member copy, we can
simply delete it and rely on the autogenerated copy constructor.
Additionally, I noticed that the initialize_from_options() function was
declaring local variables, rather than changing the member variables.
Fix this.
Supporting task retries in PyRosettaCluster (#605)
`PyRosettaCluster` supports running tasks on available compute
resources; however, often it's more economical to run tasks on
preemptible compute resources, such as cloud spot instances or backfill
queues. This PR exposes Dask's task retry API via the
`PyRosettaCluster.distribute` method, allowing configuration of the
number of automatic retries for each submitted task. When the `retries`
keyword argument is set, `PyRosettaCluster` will reschedule
failed tasks up to the specified number of times if compute resources
are reclaimed midway through a protocol.
This PR also adds a logging warning if using the `resources` keyword
argument with `dask` version `<2.1.0`.
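The retry semantics described above can be illustrated with a plain-Python sketch. It does not use Dask or PyRosettaCluster: `submit_with_retries` and `flaky_protocol` are hypothetical names, and a standard-library thread pool stands in for the scheduler. The only point is that a task with `retries=2` may run up to three times before its failure is reported.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Any, Callable


def submit_with_retries(
    executor: Executor, func: Callable[..., Any], *args: Any, retries: int = 0
) -> Any:
    """Run `func` on `executor`, resubmitting it up to `retries` times on failure."""
    attempts = retries + 1  # one initial attempt plus the allowed retries
    for attempt in range(1, attempts + 1):
        future = executor.submit(func, *args)
        try:
            return future.result()
        except Exception:
            # Re-raise only once the retry budget is exhausted.
            if attempt == attempts:
                raise


# Simulate a task that is interrupted twice before it finally completes.
calls = {"n": 0}


def flaky_protocol(x: int) -> int:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("worker preempted")
    return x * 2


with ThreadPoolExecutor(max_workers=1) as pool:
    result = submit_with_retries(pool, flaky_protocol, 21, retries=2)
```

With `retries=1` the same task would fail, since the second attempt is still interrupted and no further attempts remain.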
Add beta_jan25 energy function (#548)
The aim of this PR is to add `beta_jan25` to the `rosetta` source code. This is an updated version of `beta_nov16`.
We will shortly post a manuscript describing how we developed `beta_jan25`. The key updates are to the LJ potential. We identified steric clashing in proteins that were relaxed or designed using `beta_nov16`. We identified examples of this problem in a high-quality benchmark from the `dualoptE` protocol used to train the energy function. We then used this benchmark, and the others in `dualoptE`, to refit a small number of LJ parameters. The refitting largely eliminated the clashing problem, and `beta_jan25` is as good as or better than `beta_nov16` when assessed on multiple benchmarks using validation data.
Frank also advised me that -gen_potential should stay the same (on top of beta_nov16), and we decided to have plain -beta invoke beta_jan25.
Moving the TestEnvironmentReproducibility unit test, and updating environment export commands (#567)
The aim of this PR is to:
- Update the pixi environment export command to save the `pixi.lock`
file, rather than the generic YAML file export which does not pin
package versions necessary for environment reproducibility.
- Add support for the pixi `--manifest-path` flag and the uv `--project`
flag, so that `PyRosettaCluster` simulations can be run from outside of
the pixi manifest or uv project directories.
- Add support to cache the pixi manifest file and uv `pyproject.toml`
file to PyRosettaCluster results, which are necessary to reproduce the
pixi/uv project in conjunction with the cached `pixi.lock` file or
`requirements.txt` file.
- Sanitize conda channels of username/password credentials instead of
stripping out the entire URL from the exported environment file, since
PyRosetta no longer requires a username/password for installation,
making the simulation reproduction lifecycle simpler by obviating the
need to reconfigure the conda channel when recreating a virtual
environment.
- Update warning/exception messages for pixi, uv, and mamba, instead of
just expecting conda.
- Move the `pyrosetta.distributed.cluster.recreate_environment` function
to the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository. The reason is that we encountered the "chicken or the
egg" problem, where the user must install PyRosetta (typically through
an environment manager like pixi, uv, conda, or mamba) in order to gain
access to the `recreate_environment` function, but then the
`recreate_environment` function cannot be run from inside a virtual
environment, since inside the function it calls `pixi install ...` and
`uv venv ...` which can implement global interpreter locks that hang
indefinitely because the outer environment is already activated.
Therefore, the code needs to be run from the system python interpreter,
which requires the `recreate_environment` function to exist in a
separate repository, as it is not functional in the current state.
- Move the
`pyrosetta.tests.distributed.cluster.test_reproducibility.TestEnvironmentReproducibility`
unit tests (and supporting modules) into the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository. These unit tests do not currently run on the
Benchmark server (due to its implementation of a pure virtual
environment that does not install `conda`, `mamba`, `uv`, or `pixi`) and
were originally added to test the `recreate_environment` function from
the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository (however that presented a second "chicken or the egg"
problem which will be solved in a separate PR for that repository).
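The credential sanitization mentioned in the list above (keeping the channel URL but dropping the username/password) can be sketched with the standard library as follows. The function name `strip_credentials` and the example host are hypothetical, and this is not the code used in PyRosettaCluster; it only shows one way to remove the userinfo part of a URL while leaving the rest intact.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(channel_url: str) -> str:
    """Remove any `user:password@` prefix from the network location of a URL."""
    parts = urlsplit(channel_url)
    # Everything after the last "@" in the netloc is the host (and optional port);
    # if there is no "@", the netloc is returned unchanged.
    netloc = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=netloc))


# Hypothetical channel URL with embedded credentials.
sanitized = strip_credentials("https://user:secret@conda.example.org/channel")
```

Channel entries without credentials, including bare names such as `conda-forge`, pass through unchanged, so the function can be applied to every channel in an exported environment file.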
The following is unrelated, but updated while here:
- Add a more helpful exception when using `LocalCluster(security=True)`
and the `cryptography` package is not installed.
Add uv/conda/mamba unit tests for PyRosettaCluster virtual environment recreation (#560)
This PR is a follow-up to PR #536 to add unit tests for `uv`, `conda`,
and `mamba` workflows in the PyRosettaCluster simulation reproduction
lifecycle. In particular, this PR changes the
`pyrosetta.distributed.cluster.recreate_environment` function to support
recreating a `conda`/`mamba` virtual environment in a local prefix
directory rather than via a provided environment name to the global
`conda`/`mamba` environment context, which mimics the `uv` and `pixi`
workflows more closely.