Update SecureUnpickler disallowed packages (#611)
This PR updates the `pyrosetta.secure_unpickle.SecureUnpickler` class to
block additional callable targets reachable via the `pickle` module,
including the `numpy.load` and `pandas.read_pickle` functions. Unit
tests added herein demonstrate that safe `numpy`/`pandas` callables
such as `numpy.array` and `pandas.DataFrame` remain deserializable.
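The general blocklist pattern behind a restricted unpickler can be sketched with the standard library alone; `RestrictedUnpickler`, `DISALLOWED`, and `restricted_loads` below are illustrative names, not PyRosetta's actual `SecureUnpickler` API:

```python
import io
import pickle

# Hypothetical blocklist of (module, qualname) pairs that must never be
# resolved during deserialization; the real SecureUnpickler maintains its own.
DISALLOWED = {
    ("numpy", "load"),
    ("pandas", "read_pickle"),
    ("shutil", "rmtree"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse to resolve any globally disallowed callable.
        if (module, name) in DISALLOWED:
            raise pickle.UnpicklingError(
                f"Deserialization of {module}.{name} is disallowed"
            )
        return super().find_class(module, name)

def restricted_loads(data):
    """Deserialize bytes, rejecting payloads that reference blocked callables."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Safe payloads round-trip normally, while a payload referencing a blocked callable raises `pickle.UnpicklingError` before the callable is ever resolved.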
Updated PARCS applications and IMMS_CCS score function (#609)
This application builds on the existing PARCS (parcs_ccs_calc.cc)
application and the IMMS_CCS energy term originally developed by
smturzo.
I extended PARCS to support multimeric protein complexes, enabling
simulation of PARCS CCS data for input structures containing multiple
chains. In addition, the existing IMMS_CCS energy term, which was
previously limited to monomers, was generalized for complexes through
the introduction of a new IMMS_ComplexCCS_Energy term. I also
implemented a new CCS_IMMS_with_CryoEMEnergy score term that integrates
experimental CCS data with cryo-EM information.
Method: For PARCS multimer support, I introduced a boolean flag
(-multimer) to the existing PARCS application. When enabled, the
algorithm predicts CCS values for multimeric assemblies by
reparameterizing the original CCS calculation.
For the IMMS-based energy terms, I built on the existing
CCS_IMMSEnergy implementation
(source/src/core/energy_methods/CCS_IMMSEnergy.cc/.hh) by adding new
energy classes:
* CCS_IMMSComplexEnergy, which enables CCS-based scoring for protein complexes
* CCS_IMMS_with_CryoEMEnergy, which incorporates cryo-EM restraints alongside experimental CCS data
Integration test: I added an integration test for multimers, mirroring
the existing monomer test with the additional `-multimer` flag. The
test passed.
Fix dropped settings issues in HighResDocker (#520)
The copy constructor of HighResDocker was not copying the resfile_
member, meaning that setting was silently dropped. Since the copy
constructor is effectively a straight member-by-member copy, we can
simply delete it and rely on the compiler-generated copy constructor.
Additionally, the initialize_from_options() function was assigning to
newly declared local variables rather than to the member variables;
this PR fixes that as well.
Supporting task retries in PyRosettaCluster (#605)
`PyRosettaCluster` supports running tasks on available compute
resources; however, often it's more economical to run tasks on
preemptible compute resources, such as cloud spot instances or backfill
queues. This PR exposes Dask's task retry API via the
`PyRosettaCluster.distribute` method, allowing configuration of the
number of automatic retries for each submitted task. When the `retries`
keyword argument is set, `PyRosettaCluster` will reschedule
failed tasks up to the specified number of times if compute resources
are reclaimed midway through a protocol.
This PR also adds a logging warning if using the `resources` keyword
argument with `dask` version `<2.1.0`.
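Dask handles the rescheduling itself, but the semantics of the `retries` keyword can be illustrated with a minimal stand-alone retry loop; `run_with_retries` is a hypothetical helper, not part of the PyRosettaCluster API:

```python
def run_with_retries(task, retries=0):
    """Run task(), resubmitting up to `retries` extra times on failure.

    This mirrors the contract of the `retries` keyword exposed by
    `PyRosettaCluster.distribute`: a task is attempted at most
    1 + retries times before its exception propagates.
    """
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise

# A task that fails twice (e.g., spot instances reclaimed mid-protocol)
# and then succeeds on the third attempt:
attempts = {"n": 0}
def flaky_task():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("worker was reclaimed")
    return "decoy.pdb"
```

With `retries=2` the flaky task above succeeds on its third attempt; with `retries=1` its exception propagates.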
Add beta_jan25 energy function (#548)
The aim of this PR is to add `beta_jan25` to the `rosetta` source code. This is an updated version of `beta_nov16`.
We will shortly post a manuscript describing how we developed `beta_jan25`. The key updates are to the LJ potential. We identified steric clashing in proteins that were relaxed or designed using `beta_nov16`, with examples of the problem in a high-quality benchmark from the `dualoptE` protocol used to train the energy function. We then used this benchmark, and the others in `dualoptE`, to refit a small number of LJ parameters. The refitting largely eliminated the clashing problem, and `beta_jan25` is as good as or better than `beta_nov16` when assessed on multiple benchmarks using validation data.
Frank also advised that `-gen_potential` should stay the same (layered on top of `beta_nov16`), and we decided to have plain `-beta` invoke `beta_jan25`.
Moving the TestEnvironmentReproducibility unit test, and updating environment export commands (#567)
The aim of this PR is to:
- Update the pixi environment export command to save the `pixi.lock`
file, rather than the generic YAML file export which does not pin
package versions necessary for environment reproducibility.
- Add support for the pixi `--manifest-path` flag and the uv `--project`
flag, so that `PyRosettaCluster` simulations can be run from outside of
the pixi manifest or uv project directories.
- Add support to cache the pixi manifest file and uv `pyproject.toml`
file to PyRosettaCluster results, which are necessary to reproduce the
pixi/uv project in conjunction with the cached `pixi.lock` file or
`requirements.txt` file.
- Sanitize conda channels of username/password credentials instead of
stripping out the entire URL from the exported environment file. Since
PyRosetta no longer requires a username/password for installation, this
simplifies the simulation reproduction lifecycle by obviating the need
to reconfigure the conda channel when recreating a virtual
environment.
- Update warning/exception messages for pixi, uv, and mamba, instead of
just expecting conda.
- Move the `pyrosetta.distributed.cluster.recreate_environment` function
to the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository. The reason is a "chicken or the egg" problem: the
user must install PyRosetta (typically through an environment manager
like pixi, uv, conda, or mamba) in order to gain access to the
`recreate_environment` function, but the `recreate_environment`
function cannot then be run from inside a virtual environment, because
it internally calls `pixi install ...` and `uv venv ...`, which can
acquire global locks that hang indefinitely while the outer environment
is activated. The code therefore needs to be run from the system Python
interpreter, which requires the `recreate_environment` function to live
in a separate repository, as it is not functional in its current
location.
- Move the
`pyrosetta.tests.distributed.cluster.test_reproducibility.TestEnvironmentReproducibility`
unit tests (and supporting modules) into the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository. These unit tests do not currently run on the
Benchmark server (due to its implementation of a pure virtual
environment that does not install `conda`, `mamba`, `uv`, or `pixi`) and
were originally added to test the `recreate_environment` function from
the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository (however that presented a second "chicken or the egg"
problem which will be solved in a separate PR for that repository).
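The credential sanitization described above, dropping only the userinfo from a channel URL while preserving the rest, can be sketched with `urllib.parse`; `sanitize_channel` is an illustrative helper, not the function used in the PR:

```python
from urllib.parse import urlsplit, urlunsplit

def sanitize_channel(url):
    """Drop username:password credentials from a conda channel URL,
    preserving the scheme, host, port, path, and query."""
    parts = urlsplit(url)
    netloc = parts.hostname or ""
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

For example, `https://user:secret@conda.example.com/channel` becomes `https://conda.example.com/channel`, so the channel remains usable without reconfiguration when recreating the environment.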
The following is unrelated, but updated while here:
- Add a more helpful exception when using `LocalCluster(security=True)`
and the `cryptography` package is not installed.
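A more helpful exception for a missing optional dependency typically follows the probe-up-front pattern; `require_optional_package` below is an illustrative sketch, not the code merged in this PR:

```python
import importlib.util

def require_optional_package(name, reason):
    """Raise an actionable ImportError if an optional dependency is missing."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"The '{name}' package is required {reason}. "
            f"Install it with, e.g., 'pip install {name}'."
        )

# e.g., before constructing LocalCluster(security=True):
# require_optional_package(
#     "cryptography", "to generate temporary TLS credentials"
# )
```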
Add uv/conda/mamba unit tests for PyRosettaCluster virtual environment recreation (#560)
This PR is a follow-up to PR #536 to add unit tests for `uv`, `conda`,
and `mamba` workflows in the PyRosettaCluster simulation reproduction
lifecycle. In particular, this PR changes the
`pyrosetta.distributed.cluster.recreate_environment` function to support
recreating a `conda`/`mamba` virtual environment in a local prefix
directory, rather than via an environment name registered in the global
`conda`/`mamba` environment context, which more closely mimics the `uv`
and `pixi` workflows.
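The difference between name-based and prefix-based environment creation comes down to the command line passed to conda/mamba; `conda_create_command` below is a hypothetical sketch of that distinction, not the PR's implementation:

```python
def conda_create_command(executable, packages, name=None, prefix=None):
    """Build a conda/mamba 'create' command targeting either a named
    environment in the global registry or a local prefix directory."""
    if (name is None) == (prefix is None):
        raise ValueError("Specify exactly one of 'name' or 'prefix'.")
    cmd = [executable, "create", "--yes"]
    if prefix is not None:
        cmd += ["--prefix", prefix]  # local directory, mimics uv/pixi layout
    else:
        cmd += ["--name", name]  # global environment registry
    return cmd + list(packages)
```

Using `--prefix ./env` keeps the environment inside the project directory, like a `uv` virtualenv or a pixi `.pixi` directory, instead of registering it globally.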
Adding dask security to PyRosettaCluster (#531)
A primary feature of `PyRosettaCluster` is that arbitrary user-provided
PyRosetta protocols are pickled, sent over a network, and unpickled,
which allows the user to run customized macromolecular design and
modeling workflows. If the user is operating `PyRosettaCluster` behind a
trusted private network segment (i.e., a firewall), the current
implementation is already secure from external threats (such as
eavesdropping, tampering or impersonation). However, in cases of running
`PyRosettaCluster` without a truly isolated and trusted environment, the
`dask` library can be configured to use TLS/SSL communication between
network endpoints for authenticated and encrypted transmission of data.
This PR aims to integrate Dask's TLS/SSL communication into
`PyRosettaCluster`, as well as implement a few additional security
measures:
1. Adds a `security` keyword argument to `PyRosettaCluster`, which can
accept a `dask.distributed.Security()` object. Alternatively, it accepts
a `bool` object, where if `True` we use the `cryptography` package
through the `dask` and `dask-jobqueue` APIs to generate a temporary
`dask.distributed.Security()` object for the simulation. Because
`PyRosettaCluster` supports remote dask worker instantiation via the
`dask-jobqueue` module, security is now enabled by default for the use
of remote clusters (such as `SLURMCluster`), and thus this PR adds
[cryptography](https://pypi.org/project/cryptography/) as a required
package for the `pyrosetta.distributed` framework (note that
`cryptography` has very few dependencies: essentially only `cffi` and
`openssl`, the latter of which already ships with standard Python
installations).
2. Adds a `pyrosetta.distributed.cluster.generate_dask_tls_security()`
function, which uses the OpenSSL executable that ships with standard
Python installations (via the native Python `ssl` library) to
generate a pre-configured `dask.distributed.Security()` object with the
necessary key/certificate pairs.
3. Enables Hash-based Message Authentication Code (HMAC)-SHA256
verification of `cloudpickle`d data (including the arbitrary
user-provided PyRosetta protocols and task `kwargs`) between network
endpoints (including the host node process, each dask worker process,
and the `billiard` subprocesses; i.e., client ↔ worker, client ↔
subprocess), where the cryptographic pseudo-random key is sent to dask
workers out-of-band using a dask worker plugin.
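HMAC-SHA256 verification of pickled payloads can be sketched with the standard library; the shared key exchange itself (done out-of-band via a dask worker plugin in this PR) is assumed here, and `sign`/`verify_and_load` are illustrative names:

```python
import hashlib
import hmac
import pickle

SHARED_KEY = b"out-of-band pseudo-random key"  # delivered via a worker plugin

def sign(payload_obj, key=SHARED_KEY):
    """Pickle an object and prepend a 32-byte HMAC-SHA256 tag."""
    data = pickle.dumps(payload_obj)
    tag = hmac.new(key, data, hashlib.sha256).digest()
    return tag + data

def verify_and_load(message, key=SHARED_KEY):
    """Check the tag in constant time before unpickling; reject tampering."""
    tag, data = message[:32], message[32:]
    expected = hmac.new(key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed: payload was tampered with")
    return pickle.loads(data)
```

Verifying before unpickling means tampered bytes never reach the unpickler at all, which matters because unpickling attacker-controlled data is itself dangerous.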
4. Adds nonce caching on the host node process and all worker processes
if security is disabled, with a `max_nonce` keyword argument that allows
setting the maximum nonce cache size in each process. Nonces are unique
keys added to each distributed message over the network (see the
`cryptography` package
[Glossary](https://cryptography.io/en/latest/glossary/) for more
information), where if the same nonce is encountered twice in the nonce
cache, it may indicate a replay attack and the simulation is
intentionally terminated for security reasons. Note that nonce caching
is disabled if dask security is already enabled, since the nonce caches
may add several additional MB of memory per process (a modest but
then-unnecessary cost).
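Replay detection with a bounded nonce cache can be sketched as follows; `NonceCache` is an illustrative stand-in for the internal cache, with `max_nonce` capping its size as described above:

```python
import os
from collections import OrderedDict

class NonceCache:
    """Remember recently seen nonces; a repeat indicates a possible replay."""

    def __init__(self, max_nonce=100_000):
        self.max_nonce = max_nonce
        self._seen = OrderedDict()

    def check(self, nonce):
        if nonce in self._seen:
            # The same nonce arriving twice may be a replayed message.
            raise RuntimeError("Duplicate nonce: possible replay attack")
        self._seen[nonce] = True
        if len(self._seen) > self.max_nonce:
            self._seen.popitem(last=False)  # evict the oldest nonce

def new_nonce():
    return os.urandom(16)  # unique value attached to each distributed message
```

Bounding the cache trades memory for a detection window: once a nonce is evicted, a replay of that old message would no longer be caught.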
Integrating PyRosetta initialization files into PyRosettaCluster (#511)
The purpose of this PR is to support several new features:
1. Adds an `output_init_file` instance attribute to `PyRosettaCluster`,
enabling dumping of a `.init` or `.init.bz2` file upon instantiation.
2. Adds `author`/`email`/`license` instance attributes to
`PyRosettaCluster`, which are cached in the `.init` or `.init.bz2` file
and in the output decoy and scorefile metadata.
3. Enables the `input_file` keyword argument of the
`pyrosetta.distributed.cluster.reproduce` method to accept a `.init` or
`.init.bz2` file that initializes PyRosetta before simulation
reproduction.
- Also adds a `skip_corrections` keyword argument to enable skipping
ScoreFunction corrections so that the reproduced results may be used for
successive reproductions.
4. Adds a `pyrosetta.distributed.cluster.export_init_file` function that
enables exporting an output decoy (in `.pdb`, `.pdb.bz2`, `.b64_pose`,
`.b64_pose.bz2`, `.pkl_pose`, `.pkl_pose.bz2` format) to a `.init` or
`.init.bz2` file format.
5. Adds a `norm_init_options` instance attribute to `PyRosettaCluster`,
enabling normalization of the task's PyRosetta initialization options.
This optional convenience feature takes advantage of the
`pyrosetta.get_init_options` method to update the `options` and
`extra_options` keyword arguments of each task after PyRosetta
initialization in the `billiard` subprocess on the dask workers, which
expands option names and relativizes any input files and directories to
the `billiard` subprocess current working directory. Relativized paths
are ideal for reproduction of simulations by a second party on a
different filesystem.
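Relativizing an input path against the subprocess working directory is a one-liner with `os.path`; `relativize_option` below is an illustrative helper, not the PR's implementation:

```python
import os

def relativize_option(path, cwd):
    """Rewrite an input path relative to a working directory, so a second
    party can reproduce the run on a different filesystem layout."""
    return os.path.relpath(path, start=cwd)
```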
6. Adds `pyrosetta.distributed.io.read_init_file` and
`pyrosetta.distributed.io.init_from_file` functions, which handle
`.init` and `.init.bz2` files.
Please note that this PR also depends on PR #462, which supports
`.b64_pose` and `.pkl_pose` file outputs in `PyRosettaCluster`. The
impetus for supporting a `.init` file in the `PyRosettaCluster`
simulation reproduction life cycle is that loading a `.b64_pose` or
`.pkl_pose` file into memory requires that PyRosetta is initialized with
the same residue type set as that used to save the `.b64_pose` or
`.pkl_pose` file (otherwise PyRosetta doesn't know how to reconstruct
the `Pose`, resulting in a segfault). Effectively, a user remains locked
out of `.b64_pose` and `.pkl_pose` files unless PyRosetta is initialized
correctly, which can be easily accomplished by PyRosetta initialization
with a `.init` or `.init.bz2` file. Hence, if a user decides to output
results in `.b64_pose` or `.pkl_pose` format, the `.init` file can then
be used to initialize PyRosetta identically and load the `.b64_pose` or
`.pkl_pose` file into memory.
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>