Add uv/conda/mamba unit tests for PyRosettaCluster virtual environment recreation (#560)
This PR is a follow-up to PR #536 to add unit tests for `uv`, `conda`,
and `mamba` workflows in the PyRosettaCluster simulation reproduction
lifecycle. In particular, this PR changes the
`pyrosetta.distributed.cluster.recreate_environment` function to support
recreating a `conda`/`mamba` virtual environment in a local prefix
directory rather than via a provided environment name to the global
`conda`/`mamba` environment context, which mimics the `uv` and `pixi`
workflows more closely.
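As a rough illustration of the prefix-based approach, the sketch below builds a `conda`/`mamba` command that creates an environment in a local directory instead of registering it by name. The helper name `build_recreate_command` and its arguments are hypothetical and are not the actual `recreate_environment` signature; only the `env create --prefix ... --file ...` command-line form is taken from the `conda` CLI.

```python
from pathlib import Path
from typing import List


def build_recreate_command(manager: str, env_dir: str, yaml_file: str) -> List[str]:
    """Build a command that creates an environment under a local prefix directory.

    Hypothetical helper for illustration; not part of PyRosettaCluster.
    """
    if manager not in ("conda", "mamba"):
        raise ValueError(f"Unsupported environment manager: {manager!r}")
    # A prefix is a filesystem path, so the environment lives beside the
    # project rather than in the global list of named environments.
    prefix = str(Path(env_dir).resolve())
    return [manager, "env", "create", "--prefix", prefix, "--file", yaml_file]


if __name__ == "__main__":
    print(" ".join(build_recreate_command("mamba", "./my_env", "environment.yml")))
```

The resulting list could be passed to `subprocess.run(..., check=True)`; that step is left out here so the sketch has no side effects.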
Supporting mamba, uv, and pixi environment managers in PyRosettaCluster (#536)
`PyRosettaCluster` currently uses the `conda` environment manager for
simulation reproduction life cycles. In this PR, we add support for
`mamba`, `uv`, and `pixi` as well. Upon instantiation of
`PyRosettaCluster`, we now look for the operating system environment
variable `PYROSETTACLUSTER_ENVIRONMENT_MANAGER` to determine if the user
wants to use a certain environment manager, or otherwise fall back on
detecting whether `pixi`, `uv`, `mamba`, or `conda` is an executable to
determine which environment manager to use. A second party reproducing a
simulation may then use the same or a different environment manager,
since the raw YAML file string is universal across environment managers
and is cached in the `PyRosettaCluster` decoy output file and/or
scorefile. Note that unit tests are not added herein because the
Benchmark server implements virtual environments (i.e., using `pip`),
and therefore these new functionalities are not testable at the moment
(and so they may be considered experimental until thoroughly tested).
This PR also cleans up some module-level import side effects (e.g.,
running `distributed.get_worker()` upon importing
`pyrosetta.distributed.cluster`) which were subtly affecting the
PyRosetta documentation builds with `sphinx`.
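The selection logic described above might look roughly like the following sketch. The function name, the injectable `environ`/`which` parameters, and the exact fallback precedence (taken here from the order in which the managers are listed in the text) are assumptions for illustration, not the actual implementation.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

# Assumed fallback order, following the order listed in the description.
SUPPORTED_MANAGERS = ("pixi", "uv", "mamba", "conda")
ENV_VAR = "PYROSETTACLUSTER_ENVIRONMENT_MANAGER"


def detect_environment_manager(
    environ: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the requested environment manager, or the first one found on PATH."""
    environ = os.environ if environ is None else environ
    requested = environ.get(ENV_VAR)
    if requested:
        if requested not in SUPPORTED_MANAGERS:
            raise ValueError(f"Unsupported environment manager: {requested!r}")
        return requested
    # No explicit request: fall back on detecting an available executable.
    for manager in SUPPORTED_MANAGERS:
        if which(manager) is not None:
            return manager
    raise RuntimeError("No supported environment manager executable was found.")
```

Passing `environ` and `which` as parameters is only a convenience for testing the sketch without touching the real process environment.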
Adding dask security to PyRosettaCluster (#531)
A primary feature of `PyRosettaCluster` is that arbitrary user-provided
PyRosetta protocols are pickled, sent over a network, and unpickled,
which allows the user to run customized macromolecular design and
modeling workflows. If the user is operating `PyRosettaCluster` behind a
trusted private network segment (e.g., one protected by a firewall), the current
implementation is already secure from external threats (such as
eavesdropping, tampering or impersonation). However, in cases of running
`PyRosettaCluster` without a truly isolated and trusted environment, the
`dask` library can be configured to use TLS/SSL communication between
network endpoints for authenticated and encrypted transmission of data.
This PR aims to integrate Dask's TLS/SSL communication into
`PyRosettaCluster`, as well as implement a few additional security
measures:
1. Adds a `security` keyword argument to `PyRosettaCluster`, which can
accept a `dask.distributed.Security()` object. Alternatively, it accepts
a `bool` object, where if `True` we use the `cryptography` package
through the `dask` and `dask-jobqueue` APIs to generate a temporary
`dask.distributed.Security()` object for the simulation. Because
`PyRosettaCluster` supports remote dask worker instantiation via the
`dask-jobqueue` module, security is now enabled by default for the use
of remote clusters (such as `SLURMCluster`), and thus this PR adds
[cryptography](https://pypi.org/project/cryptography/) as a required
package for the `pyrosetta.distributed` framework (note that
`cryptography` has very few dependencies: only `cffi`, and `openssl`,
which already ships with standard Python installations).
2. Adds a `pyrosetta.distributed.cluster.generate_dask_tls_security()`
function, which uses the OpenSSL executable that ships with standard
Python installations (due to the native Python `ssl` library) to
generate a pre-configured `dask.distributed.Security()` object with the
necessary key/certificate pairs.
3. Enables Hash-based Message Authentication Code (HMAC)-SHA256
verification of `cloudpickle`d data (including the arbitrary
user-provided PyRosetta protocols and task `kwargs`) between network
endpoints (including the host node process, each dask worker process,
and the `billiard` subprocesses; i.e., client ↔ worker, client ↔
subprocess), where the cryptographic pseudo-random key is sent to dask
workers out-of-band using a dask worker plugin.
4. Adds nonce caching on the host node process and all worker processes
if security is disabled, with a `max_nonce` keyword argument that allows
setting the maximum nonce cache size in each process. Nonces are unique
keys added to each distributed message over the network (see the
`cryptography` package
[Glossary](https://cryptography.io/en/latest/glossary/) for more
information), where if the same nonce is encountered twice in the nonce
cache, it may indicate a replay attack and the simulation is
intentionally terminated for security reasons. Note that nonce caching
is disabled if dask security is already enabled, which avoids the few
additional MB of memory per process that the nonce caches may add.
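To make items 3 and 4 concrete, here is a minimal standard-library sketch of HMAC-SHA256 message verification combined with a bounded nonce cache. It uses `pickle` as a stand-in for `cloudpickle`, and the message layout (32-byte tag, 16-byte nonce, payload), the `NonceCache` class, and the `sign`/`verify` helpers are all assumptions for illustration; they are not the PyRosettaCluster wire format or API.

```python
import hashlib
import hmac
import os
import pickle
from collections import OrderedDict

TAG_SIZE = 32    # size of an HMAC-SHA256 digest in bytes
NONCE_SIZE = 16  # assumed nonce length for this sketch


class NonceCache:
    """Remember recently seen nonces, evicting the oldest beyond `max_nonce`."""

    def __init__(self, max_nonce: int = 4096) -> None:
        self.max_nonce = max_nonce
        self._seen = OrderedDict()

    def check_and_add(self, nonce: bytes) -> None:
        if nonce in self._seen:
            raise RuntimeError("Nonce seen twice; possible replay attack.")
        self._seen[nonce] = None
        if len(self._seen) > self.max_nonce:
            self._seen.popitem(last=False)  # drop the oldest nonce


def sign(key: bytes, obj) -> bytes:
    """Serialize `obj` and prepend an HMAC tag computed over nonce + payload."""
    payload = pickle.dumps(obj)
    nonce = os.urandom(NONCE_SIZE)
    tag = hmac.new(key, nonce + payload, hashlib.sha256).digest()
    return tag + nonce + payload


def verify(key: bytes, message: bytes, cache: NonceCache):
    """Check the tag and the nonce before deserializing the payload."""
    tag = message[:TAG_SIZE]
    nonce = message[TAG_SIZE:TAG_SIZE + NONCE_SIZE]
    payload = message[TAG_SIZE + NONCE_SIZE:]
    expected = hmac.new(key, nonce + payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed; message was altered.")
    cache.check_and_add(nonce)
    return pickle.loads(payload)
```

Verifying the tag before calling `pickle.loads` matters: the payload is only deserialized once it is known to come from a holder of the shared key.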
Defining a PyRosetta build signature specifying extras (#537)
This quick PR defines a unique PyRosetta build signature that differs
from the existing PyRosetta version string in that it contains build
extras as given by `rosetta.utility.Version.extras()`. The aim is to
capture build differences in the official PyRosetta `ml` images from
non-`ml` images. The default value of the
`PyRosettaCluster(pyrosetta_build=...)` keyword argument and the value of
the `pyrosetta_build` key in the PyRosetta `.init` file are updated
accordingly.
Here is an example of the new `pyrosetta._build_signature()` function's
output:
```
PyRosetta4.Release.python311.mac.cxx11thread.serialization[extras:cxx11thread+serialization]2025.19+release.1354d05daa4c339d591afeecef3c94ca2d38680e
```
Integrating PyRosetta initialization files into PyRosettaCluster (#511)
The purpose of this PR is to support several new features:
1. Adds an `output_init_file` instance attribute to `PyRosettaCluster`,
enabling dumping of a `.init` or `.init.bz2` file upon instantiation.
2. Adds `author`/`email`/`license` instance attributes to
`PyRosettaCluster`, which are cached in the `.init` or `.init.bz2` file
and output decoy and scorefile metadata.
3. Enables the `input_file` keyword argument of the
`pyrosetta.distributed.cluster.reproduce` method to accept a `.init` or
`.init.bz2` file that initializes PyRosetta before simulation
reproduction.
- Also adds a `skip_corrections` keyword argument to enable skipping
ScoreFunction corrections so that the reproduced results may be used for
successive reproductions.
4. Adds a `pyrosetta.distributed.cluster.export_init_file` function that
enables exporting an output decoy (in `.pdb`, `.pdb.bz2`, `.b64_pose`,
`.b64_pose.bz2`, `.pkl_pose`, `.pkl_pose.bz2` format) to a `.init` or
`.init.bz2` file format.
5. Adds a `norm_init_options` instance attribute to `PyRosettaCluster`,
enabling normalization of the task's PyRosetta initialization options.
This optional convenience feature takes advantage of the
`pyrosetta.get_init_options` method to update the `options` and
`extra_options` keyword arguments of each task after PyRosetta
initialization in the `billiard` subprocess on the dask workers, which
expands option names and relativizes any input files and directories to
the `billiard` subprocess current working directory. Relativized paths
are ideal for reproduction of simulations by a second party on a
different filesystem.
6. Adds `pyrosetta.distributed.io.read_init_file` and
`pyrosetta.distributed.io.init_from_file` functions, which handle
`.init` and `.init.bz2` files.
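The path relativization mentioned in item 5 can be pictured with the sketch below, which rewrites absolute path values relative to a working directory. The helper name and the flat `dict` of option strings are hypothetical simplifications; the actual feature relies on `pyrosetta.get_init_options`, which is not modeled here.

```python
import os
from typing import Dict


def relativize_option_paths(options: Dict[str, str], cwd: str) -> Dict[str, str]:
    """Rewrite absolute path values in `options` as paths relative to `cwd`.

    Hypothetical helper for illustration; non-path values pass through unchanged.
    """
    result = {}
    for name, value in options.items():
        if os.path.isabs(value):
            result[name] = os.path.relpath(value, start=cwd)
        else:
            result[name] = value
    return result
```

A relative path such as `inputs/a.pdb` stays valid when the whole working directory is copied to another filesystem, whereas the original absolute path would not.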
Please note that this PR also depends on PR #462, which supports
`.b64_pose` and `.pkl_pose` file outputs in `PyRosettaCluster`. The
impetus for supporting a `.init` file in the `PyRosettaCluster`
simulation reproduction life cycle is that loading a `.b64_pose` or
`.pkl_pose` file into memory requires that PyRosetta is initialized with
the same residue type set as that used to save the `.b64_pose` or
`.pkl_pose` file (otherwise PyRosetta doesn't know how to reconstruct
the `Pose`, resulting in a segfault). Effectively, a user remains locked
out of `.b64_pose` and `.pkl_pose` files unless PyRosetta is initialized
correctly, which can be easily accomplished by PyRosetta initialization
with a `.init` or `.init.bz2` file. Hence, if a user decides to output
results in `.b64_pose` or `.pkl_pose` format, the `.init` file can then
be used to initialize PyRosetta identically and load the `.b64_pose` or
`.pkl_pose` file into memory.
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>
Support saving PackedPose objects and arbitrary score data in PyRosettaCluster (#462)
This PR adds support for `PyRosettaCluster` to output decoys in the
`.pkl_pose` and `.b64_pose` file formats (introduced in #431), as well
as to output scorefiles in pickled `pandas.DataFrame` format (which
supports arbitrary datatypes that have been saved in the
`PackedPose.scores` dictionary as introduced in #430).
Currently, `PyRosettaCluster` instance keyword arguments are saved on
`REMARK` lines in output PDB files. In this PR, `PyRosettaCluster`
instance keyword arguments are cached in the
`pyrosetta.rosetta.core.pose.datacache.CacheableDataType.STRING_MAP`
datacache, which results in a `REMARK` line if the pose is output as a
PDB file, and which is saved along with the pose if it is output as a
`.pkl_pose` or `.b64_pose` file. Therefore, this PR
supports reproducing a decoy from a `.pkl_pose` or `.b64_pose` file that
has been output by `PyRosettaCluster`. The benefit of saving decoys in
the `.pkl_pose` and `.b64_pose` formats over PDB format is that
arbitrary Python types are still cached in the `.pkl_pose` and
`.b64_pose` files and do not require JSON-encoding (since JSON-encoding
can't serialize arbitrary data types). Additionally, `.pkl_pose` and
`.b64_pose` files save the exact atomic coordinates of the pose, and
therefore it can be demonstrated that a decoy can be recursively
reproduced _ad infinitum_ through a set of custom PyRosetta protocols
with exact atomic coordinates (note that `PyRosettaCluster` has already
been doing this, but because the PDB file format rounds the atomic
coordinates it was not possible to explicitly show this).
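The rounding issue can be seen with a single coordinate. The sketch below compares a round trip through a PDB-style fixed-width field with three decimal places against a round trip through `pickle`; the helper names and the example value are made up for illustration, and a bare `float` stands in for a full `Pose`.

```python
import pickle


def roundtrip_pdb(coordinate: float) -> float:
    """Write a coordinate as an 8-character, 3-decimal field and read it back."""
    return float(f"{coordinate:8.3f}")


def roundtrip_pickle(coordinate: float) -> float:
    """Serialize a coordinate with pickle and read it back."""
    return pickle.loads(pickle.dumps(coordinate))


if __name__ == "__main__":
    x = 12.345678901234567  # hypothetical x coordinate in angstroms
    print(roundtrip_pdb(x) == x)     # the PDB-style field drops precision
    print(roundtrip_pickle(x) == x)  # the pickled value is bit-for-bit identical
```

Because each reproduction step starts from the saved coordinates, even a rounding difference in the third decimal place is enough to make successive PDB-based reproductions diverge from the in-memory pose.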
Furthermore, this PR supports saving `PyRosettaCluster` scorefiles as
pickled `pandas.DataFrame` files, which can also store arbitrary Python
types from the `PackedPose.scores` dictionary. Decoys can also be
reproduced from a scorefile in pickled `pandas.DataFrame` format (which
supports many additional compression types, including `.gz`, `.bz2`,
`.xz`, `.tar`, `.tar.gz`, `.tar.xz`, and more). Please note that
`pandas` is already a dependency of `dask`, and is therefore already a
dependency of `pyrosetta.distributed`, so new third-party dependencies
are not being added with this PR.
This PR also adds a slight guardrail when attempting to reproduce decoys
with a PyRosetta version that is not exactly identical to the version
used to produce the original decoy. Currently, `PyRosettaCluster` just
gives a warning if the PyRosetta versions are not identical between the
build that created the original decoy and the current build being
used for reproduction. In this PR, if there's a mismatch we raise
an error instead, and the user must manually pass an empty string to
the `pyrosetta_build` keyword argument to bypass PyRosetta build
validation. This logic is similar to the conda environment validation
that is already in place.
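The guardrail described above amounts to the following check, shown here as a minimal sketch. The function name and the `recorded_build`/`current_build` parameters are hypothetical; only the behavior (error on mismatch, bypass with an empty string for `pyrosetta_build`) comes from the description.

```python
from typing import Optional


def validate_pyrosetta_build(
    recorded_build: str,
    current_build: str,
    pyrosetta_build: Optional[str] = None,
) -> None:
    """Raise if the builds differ, unless validation is explicitly bypassed."""
    if pyrosetta_build == "":
        return  # the user opted out of build validation with an empty string
    if recorded_build != current_build:
        raise RuntimeError(
            "PyRosetta build mismatch: the decoy was produced with "
            f"{recorded_build!r} but the current build is {current_build!r}. "
            "Pass pyrosetta_build='' to bypass this check."
        )
```

Requiring an explicit empty string, rather than silently warning, makes the decision to reproduce with a different build a deliberate one.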
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>
Supporting a universal PyRosetta initialization file (#503)
This PR aims to add support for a PyRosetta initialization file type
(i.e., a `.init` file) for reproducible PyRosetta initialization.
What started out as a quick and dirty method to cache `.params` files
morphed into a universal file type for streamlining PyRosetta
initialization. This approach takes advantage of the
`ProtocolSettingsMetric` SimpleMetric in PyRosetta to cache Rosetta
command line options and parser script variables, and `zlib`, `base64`,
and `json` libraries for compressing/decompressing arbitrary input text
and binary files, including files within files (i.e., subfiles).
Subfiles are cached by doing a brute force search through all input
files by splitting file contents on spaces, which enables, for example,
caching conformer files on `PDB_ROTAMERS` lines within `.params` files,
and caching subfiles in list files passed in with the `-l` Rosetta
command line flag. Note that the PyRosetta database and any input
directories are not cached in the `.init` file.
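A simplified picture of this caching scheme is sketched below: files are compressed with `zlib`, encoded with `base64`, and stored in a JSON object, while subfiles are discovered by splitting file contents on whitespace and checking whether each token names an existing file. The function names, the JSON layout, and the choice to resolve tokens relative to the containing file's directory are assumptions for illustration and do not describe the actual `.init` file format.

```python
import base64
import json
import os
import zlib
from typing import Iterable, Optional, Set


def encode_file(path: str) -> str:
    """Compress a file's bytes and return them as ASCII text."""
    with open(path, "rb") as handle:
        return base64.b64encode(zlib.compress(handle.read())).decode("ascii")


def decode_file(encoded: str) -> bytes:
    """Invert `encode_file`, returning the original bytes."""
    return zlib.decompress(base64.b64decode(encoded))


def collect_files(path: str, found: Optional[Set[str]] = None) -> Set[str]:
    """Return `path` plus any files referenced inside it, searched recursively."""
    found = set() if found is None else found
    if path in found or not os.path.isfile(path):
        return found
    found.add(path)
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return found  # binary file: cached, but not searched for subfiles
    base_dir = os.path.dirname(path)
    for token in text.split():
        # Assumption: relative tokens are resolved against the containing file.
        collect_files(os.path.join(base_dir, token), found)
    return found


def build_payload(paths: Iterable[str]) -> str:
    """Return a JSON string mapping each discovered file path to its encoded bytes."""
    files: Set[str] = set()
    for path in paths:
        collect_files(path, files)
    return json.dumps({name: encode_file(name) for name in sorted(files)})
```

For a `.params` file containing a `PDB_ROTAMERS` line, the token after `PDB_ROTAMERS` resolves to the conformer file, so both files end up in the payload.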
The following new methods are supported in this PR:
- `pyrosetta.dump_init_file`: write a PyRosetta initialization `.init`
file
- `pyrosetta.init_from_file`: initialize PyRosetta from a `.init` file
- `pyrosetta.get_init_options`: get the currently initialized Rosetta
command line options
- `pyrosetta.get_init_options_from_file`: get PyRosetta initialization
options from a `.init` file
This is a work in progress and suggestions/recommendations are welcome.
Once merged, ideally the `.init` file format can remain stable over
incremental PyRosetta versions.
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>
Co-authored-by: Sergey Lyskov <3302736+lyskov@users.noreply.github.com>
Supporting secure unpickling in PyRosetta (#523)
Currently, `PackedPose` objects are serialized/deserialized using the
`pickle` module (introduced in ~2019), and the `Pose.cache` dictionary
(introduced in #430) supports caching arbitrary datatypes in the `Pose`
object using the `pickle` module. Additionally, #462 enables saving
compressed `PackedPose` objects to disk (i.e., as `*.b64_pose` and
`*.pkl_pose` files) for sharing PyRosetta `Pose` objects with the
scientific community. However, use of the `pickle` module is not secure
(see warning [here](https://docs.python.org/3/library/pickle.html) as
outlined in #519).
In this PR, a secure `pickle.loads` method is developed and slotted
into the `PackedPose` and `Pose.cache` infrastructure to permanently
disallow certain risky packages, modules, and namespaces from being
unpickled/loaded (e.g., `exec`, `eval`, `os.system`, `subprocess.run`,
etc.; this list will be updated over time as needed), thus significantly
improving the security of handling `PackedPose` and `Pose` objects in
memory if received from a second party (e.g., over a socket, queue,
interprocess communication, etc.) or when reading a file received from a
second party (e.g., using `pyrosetta.distributed.io.pose_from_file` with
a `*.b64_pose` or `*.pkl_pose` file). By default, only the `pyrosetta` and
`numpy` packages, and certain `builtins` types (like `dict`,
`complex`, `tuple`, etc.), are considered secure and permitted to be
unpickled/loaded. Other packages that the user may want to
serialize/deserialize may be assigned as secure per-process by the user
in-code (see methods below). It is worth noting that PyTorch developers
have implemented a similar strategy with the
[torch.serialization.add_safe_globals()](https://docs.pytorch.org/docs/stable/notes/serialization.html#torch.serialization.add_safe_globals)
method.
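The general technique can be sketched with the standard library by overriding `pickle.Unpickler.find_class`, as described in the Python `pickle` documentation on restricting globals. The allow lists and the `secure_loads` name below are illustrative assumptions and do not reproduce the lists or the API in `pyrosetta.secure_unpickle`.

```python
import builtins
import io
import pickle

# Hypothetical allow lists for this sketch only.
ALLOWED_BUILTINS = {"dict", "list", "tuple", "set", "frozenset", "complex"}
ALLOWED_PACKAGES = {"collections"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that only resolves globals from an explicit allow list."""

    def find_class(self, module: str, name: str):
        if module == "builtins":
            # Only specific builtins are allowed, which excludes eval and exec.
            if name in ALLOWED_BUILTINS:
                return getattr(builtins, name)
        elif module.split(".")[0] in ALLOWED_PACKAGES:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"Unpickling '{module}.{name}' is not permitted."
        )


def secure_loads(data: bytes):
    """Drop-in analogue of `pickle.loads` that uses the restricted unpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers of numbers and strings never reach `find_class`, so they load as usual; any pickle that references a global outside the allow lists fails before the referenced callable is ever invoked.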
Another aim of this PR is to implement an optional Hash-based Message
Authentication Code (HMAC) key in the `Pose.cache` dictionary for data
integrity verification. While not a security feature, this new API
allows the user to set an HMAC key to be prepended to every score value
in the `Pose.cache` dictionary that effectively says "this was saved by
PyRosetta", so that it intentionally raises an error when the HMAC key
is missing or differs upon retrieval, indicating that the data appears
to have been tampered with or modified. By default, the HMAC key is
disabled (being set to `None`) in order to reduce memory overhead of the
`Pose.cache` dictionary; e.g., if 32 bytes are prepended to each score
value, with 1,000 score values that's 32,000 bytes or 32 KB of overhead,
and with a million score values that's 32 MB of overhead.
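The overhead estimate can be checked with a small sketch that prepends a 32-byte tag to each serialized value. It assumes that the prepended bytes are an HMAC-SHA256 digest computed with the user's key; the `pack_value` and `unpack_value` helpers are hypothetical and are not the `Pose.cache` implementation.

```python
import hashlib
import hmac
import pickle
from typing import Optional

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def pack_value(value, hmac_key: Optional[bytes] = None) -> bytes:
    """Serialize a value, prepending a digest when a key is set."""
    payload = pickle.dumps(value)
    if hmac_key is None:
        return payload  # default: no integrity tag and no extra memory
    return hmac.new(hmac_key, payload, hashlib.sha256).digest() + payload


def unpack_value(data: bytes, hmac_key: Optional[bytes] = None):
    """Verify the prepended digest (when a key is set) and deserialize."""
    if hmac_key is None:
        return pickle.loads(data)
    tag, payload = data[:DIGEST_SIZE], data[DIGEST_SIZE:]
    expected = hmac.new(hmac_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("Digest is missing or differs; data may be modified.")
    return pickle.loads(payload)


if __name__ == "__main__":
    key = b"example-key"  # placeholder key for the sketch
    overhead = len(pack_value(1.5, key)) - len(pack_value(1.5))
    print(overhead)              # bytes added per value
    print(overhead * 1_000)      # bytes added for 1,000 values
    print(overhead * 1_000_000)  # bytes added for a million values
```

The per-value cost is fixed at the digest size regardless of how large the value is, which is why the total scales linearly with the number of score values.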
The following are newly added functions:
- `pyrosetta.secure_unpickle.add_secure_package`: Add a package to the
unpickle allowed list
- `pyrosetta.secure_unpickle.remove_secure_package`: Remove a package
from the unpickle allowed list
- `pyrosetta.secure_unpickle.clear_secure_packages`: Remove all packages
from the unpickle allowed list
- `pyrosetta.secure_unpickle.get_disallowed_packages`: Return all
permanently disallowed packages/modules/prefixes
- `pyrosetta.secure_unpickle.get_secure_packages`: Return all packages
in the unpickle allowed list
- `pyrosetta.secure_unpickle.set_secure_packages`: Set all packages in
the unpickle allowed list
- `pyrosetta.secure_unpickle.set_unpickle_hmac_key`: Set the HMAC key
for the `Pose.cache` dictionary
- `pyrosetta.secure_unpickle.get_unpickle_hmac_key`: Return the HMAC key
for the `Pose.cache` dictionary
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>