Add beta_jan25 energy function (#548)
The aim of this PR is to add `beta_jan25` to the `rosetta` source code. This is an updated version of `beta_nov16`.
We will shortly post a manuscript describing how we developed `beta_jan25`. The key updates are to the LJ potential. We identified steric clashing in proteins that were relaxed or designed using `beta_nov16`, including examples in a high-quality benchmark from the `dualoptE` protocol used to train the energy function. We then used this benchmark, and the others in `dualoptE`, to refit a small number of LJ parameters. The refitting largely eliminated the clashing problem, and `beta_jan25` is as good as or better than `beta_nov16` when assessed on multiple benchmarks using validation data.
Frank also advised me that `-gen_potential` should stay the same (on top of `beta_nov16`), and we decided to have plain `-beta` invoke `beta_jan25`.
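For orientation, here is a minimal PyRosetta sketch of opting into the updated potential; it assumes the refit parameters ship as a weights set named `beta_jan25` and that plain `-beta` resolves to it as described above.

```python
import pyrosetta

# Plain -beta now selects beta_jan25 (per this PR).
pyrosetta.init("-beta")

# Assumption: the refit parameters are exposed as a weights set named
# "beta_jan25", paired with the -beta corrections enabled above.
sfxn = pyrosetta.create_score_function("beta_jan25")
```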
Moving the TestEnvironmentReproducibility unit test, and updating environment export commands (#567)
The aim of this PR is to:
- Update the pixi environment export command to save the `pixi.lock`
file, rather than exporting a generic YAML file, which does not pin
the package versions necessary for environment reproducibility.
- Add support for the pixi `--manifest-path` flag and the uv `--project`
flag, so that `PyRosettaCluster` simulations can be run from outside of
the pixi manifest or uv project directories.
- Add support to cache the pixi manifest file and uv `pyproject.toml`
file to PyRosettaCluster results, which are necessary to reproduce the
pixi/uv project in conjunction with the cached `pixi.lock` file or
`requirements.txt` file.
- Sanitize conda channels of username/password credentials instead of
stripping out the entire URL from the exported environment file. Since
PyRosetta no longer requires a username/password for installation, this
makes the simulation reproduction lifecycle simpler by obviating the
need to reconfigure the conda channel when recreating a virtual
environment (a sketch of the sanitization follows at the end of this
section).
- Update warning/exception messages to cover pixi, uv, and mamba,
rather than assuming conda.
- Move the `pyrosetta.distributed.cluster.recreate_environment` function
to the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository. The reason is a "chicken or the egg" problem: the
user must install PyRosetta (typically through an environment manager
like pixi, uv, conda, or mamba) in order to gain access to the
`recreate_environment` function, but the `recreate_environment` function
then cannot be run from inside a virtual environment, since it calls
`pixi install ...` and `uv venv ...`, which can acquire locks that hang
indefinitely because the outer environment is already activated.
Therefore, the code needs to be run from the system Python interpreter,
which requires the `recreate_environment` function to live in a
separate repository, as it is not functional in its current location.
- Move the
`pyrosetta.tests.distributed.cluster.test_reproducibility.TestEnvironmentReproducibility`
unit tests (and supporting modules) into the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository. These unit tests do not currently run on the
Benchmark server (which uses a bare virtual environment without
`conda`, `mamba`, `uv`, or `pixi` installed) and were originally added
to test the `recreate_environment` function from the
[RosettaCommons/pyrosetta-extras](https://github.com/RosettaCommons/pyrosetta-extras)
GitHub repository (however, that presented a second "chicken or the
egg" problem, which will be solved in a separate PR for that
repository).
The following is unrelated, but was updated while here:
- Add a more helpful exception when using `LocalCluster(security=True)`
and the `cryptography` package is not installed.
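As a rough illustration of the credential sanitization described above (not the actual implementation), a channel URL can be rebuilt without its user info using only the Python standard library:

```python
from urllib.parse import urlsplit, urlunsplit

def sanitize_channel_url(url: str) -> str:
    """Drop any username/password from a conda channel URL, keeping the rest."""
    parts = urlsplit(url)
    if parts.username or parts.password:
        # Rebuild the netloc without the credential portion.
        netloc = parts.hostname or ""
        if parts.port:
            netloc = f"{netloc}:{parts.port}"
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)

# sanitize_channel_url("https://user:secret@conda.example.org/channel")
# -> "https://conda.example.org/channel"
```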
Add uv/conda/mamba unit tests for PyRosettaCluster virtual environment recreation (#560)
This PR is a follow-up to PR #536 to add unit tests for `uv`, `conda`,
and `mamba` workflows in the PyRosettaCluster simulation reproduction
lifecycle. In particular, this PR changes the
`pyrosetta.distributed.cluster.recreate_environment` function to support
recreating a `conda`/`mamba` virtual environment in a local prefix
directory, rather than by name in the global `conda`/`mamba`
environments context, which mimics the `uv` and `pixi` workflows more
closely.
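For illustration only (the exact commands issued by `recreate_environment` may differ), the contrast between name-based and prefix-based conda environment creation looks like this:

```python
import subprocess

# Name-based: the environment is registered in the global conda environments
# directory and addressed by name.
subprocess.run(
    ["conda", "env", "create", "--name", "my_env", "--file", "environment.yml"],
    check=True,
)

# Prefix-based (the behavior adopted here): the environment lives in a local
# directory next to the project/results, mirroring how uv and pixi manage
# project-local environments.
subprocess.run(
    ["conda", "env", "create", "--prefix", "./.conda_env", "--file", "environment.yml"],
    check=True,
)
```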
Adding dask security to PyRosettaCluster (#531)
A primary feature of `PyRosettaCluster` is that arbitrary user-provided
PyRosetta protocols are pickled, sent over a network, and unpickled,
which allows the user to run customized macromolecular design and
modeling workflows. If the user is operating `PyRosettaCluster` behind a
trusted private network segment (i.e., a firewall), the current
implementation is already secure from external threats (such as
eavesdropping, tampering, or impersonation). However, when running
`PyRosettaCluster` outside a truly isolated and trusted environment, the
`dask` library can be configured to use TLS/SSL communication between
network endpoints for authenticated and encrypted transmission of data.
This PR aims to integrate Dask's TLS/SSL communication into
`PyRosettaCluster`, as well as implement a few additional security
measures:
1. Adds a `security` keyword argument to `PyRosettaCluster`, which can
accept a `dask.distributed.Security()` object. Alternatively, it accepts
a `bool`: if `True`, the `cryptography` package is used through the
`dask` and `dask-jobqueue` APIs to generate a temporary
`dask.distributed.Security()` object for the simulation (usage sketches
follow this list). Because `PyRosettaCluster` supports remote dask
worker instantiation via the `dask-jobqueue` module, security is now
enabled by default for remote clusters (such as `SLURMCluster`), and
thus this PR adds
[cryptography](https://pypi.org/project/cryptography/) as a required
package for the `pyrosetta.distributed` framework (note that
`cryptography` has very few dependencies, only `cffi` and OpenSSL, the
latter of which already ships with standard Python installations).
2. Adds a `pyrosetta.distributed.cluster.generate_dask_tls_security()`
function, which uses the OpenSSL executable that ships with standard
Python installations (due to the native Python `ssl` library) to
generate a pre-configured `dask.distributed.Security()` object with the
necessary key/certificate pairs.
3. Enables Hash-based Message Authentication Code (HMAC)-SHA256
verification of `cloudpickle`d data (including the arbitrary
user-provided PyRosetta protocols and task `kwargs`) between network
endpoints (including the host node process, each dask worker process,
and the `billiard` subprocesses; i.e., client ↔ worker, client ↔
subprocess), where the cryptographic pseudo-random key is sent to dask
workers out-of-band using a dask worker plugin.
4. Adds nonce caching on the host node process and all worker processes
when security is disabled, with a `max_nonce` keyword argument that
allows setting the maximum nonce cache size in each process. Nonces are
unique keys added to each distributed message over the network (see the
`cryptography` package
[Glossary](https://cryptography.io/en/latest/glossary/) for more
information); if the same nonce is encountered twice in the nonce
cache, it may indicate a replay attack, and the simulation is
intentionally terminated for security reasons. Note that nonce caching
is disabled if dask security is already enabled, since the nonce caches
may add several additional MB of memory per process (not much, but
unnecessary in that case).
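As a minimal sketch of the `security` keyword described in item 1 (not the full `PyRosettaCluster` call signature; `create_tasks` is a placeholder task generator and other arguments are omitted):

```python
from dask.distributed import Security
from pyrosetta.distributed.cluster import PyRosettaCluster

def create_tasks():
    # Placeholder: yield task dictionaries of PyRosetta options/protocol kwargs.
    yield {"options": "-ex1"}

# Option 1: let PyRosettaCluster generate temporary TLS credentials.
cluster = PyRosettaCluster(tasks=create_tasks, security=True)

# Option 2: supply a pre-configured dask Security object.
security = Security(
    tls_ca_file="ca.pem",
    tls_client_cert="client_cert.pem",
    tls_client_key="client_key.pem",
    require_encryption=True,
)
cluster = PyRosettaCluster(tasks=create_tasks, security=security)
```

And a rough sketch of the HMAC-SHA256 verification idea from item 3, using only the standard library (the actual keying and message framing in `PyRosettaCluster` may differ):

```python
import hashlib
import hmac

def sign(pickled_payload: bytes, key: bytes) -> bytes:
    """Compute an HMAC-SHA256 digest over the serialized payload."""
    return hmac.new(key, pickled_payload, hashlib.sha256).digest()

def verify(pickled_payload: bytes, key: bytes, digest: bytes) -> bool:
    """Constant-time check that the payload was not tampered with in transit."""
    return hmac.compare_digest(sign(pickled_payload, key), digest)
```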
Integrating PyRosetta initialization files into PyRosettaCluster (#511)
The purpose of this PR is to support several new features:
1. Adds an `output_init_file` instance attribute to `PyRosettaCluster`,
enabling dumping of a `.init` or `.init.bz2` file upon instantiation (a
sketch follows at the end of this section).
2. Adds `author`/`email`/`license` instance attributes to
`PyRosettaCluster`, which are cached in the `.init` or `.init.bz2` file
and in the output decoy and scorefile metadata.
3. Enables the `input_file` keyword argument of the
`pyrosetta.distributed.cluster.reproduce` method to accept a `.init` or
`.init.bz2` file that initializes PyRosetta before simulation
reproduction.
- Also adds a `skip_corrections` keyword argument to enable skipping
ScoreFunction corrections so that the reproduced results may be used for
successive reproductions.
4. Adds a `pyrosetta.distributed.cluster.export_init_file` function that
enables exporting an output decoy (in `.pdb`, `.pdb.bz2`, `.b64_pose`,
`.b64_pose.bz2`, `.pkl_pose`, `.pkl_pose.bz2` format) to a `.init` or
`.init.bz2` file format.
5. Adds a `norm_init_options` instance attribute to `PyRosettaCluster`,
enabling normalization of the task's PyRosetta initialization options.
This optional convenience feature takes advantage of the
`pyrosetta.get_init_options` method to update the `options` and
`extra_options` keyword arguments of each task after PyRosetta
initialization in the `billiard` subprocess on the dask workers, which
expands option names and relativizes any input files and directories to
the `billiard` subprocess's current working directory. Relativized paths
are ideal for reproducing simulations by a second party on a different
filesystem.
6. Adds `pyrosetta.distributed.io.read_init_file` and
`pyrosetta.distributed.io.init_from_file` functions, which handle
`.init` and `.init.bz2` files.
Please note that this PR also depends on PR #462, which supports
`.b64_pose` and `.pkl_pose` file outputs in `PyRosettaCluster`. The
impetus for supporting a `.init` file in the `PyRosettaCluster`
simulation reproduction life cycle is that loading a `.b64_pose` or
`.pkl_pose` file into memory requires that PyRosetta is initialized with
the same residue type set as that used to save the `.b64_pose` or
`.pkl_pose` file (otherwise PyRosetta doesn't know how to reconstruct
the `Pose`, resulting in a segfault). Effectively, a user remains locked
out of `.b64_pose` and `.pkl_pose` files unless PyRosetta is initialized
correctly, which can be easily accomplished by PyRosetta initialization
with a `.init` or `.init.bz2` file. Hence, if a user decides to output
results in `.b64_pose` or `.pkl_pose` format, the `.init` file can then
be used to initialize PyRosetta identically and load the `.b64_pose` or
`.pkl_pose` file into memory.
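As a minimal sketch of items 1, 2, and 6 above and of this reproduction workflow (other `PyRosettaCluster` arguments are omitted, `create_tasks` is a placeholder task generator, and exact defaults may differ):

```python
from pyrosetta.distributed.cluster import PyRosettaCluster
from pyrosetta.distributed.io import init_from_file

def create_tasks():
    # Placeholder: yield task dictionaries of PyRosetta options/protocol kwargs.
    yield {"options": "-ex1"}

# Dump a .init file on instantiation and record authorship metadata.
cluster = PyRosettaCluster(
    tasks=create_tasks,
    output_init_file="outputs/run.init",
    author="Jane Doe",
    email="jane.doe@example.org",
    license="MIT",
)

# Later (e.g., on another machine), before loading a .b64_pose/.pkl_pose decoy
# from this run, initialize PyRosetta identically so the residue type set
# matches the one used when the Pose was serialized.
init_from_file("outputs/run.init")
```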
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>
Supporting a universal PyRosetta initialization file (#503)
This PR aims to add support for a PyRosetta initialization file type
(i.e., a `.init` file) for reproducible PyRosetta initialization.
What started out as a quick and dirty method to cache `.params` files
morphed into a universal file type for streamlining PyRosetta
initialization. This approach takes advantage of the
`ProtocolSettingsMetric` SimpleMetric in PyRosetta to cache Rosetta
command line options and parser script variables, and the `zlib`,
`base64`, and `json` libraries for compressing/decompressing arbitrary
input text and binary files, including files within files (i.e.,
subfiles). Subfiles are cached via a brute-force search through all
input files, splitting file contents on spaces, which enables, for
example,
caching conformer files on `PDB_ROTAMERS` lines within `.params` files,
and caching subfiles in list files passed in with the `-l` Rosetta
command line flag. Note that the PyRosetta database and any input
directories are not cached in the `.init` file.
The following new methods are supported in this PR (a round-trip sketch
follows below):
- `pyrosetta.dump_init_file`: write a PyRosetta initialization `.init`
file
- `pyrosetta.init_from_file`: initialize PyRosetta from a `.init` file
- `pyrosetta.get_init_options`: get the currently initialized Rosetta
command line options
- `pyrosetta.get_init_options_from_file`: get PyRosetta initialization
options from a `.init` file
This is a work in progress, and suggestions/recommendations are welcome.
Once merged, ideally the `.init` file format will remain stable across
incremental PyRosetta versions.
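As a minimal round-trip sketch of the new helpers, assuming each takes a filename as its first positional argument (signatures may differ, and `LIG.params` is a placeholder input file):

```python
import pyrosetta

pyrosetta.init("-ex1 -extra_res_fa LIG.params")

# Cache the live initialization (options plus input files such as LIG.params,
# including any PDB_ROTAMERS conformer subfiles) into a single .init file.
pyrosetta.dump_init_file("session.init")

# In a later session, or on another machine: reproduce the same initialization
# and inspect the recovered options.
pyrosetta.init_from_file("session.init")
print(pyrosetta.get_init_options())
print(pyrosetta.get_init_options_from_file("session.init"))
```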
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>
Co-authored-by: Sergey Lyskov <3302736+lyskov@users.noreply.github.com>
Add the deterministic_flag option to the ProteinMPNNProbabilitiesMetric SimpleMetric (#485)
This PR adds the `deterministic_flag` option to the `ProteinMPNNProbabilitiesMetric` PerResidueProbabilitiesMetric. As shown by evidence provided in #429, the `ProteinMPNNMover` has the `deterministic_flag` option set to `false` by default, and therefore, due to Torch randomness, the `ProteinMPNNProbabilitiesMetric` can return non-deterministic PSSM tables even with the `-run:constant_seed 1` Rosetta option enabled. This PR simply exposes that `ProteinMPNNMover` control on the `ProteinMPNNProbabilitiesMetric` so that deterministic mode can be enabled when calculating per-residue ProteinMPNN probabilities.
Add option to control intermediate dumping in BackrubProtocol (#372)
Especially with XML (where the Backrub protocol is part of a more extensive protocol), you don't necessarily want an unconditional dump of the last & low results from the Backrub stage. This commit adds an XML option that allows dumping of those poses (the old behavior) but turns it off by default. This only changes XML usage -- I didn't change how the backrub application behaves.
Changes to the backrub_interface_ddG integration test are expected.