Adding dask security to PyRosettaCluster (#531)
A primary feature of `PyRosettaCluster` is that arbitrary user-provided
PyRosetta protocols are pickled, sent over a network, and unpickled,
which allows the user to run customized macromolecular design and
modeling workflows. If the user is operating `PyRosettaCluster` on a
trusted private network segment (e.g., behind a firewall), the current
implementation is already protected from external threats (such as
eavesdropping, tampering, or impersonation). However, when running
`PyRosettaCluster` outside a truly isolated and trusted environment, the
`dask` library can be configured to use TLS/SSL communication between
network endpoints for authenticated and encrypted transmission of data.
This PR aims to integrate Dask's TLS/SSL communication into
`PyRosettaCluster`, as well as implement a few additional security
measures:
1. Adds a `security` keyword argument to `PyRosettaCluster`, which can
accept a `dask.distributed.Security()` object. Alternatively, it accepts
a `bool` object, where if `True` we use the `cryptography` package
through the `dask` and `dask-jobqueue` APIs to generate a temporary
`dask.distributed.Security()` object for the simulation. Because
`PyRosettaCluster` supports remote dask worker instantiation via the
`dask-jobqueue` module, security is now enabled by default for the use
of remote clusters (such as `SLURMCluster`), and thus this PR adds
[cryptography](https://pypi.org/project/cryptography/) as a required
package for the `pyrosetta.distributed` framework (note that
`cryptography` has very few dependencies: only `cffi` and `openssl`,
the latter of which already ships with standard Python installations).
2. Adds a `pyrosetta.distributed.cluster.generate_dask_tls_security()`
function, which uses the OpenSSL executable that ships with standard
Python installations (due to the native python `ssl` library) to
generate a pre-configured `dask.distributed.Security()` object with the
necessary key/certificate pairs.
3. Enables Hash-based Message Authentication Code (HMAC)-SHA256
verification of `cloudpickle`d data (including the arbitrary
user-provided PyRosetta protocols and task `kwargs`) between network
endpoints (including the host node process, each dask worker process,
and the `billiard` subprocesses; i.e., client ↔ worker, client ↔
subprocess), where the cryptographic pseudo-random key is sent to dask
workers out-of-band using a dask worker plugin.
4. Adds nonce caching on the host node process and all worker processes
if security is disabled, with a `max_nonce` keyword argument that allows
setting the maximum nonce cache size in each process. Nonces are unique
values added to each message sent over the network (see the
`cryptography` package
[Glossary](https://cryptography.io/en/latest/glossary/) for more
information). If the same nonce is encountered twice in the nonce
cache, it may indicate a replay attack, and the simulation is
intentionally terminated for security reasons. Note that nonce caching
is disabled if dask security is already enabled, since the nonce caches
may add several MB of memory per process (a small overhead).
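The HMAC-SHA256 verification and nonce caching described in items 3 and 4 can be sketched with the standard library alone. This is a minimal illustration and not the `PyRosettaCluster` implementation: the names `sign`, `verify`, and `NonceCache` are hypothetical, the standard `pickle` module stands in for `cloudpickle`, and the oldest-first eviction policy is an assumption (only the `max_nonce` name comes from the PR).

```python
import hashlib
import hmac
import pickle
import secrets
from collections import OrderedDict

# Hypothetical names throughout; this is not PyRosettaCluster's internal API.
MAC_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256
NONCE_SIZE = 16


def sign(key: bytes, obj) -> bytes:
    """Serialize `obj`, prepend a fresh random nonce, then prepend an HMAC-SHA256 tag."""
    payload = secrets.token_bytes(NONCE_SIZE) + pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


class NonceCache:
    """Bounded cache of seen nonces; the oldest entry is evicted when full (assumed policy)."""

    def __init__(self, max_nonce: int = 4096):
        self.max_nonce = max_nonce
        self._seen = OrderedDict()

    def check(self, nonce: bytes) -> None:
        # A nonce seen twice may indicate a replayed message.
        if nonce in self._seen:
            raise RuntimeError("Repeated nonce: possible replay attack")
        self._seen[nonce] = None
        if len(self._seen) > self.max_nonce:
            self._seen.popitem(last=False)


def verify(key: bytes, message: bytes, cache: NonceCache):
    """Check the tag and the nonce before unpickling any data."""
    tag, payload = message[:MAC_SIZE], message[MAC_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed: message was altered")
    cache.check(payload[:NONCE_SIZE])
    return pickle.loads(payload[NONCE_SIZE:])
```

In this sketch the receiver unpickles only after both checks pass, so a tampered or replayed message never reaches the unpickling step. The shared key would need to reach each process through a separate channel, as the PR does with a dask worker plugin.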
Integrating PyRosetta initialization files into PyRosettaCluster (#511)
The purpose of this PR is to support several new features:
1. Adds an `output_init_file` instance attribute to `PyRosettaCluster`,
enabling dumping of a `.init` or `.init.bz2` file upon instantiation.
2. Adds `author`/`email`/`license` instance attributes to
`PyRosettaCluster`, which are cached in the `.init` or `.init.bz2` file
and written to the output decoy and scorefile metadata.
3. Enables the `input_file` keyword argument of the
`pyrosetta.distributed.cluster.reproduce` method to accept a `.init` or
`.init.bz2` file that initializes PyRosetta before simulation
reproduction.
- Also adds a `skip_corrections` keyword argument to enable skipping
ScoreFunction corrections so that the reproduced results may be used for
successive reproductions.
4. Adds a `pyrosetta.distributed.cluster.export_init_file` function that
enables exporting an output decoy (in `.pdb`, `.pdb.bz2`, `.b64_pose`,
`.b64_pose.bz2`, `.pkl_pose`, `.pkl_pose.bz2` format) to a `.init` or
`.init.bz2` file format.
5. Adds a `norm_init_options` instance attribute to `PyRosettaCluster`,
enabling normalization of the task's PyRosetta initialization options.
This optional convenience feature takes advantage of the
`pyrosetta.get_init_options` method to update the `options` and
`extra_options` keyword arguments of each task after PyRosetta
initialization in the `billiard` subprocess on the dask workers, which
expands option names and relativizes any input files and directories to
the `billiard` subprocess current working directory. Relativized paths
are ideal for reproduction of simulations by a second party on a
different filesystem.
6. Adds `pyrosetta.distributed.io.read_init_file` and
`pyrosetta.distributed.io.init_from_file` functions, which handle
`.init` and `.init.bz2` files.
Please note that this PR also depends on PR #462, which supports
`.b64_pose` and `.pkl_pose` file outputs in `PyRosettaCluster`. The
impetus for supporting a `.init` file in the `PyRosettaCluster`
simulation reproduction life cycle is that loading a `.b64_pose` or
`.pkl_pose` file into memory requires that PyRosetta is initialized with
the same residue type set as that used to save the `.b64_pose` or
`.pkl_pose` file (otherwise PyRosetta doesn't know how to reconstruct
the `Pose`, resulting in a segfault). Effectively, a user remains locked
out of `.b64_pose` and `.pkl_pose` files unless PyRosetta is initialized
correctly, which can be easily accomplished by PyRosetta initialization
with a `.init` or `.init.bz2` file. Hence, if a user decides to output
results in `.b64_pose` or `.pkl_pose` format, the `.init` file can then
be used to initialize PyRosetta identically and load the `.b64_pose` or
`.pkl_pose` file into memory.
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>
Supporting a universal PyRosetta initialization file (#503)
This PR aims to add support for a PyRosetta initialization file type
(i.e., a `.init` file) for reproducible PyRosetta initialization.
What started out as a quick and dirty method to cache `.params` files
morphed into a universal file type for streamlining PyRosetta
initialization. This approach takes advantage of the
`ProtocolSettingsMetric` SimpleMetric in PyRosetta to cache Rosetta
command line options and parser script variables, and `zlib`, `base64`,
and `json` libraries for compressing/decompressing arbitrary input text
and binary files, including files within files (i.e., subfiles).
Subfiles are cached by doing a brute force search through all input
files by splitting file contents on spaces, which enables, for example,
caching conformer files on `PDB_ROTAMERS` lines within `.params` files,
and caching subfiles in list files passed in with the `-l` Rosetta
command line flag. Note that the PyRosetta database and any input
directories are not cached in the `.init` file.
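The caching approach described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the `.init` file format itself: the function names `cache_files` and `decode_file` are hypothetical, the cache is a flat dictionary keyed by path, and tokens are tested as paths exactly as written rather than resolved relative to the parent file.

```python
import base64
import json
import os
import zlib


def cache_files(paths, cache=None) -> dict:
    """Cache each existing file, then recurse into any token that names another file."""
    cache = {} if cache is None else cache
    for path in paths:
        # Skipping already-cached paths also prevents infinite recursion on cycles.
        if path in cache or not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            data = f.read()
        # zlib compresses the raw bytes; base64 makes them safe to store in JSON.
        cache[path] = base64.b64encode(zlib.compress(data)).decode("ascii")
        # Brute-force subfile search: split contents on whitespace and test each token.
        tokens = data.decode("utf-8", errors="ignore").split()
        cache_files(tokens, cache)
    return cache


def decode_file(encoded: str) -> bytes:
    """Invert the encoding applied in `cache_files`."""
    return zlib.decompress(base64.b64decode(encoded))


def dumps_cache(cache: dict) -> str:
    """Serialize the cache as JSON text."""
    return json.dumps(cache, indent=2)
```

Because both text and binary files are read as bytes, the same path handles `.params` files, list files, and binary inputs; non-path tokens simply fail the `os.path.isfile` test and are ignored.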
The following new methods are supported in this PR:
- `pyrosetta.dump_init_file`: write a PyRosetta initialization `.init`
file
- `pyrosetta.init_from_file`: initialize PyRosetta from a `.init` file
- `pyrosetta.get_init_options`: get the currently initialized Rosetta
command line options
- `pyrosetta.get_init_options_from_file`: get PyRosetta initialization
options from a `.init` file
This is a work in progress and suggestions/recommendations are welcome.
Once merged, ideally the `.init` file format can remain stable over
incremental PyRosetta versions.
---------
Co-authored-by: Rachel Clune <rachel.clune@omsf.io>
Co-authored-by: Sergey Lyskov <3302736+lyskov@users.noreply.github.com>
Add the deterministic_flag option to the ProteinMPNNProbabilitiesMetric SimpleMetric (#485)
This PR adds the `deterministic_flag` option to the `ProteinMPNNProbabilitiesMetric` PerResidueProbabilitiesMetric. As shown by evidence provided in #429, the `ProteinMPNNMover` has its `deterministic_flag` option set to `false` by default, and therefore, due to Torch randomness, the `ProteinMPNNProbabilitiesMetric` can return non-deterministic PSSM tables even with the `-run:constant_seed 1` Rosetta option enabled. This PR exposes control of the `ProteinMPNNMover` from the `ProteinMPNNProbabilitiesMetric` to enable deterministic mode when calculating per-residue ProteinMPNN probabilities.
Add option to control intermediate dumping in BackrubProtocol (#372)
Especially with XML (where the Backrub protocol is part of a more extensive protocol), you don't necessarily want an unconditional dumping of the last & low results from the Backrub stage. This commit adds an XML option which will allow the dumping of the poses (old behavior), but turns it off by default. This only changes XML usage -- I didn't change how the backrub application behaves.
Changes to the backrub_interface_ddG integration test are expected.
Fixes for failing tests in main (#426)
Address some currently failing tests:
* Update the documentation repo with autogenerated data (and update the submodule SHA1)
* Fix clang tidy issues from the AMRLD code
* Make log file parsing in peptide_pnear_vs_ic50 scientific test more robust
* Fix compilation issues with Clang 19 (Issue #404)
Add cmath includes (#277)
`std::abs(Real)` needs the real-valued version of the `std::abs()` function, which is found in the cmath header. (Versus the integer-valued versions, which can also be found in the cstdlib header.)
We were missing those imports in a few locations, which was causing my compiler to complain.
Rosetta automated reaction-based ligand design (RosettaAMRLD) (#403)
This is code for the new application RosettaAMRLD.
The framework is based on the Monte Carlo Drug Design branch (PR #282) by Rocco Moretti (@roccomoretti).
The algorithm uses Monte Carlo Metropolis methods and similarity-guided sampling to iteratively optimize molecules within a combinatorial library.
The changes are primarily new Filters and Chemistries added to `src/protocols/drug_design/` and modifications to the DrugDesignMover for RosettaAMRLD-related functionality and Simulated Annealing options.
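For readers unfamiliar with the technique, the Metropolis criterion with simulated annealing can be sketched generically as below. This is not the RosettaAMRLD or DrugDesignMover code: the function names, the geometric cooling schedule, and the `propose`/`score` callables are all assumptions for illustration, and similarity-guided sampling is not modeled.

```python
import math
import random


def metropolis_accept(old_score, new_score, temperature, rng=random):
    """Metropolis criterion for minimization.

    Improvements are always accepted; worse candidates are accepted with
    Boltzmann probability exp(-(new - old) / T).
    """
    if new_score <= old_score:
        return True
    return rng.random() < math.exp(-(new_score - old_score) / temperature)


def anneal(start, propose, score, n_steps=1000, t_start=1.0, t_end=0.01, seed=0):
    """Minimize `score` by Monte Carlo with a geometric cooling schedule (assumed)."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for step in range(n_steps):
        # Interpolate the temperature geometrically from t_start down to t_end.
        frac = step / max(n_steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        if metropolis_accept(current_score, candidate_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score
```

In a ligand design setting, `propose` would modify a molecule within the combinatorial library and `score` would evaluate it; here they are left as caller-supplied functions.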
Other changes include:
* Added a Python script that plots the distribution of (RDKit) metrics (by RM)
* Added edge case handling when removing charges in RDKit utils (by RM)
* Modified the recompute function in `source/src/core/pose/metrics/simple_calculators/InterfaceDeltaEnergeticsCalculator.cc` to handle multiple chains. (Previously it calculated energies between the first and second chains; it now calculates energies between all chains and the last chain.)
---------
Co-authored-by: Rocco Moretti <rmorettiase@gmail.com>
Rosetta Evolutionary Ligand (REvoLd) (#303)
This PR aims to include the newly developed application REvoLd into Rosetta.
REvoLd preprint: https://arxiv.org/abs/2404.17329
The algorithm optimizes small molecule ligands within combinatorial libraries like Enamine REAL. Most code changes are self-enclosed and do not affect any functionality outside of the application. The only exceptions are changes to the CMake compile directives for MPI, which were broken.
A guide on how to run REvoLd is available here: https://docs.rosettacommons.org/docs/wiki/revold
Unit and integration tests were run locally and all passed.
---------
Co-authored-by: Paul Eisenhuth <eisenhuth@informatik.uni-leipzig.de>