Minor change to PackedPose.scores dictionary construction (#612)
Currently in PyRosetta, instantiating a `PackedPose` object triggers
deserialization of the `Pose.cache` dictionary (due to the line
`self.scores = dict(pose_or_pack.cache)`). This is usually fine for
single Python processes where users have already added secure packages
to the unpickle-allowed list (i.e., via the `pyrosetta.secure_unpickle`
module). However, `PyRosettaCluster` spawns new threads using the
`billiard` package on Dask workers, where binary blobs are deserialized
into instantiated `PackedPose` objects, triggering deserialization of
the `Pose.cache` dictionary before users have had a chance to add secure
packages to the unpickle-allowed list in the `billiard` subprocess. This
can raise `UnpickleSecurityError` exceptions if users cache arbitrary
Python types in the `Pose.cache` dictionary, even when their PyRosetta
protocols add secure packages, because `PackedPose` instantiation
happens before the user-provided PyRosetta protocols run.
To remedy this, this PR updates the `PackedPose` class to
instantiate an empty dictionary on the `PackedPose.scores` attribute,
rather than a detached, flattened copy of the `Pose.cache` dictionary.
Not only does this free up the duplicated memory of all the
`Pose.cache` scores copied onto the `PackedPose` object, but it also
clarifies the intent of the `PackedPose` API: the `PackedPose` class
already implements the `PackedPose.update_scores` method, which is
specifically designed to embed new scores into the `Pose` object, and
because the `PackedPose.scores` attribute is currently a detached,
flattened copy of the `Pose.cache` dictionary, it can cause confusion
about which scores in the `PackedPose` object are the ground truth.
Regarding `PyRosettaCluster`, by preventing deserialization of the
`Pose.cache` dictionary upon `PackedPose` instantiation, users can now
programmatically add secure packages in their PyRosetta protocols per
Python process (as the `pyrosetta.secure_unpickle` module was designed),
and the underlying `PyRosettaCluster` infrastructure does not
unintentionally deserialize the `Pose.cache` dictionary just to compress
`PackedPose` objects in transit between the client and Dask workers.
Nonetheless, this minor change has the potential to break scripts in the
wild. Scripts that access `PackedPose.scores` for score values that were
originally embedded in the `Pose` object must now use
`PackedPose.pose.cache` (or the deprecated `PackedPose.pose.scores`).
The `PackedPose.scores` attribute still functions as before, as a
dictionary detached from the `Pose.cache` infrastructure; with this PR
it is simply no longer pre-populated from the `Pose.cache` dictionary
upon `PackedPose` instantiation. Users adding scores to the
`PackedPose.scores` dictionary can still access them there, although the
preferred approach remains the `PackedPose.update_scores` method.
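The before/after behavior can be sketched with stand-in classes (a minimal, hypothetical simplification; the real PyRosetta `Pose`/`PackedPose` classes and their cache machinery are more involved):

```python
# Illustrative stand-ins only, not the actual PyRosetta implementation.
class Pose:
    def __init__(self):
        self.cache = {}  # stand-in for the Pose.cache score dictionary

class PackedPose:
    def __init__(self, pose):
        self.pose = pose
        # Before this PR (conceptually): self.scores = dict(pose.cache),
        # which forced deserialization of every cached entry.
        # After this PR: start empty, so no Pose.cache deserialization occurs.
        self.scores = {}

    def update_scores(self, **scores):
        # Preferred route for embedding new scores into the underlying Pose.
        self.pose.cache.update(scores)

pose = Pose()
pose.cache["total_score"] = -123.4
pack = PackedPose(pose)
assert pack.scores == {}  # no longer pre-populated upon instantiation
assert pack.pose.cache["total_score"] == -123.4  # ground truth stays on the Pose
```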
---------
Co-authored-by: Sergey Lyskov <3302736+lyskov@users.noreply.github.com>
Update SecureUnpickler disallowed packages (#611)
This PR updates the `pyrosetta.secure_unpickle.SecureUnpickler` class to
block some additional callable targets reachable via the `pickle`
module, including the `numpy.load` and `pandas.read_pickle` functions.
Unit tests added herein demonstrate that secure `numpy`/`pandas`
callables such as `numpy.array` and `pandas.DataFrame` remain
deserializable.
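The general technique of blocking specific callable targets during unpickling can be sketched with the standard library's `pickle.Unpickler.find_class` hook. This is a generic illustration, not the actual `SecureUnpickler` implementation; `BlockingUnpickler`, `secure_loads`, the `BLOCKED` set, and the `Evil` payload are hypothetical stand-ins:

```python
import io
import pickle

# Hypothetical block list; the real SecureUnpickler maintains its own lists.
BLOCKED = {("builtins", "eval"), ("numpy", "load"), ("pandas", "read_pickle")}

class BlockingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Every global looked up during unpickling passes through this hook.
        if (module, name) in BLOCKED:
            raise pickle.UnpicklingError(f"blocked callable: {module}.{name}")
        return super().find_class(module, name)

def secure_loads(data):
    return BlockingUnpickler(io.BytesIO(data)).load()

# Benign payloads still round-trip.
assert secure_loads(pickle.dumps([1, 2, 3])) == [1, 2, 3]

# A payload whose reconstruction would call a blocked target is rejected.
class Evil:
    def __reduce__(self):
        return (eval, ("1 + 1",))

try:
    secure_loads(pickle.dumps(Evil()))
    raise AssertionError("blocked callable was not rejected")
except pickle.UnpicklingError:
    pass
```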
Minor updates to uv project export machinery in PyRosettaCluster (#608)
This minor PR addresses:
- Instead of relying on an exported `requirements.txt` file from the uv
project, the `uv.lock` file is now cached (similar to Pixi, which caches
the `pixi.lock` file). This change was made for several reasons:
- The `requirements.txt` file does not pin the Python version used to
execute the `PyRosettaCluster` simulation. We were relying on the
`pyproject.toml` file to provide this information; however, if the user
doesn't pin the Python version there, then there's no record of which
Python version is being used (other than inspecting the PyRosetta build
string or another hack).
- The `uv export` command doesn't provide sha256 hashes for custom
registries. Now that `pyrosetta` is shipped quarterly, this prevents use
of the `uv pip sync --require-hashes` flag, which would match hashes
for each requirement and make reproduction more robust. While `uv pip
compile` does support the `--generate-hashes` flag, again the Python
version is not captured in a `requirements.txt` file.
- Currently, `uv export` doesn't support emitting the `--find-links`
paths into the exported `requirements.txt` file. While `uv pip compile`
does support the `--emit-find-links` flag, again the Python version is
not captured in a `requirements.txt` file.
- Ensures the `self.toml_format` attribute is still set for the edge
case where the uv `pyproject.toml` file cannot be automatically located
via either the `UV_PROJECT` environment variable or the current working
directory (which currently leads to an `AttributeError` upon saving
results). A more helpful warning message is provided herein.
- Disables two `FutureWarning`s regarding the `filter_results` and
`norm_task_options` class attributes being enabled by default, which
were introduced in version `2.1.0` (merged ~8/25) and `3.0.0` (merged
~10/25), respectively.
Updated PARCS applications and IMMS_CCS score function (#609)
This application builds on the existing PARCS (parcs_ccs_calc.cc)
application and the IMMS_CCS energy term originally developed by
smturzo.
I extended PARCS to support multimeric protein complexes, enabling
simulation of PARCS CCS data for input structures containing multiple
chains. In addition, the existing IMMS_CCS energy term, which was
previously limited to monomers, was generalized for complexes through
the introduction of a new IMMS_ComplexCCS_Energy term. I also
implemented a new CCS_IMMS_with_CryoEMEnergy score term that integrates
experimental CCS data with cryo-EM information.
Method: For PARCS multimer support, I introduced a boolean flag
(`-multimer`) to the existing PARCS application. When enabled, the
algorithm predicts CCS values for multimeric assemblies by
reparameterizing the original CCS calculation.
For the IMMS-based energy terms, I built on the existing
CCS_IMMSEnergy implementation
(source/src/core/energy_methods/CCS_IMMSEnergy.cc/.hh) by adding new
energy classes:
* CCS_IMMSComplexEnergy, which enables CCS-based scoring for protein complexes
* CCS_IMMS_with_CryoEMEnergy, which incorporates cryo-EM restraints alongside experimental CCS data
Integration test: I added an integration test for the multimer case,
mirroring the existing monomer test with the additional `-multimer`
flag. The test passed.
Fix dropped settings issues in HighResDocker (#520)
The copy constructor of HighResDocker was not copying over the resfile_
member, which means it was ignoring that setting. Since the copy
constructor is effectively a straight member-by-member copy, we can
simply delete it and rely on the autogenerated copy constructor.
Additionally, I noticed that the initialize_from_options() function was
declaring local variables rather than updating the member variables.
This PR fixes that as well.
Supporting task retries in PyRosettaCluster (#605)
`PyRosettaCluster` supports running tasks on available compute
resources; however, often it's more economical to run tasks on
preemptible compute resources, such as cloud spot instances or backfill
queues. This PR exposes Dask's task retry API via the
`PyRosettaCluster.distribute` method, allowing configuration of the
number of automatic retries for each submitted task. When the `retries`
keyword argument is set, `PyRosettaCluster` will reschedule
failed tasks up to the specified number of times if compute resources
are reclaimed midway through a protocol.
This PR also adds a logging warning if using the `resources` keyword
argument with `dask` version `<2.1.0`.
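Conceptually, the retry semantics resemble the following minimal sketch. This is a toy stand-in for Dask's scheduler-level retries; `run_with_retries` and `flaky` are illustrative names only, and the real rescheduling is performed by the Dask scheduler:

```python
import logging

def run_with_retries(task, retries=0):
    """Run `task`, retrying up to `retries` extra times on failure."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                raise  # retries exhausted; surface the failure
            logging.warning("task failed (%s); retry %d/%d", exc, attempt + 1, retries)

calls = {"n": 0}

def flaky():
    # Simulate a protocol interrupted twice by reclaimed spot instances.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("worker preempted")
    return "ok"

assert run_with_retries(flaky, retries=3) == "ok"
assert calls["n"] == 3  # two preemptions, then success on the third attempt
```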
Preserve Pose bitwise representation when accessing Pose.cache SimpleMetrics data (#595)
This PR aims to make a minor change to the `Pose.cache` infrastructure.
Currently, accessing SimpleMetrics data in the `Pose.cache` dictionary
via the `get_sm_data` method lazily initializes a `SimpleMetricData`
entry in the `Pose.data` cache. If a `Pose` does not already contain
SimpleMetrics data, then merely accessing this getter method bitwise
mutates the `Pose` (albeit harmlessly, since it only instantiates an
empty container). However, in principle, a getter ought not to modify
the underlying state or binary representation of the `Pose`.
Herein, we instead conditionally check for
`Pose.data().has(CacheableDataType.SIMPLE_METRIC_DATA)`, performing the
same test at the PyRosetta layer that the `get_sm_data` method performs
at the C++ interface. If `False`, then we return an empty dictionary
without lazily initializing a `SimpleMetricData` entry in the
`Pose.data` cache. This minor update enables deterministic behavior when
instantiating `PackedPose` objects, during which the `Pose.cache`
dictionary is accessed to define the `PackedPose.scores` attribute. By
preventing lazy initialization of the SimpleMetrics data cache, the new
behavior is consistent regardless of whether input `Pose` objects
contain SimpleMetrics data. Unit tests are also added herein to ensure
consistency of the underlying bitwise representations of `Pose` objects
with and without SimpleMetrics data.
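The difference between the lazy-initializing getter and the new conditional check can be illustrated with toy stand-ins (hypothetical classes and names, not the actual PyRosetta/C++ implementation):

```python
# Stand-in for a Pose data cache with lazy-initializing access.
class DataCache:
    def __init__(self):
        self._data = {}

    def has(self, key):
        return key in self._data

    def get(self, key):
        # Lazy initialization: merely reading creates an empty entry,
        # changing the object's underlying representation.
        return self._data.setdefault(key, {})

SIMPLE_METRIC_DATA = "SIMPLE_METRIC_DATA"

def get_sm_data_old(cache):
    return cache.get(SIMPLE_METRIC_DATA)  # mutates on first access

def get_sm_data_new(cache):
    if not cache.has(SIMPLE_METRIC_DATA):  # check first, as in this PR
        return {}
    return cache.get(SIMPLE_METRIC_DATA)

c1, c2 = DataCache(), DataCache()
get_sm_data_old(c1)
get_sm_data_new(c2)
assert c1.has(SIMPLE_METRIC_DATA)      # old getter mutated the cache
assert not c2.has(SIMPLE_METRIC_DATA)  # new getter leaves it untouched
```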
Add beta_jan25 energy function (#548)
The aim of this PR is to add `beta_jan25` to the `rosetta` source code. This is an updated version of `beta_nov16`.
We will shortly post a manuscript describing how we developed `beta_jan25`. The key updates are to the LJ potential. We identified steric clashing in proteins that were relaxed or designed using `beta_nov16`. We identified examples of this problem in a high-quality benchmark from the `dualoptE` protocol used to train the energy function. We then used this benchmark, and the others in `dualoptE`, to refit a small number of LJ parameters. The refitting largely eliminated the clashing problem, and `beta_jan25` is as good or better than `beta_nov16` when assessed on multiple benchmarks using validation data.
Frank also advised me that `-gen_potential` should stay the same (on top of `beta_nov16`), and we decided to have plain `-beta` invoke `beta_jan25`.
Restoring the RandomGenerator state with PyRosetta initialization files (#576)
This quick PR supports:
1. Caching Rosetta's `RandomGenerator` Mersenne Twister (MT19937)
internal state in a PyRosetta initialization file
2. Optionally restoring the MT19937 internal state when initializing
from a PyRosetta initialization file
This enables continuity of the PyRosetta session's MT19937 internal
state from the point at which the PyRosetta initialization file was
written, independent of whether the `RandomGenerator` seed was
explicitly configured.
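By analogy, the save-and-restore pattern can be shown with the standard library's own Mersenne Twister; Rosetta's `RandomGenerator` and the PyRosetta initialization-file format differ in the details:

```python
import pickle
import random

# Advance a seeded MT19937 stream, then capture its internal state.
rng = random.Random(12345)
_ = [rng.random() for _ in range(100)]

state_blob = pickle.dumps(rng.getstate())    # cache the internal state
expected = [rng.random() for _ in range(5)]  # continue the original stream

# Later (e.g., in a fresh session), restore the cached state and resume.
restored = random.Random()
restored.setstate(pickle.loads(state_blob))
resumed = [restored.random() for _ in range(5)]

assert resumed == expected  # the stream continues exactly where it left off
```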
Support task scheduling priorities in PyRosettaCluster (#571)
This PR adds support for finer control of task execution orchestration
in PyRosettaCluster by exposing the task priority API of Dask's
schedulers. There are two major task execution patterns that the
user may wish to follow when setting up a PyRosettaCluster simulation:
1. _Breadth-first task execution:_ Currently, tasks generally run
following a first-in, first-out (FIFO-like) task chain behavior. This
means that when the Dask worker resources are saturated (a typical
scenario), all submitted tasks have equal priority and are front-loaded
to the upstream user-defined PyRosetta protocols, delaying execution of
the downstream protocols until all tasks finish the upstream protocols.
2. _Depth-first task execution:_ This PR enables task chains to run to
completion, by allowing the user to explicitly increase the priority of
tasks submitted to downstream user-defined PyRosetta protocols. This
means that when the Dask worker resources are saturated, once a task
finishes an upstream protocol, it is submitted to the next downstream
protocol with a higher priority than tasks still queued for the upstream
protocols, so task chains may run through all protocols to completion.
For example, to run user-defined PyRosetta protocols with depth-first
task execution, the `priorities` keyword argument is implemented in this
PR where higher priorities take precedence:
```python
PyRosettaCluster(...).distribute(
protocols=[protocol_1, protocol_2],
priorities=[0, 10],
)
```
Suppose the user has 10,000 tasks and only 10 Dask worker threads; with
depth-first task execution, the process is as follows:
1. All 10,000 tasks are queued to run `protocol_1`
2. 10 tasks are immediately scheduled to run `protocol_1` on available
Dask worker resources
3. As the 10 tasks complete `protocol_1`, they are immediately scheduled
to run `protocol_2` before the other 9,990 tasks queued for
`protocol_1`
4. As those 10 tasks complete `protocol_2`, their results are saved to
disk, and the next 10 tasks are immediately scheduled to run `protocol_1`
5. _Etc._
Note that in distributed cluster scenarios, tasks are scheduled on the
remote cluster _asynchronously_ from task submissions on the client.
Due to normal cluster-specific network latencies, there may be short
delays before a Dask worker receives a task even if its priority is
higher, leading to slightly nondeterministic behavior in practice; in
general, though, the task execution pattern follows the user's priority
specifications.
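A deliberately simplified single-worker scheduler built on a priority heap illustrates why raising the downstream priority flips the ordering from breadth-first to depth-first; `run` and its two-stage task model are hypothetical, and Dask's real scheduler is asynchronous and far more sophisticated:

```python
import heapq

def run(n_tasks, priorities):
    """Execute n_tasks through len(priorities) protocol stages, one at a time."""
    queue, order, seq = [], [], 0
    for task in range(n_tasks):
        # Negate priority so higher user priority pops first; seq breaks ties FIFO.
        heapq.heappush(queue, (-priorities[0], seq, task, 0))
        seq += 1
    while queue:
        _, _, task, stage = heapq.heappop(queue)
        order.append((task, stage))
        if stage + 1 < len(priorities):
            # Submit the finished task to the next downstream protocol.
            heapq.heappush(queue, (-priorities[stage + 1], seq, task, stage + 1))
            seq += 1
    return order

# Equal priorities: FIFO-like breadth-first (all tasks finish stage 0 first).
assert run(3, [0, 0]) == [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
# Higher downstream priority: depth-first (each task chain runs to completion).
assert run(3, [0, 10]) == [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```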