Fixing FavorSequenceProfile to accept reference_name (#6407)
* relatively surface-level changes to parse_my_tag and xml_schema to enable functionality already present.
* Add reference_name to xml_schema
* adding an integration test for FavorSequenceProfile inputs. Removing extra reference_name definition from provide_xml_schema()
Merge pull request #6352 from RosettaCommons/roccomoretti/quick_restyping
Speed PDB loading by adding a Quick-and-Dirty ResidueTyping option.
One of the major contributors to the speed of PDB loading is figuring out the ResidueTypes to use. PR #5659 fixes this somewhat, but it still contributes non-trivially.
For most PDBs (e.g. simple all-protein ones), figuring out the ResidueTypes is straightforward. As such, I've implemented an alternative ResidueTyping scheme which can be enabled with the new command line option -fast_restyping (and the corresponding option on StructFileReaderOptions).
The way it works is to assume that the three letter code in the PDB is equivalent to the full type name. This should work for the canonical amino acids and -extra_res_fa ligands. To support more inputs, there are some epicycles added. The primary one is a fix-up for terminus patching. There's also some special casing for HIS/HIS_D calling, as well as D-aa/DNA/RNA/VRT. We also use the HETNAM specification if that's helpful, as well as falling back to the chemical components dictionaries for most everything else.
The HETNAM records go part of the way towards round-tripping (that is, being able to read any Rosetta-outputted PDB properly with the flag on), but are insufficient. Adding full ResidueType name annotations in the output would be necessary for full support, and that's a potential future direction if this flag seems useful for people.
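As a rough illustration of the lookup order described above, here is a minimal Python sketch. It is not the Rosetta implementation: KNOWN_TYPES, quick_type_name() and its arguments are hypothetical names, the order in which the three letter code and HETNAM are tried is an assumption, and the patch suffixes are only there to show where the terminus fix-up would be applied.

```python
# Hypothetical sketch; none of these names are Rosetta APIs.
# KNOWN_TYPES stands in for the full type names already loaded in a ResidueTypeSet.
KNOWN_TYPES = {"ALA", "GLY", "SER", "HIS", "HIS_D"}


def quick_type_name(code3, hetnam=None, lower_terminus=False, upper_terminus=False):
    """Guess a full type name from a PDB three letter code, or return None."""
    if code3 in KNOWN_TYPES:
        # Main assumption of the scheme: three letter code == full type name.
        base = code3
    elif hetnam is not None and hetnam in KNOWN_TYPES:
        # Assumed ordering: consult the HETNAM annotation only if the code fails.
        base = hetnam
    else:
        # Caller would fall back to something slower (e.g. a CCD lookup).
        return None

    # Terminus fix-up: append illustrative patch suffixes to the base name.
    if lower_terminus:
        base += ":NtermProteinFull"
    if upper_terminus:
        base += ":CtermProteinFull"
    return base
```

With this toy table, quick_type_name("ALA", lower_terminus=True) gives a patched name directly, while an unrecognized code returns None so that the caller can take the slow path.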
This approach is far from complete. In particular, most patching which happens due to the presence of atom names is missed (deliberately so). This is particularly an issue with carbohydrate-containing residues. The option is definitely not recommended for general use, though if you have "simple" PDBs (non-modified proteins, mostly), it should hopefully work for you. Caveat emptor, though.
In my test set, the time needed for ResidueType loading (which takes ~33% of the total runtime with current master) is reduced by a factor of 10, and most of the remaining portion of that is actually CCD residue type loading or ResidueTypeFinder time.
Merge pull request #6343 from RosettaCommons/roccomoretti/avoid_atomtree_updates
Speed up PDB loading by deferring AtomTree updates.
Benchmarking indicates that ~10-15% of the time for PDB loading is due to AtomTree::update_sequence_numbering(), which is called for each residue addition.
This is completely unnecessary, as we can just call setup_atom_tree() after we're all done adding the residues. To enable this we create a new Conformation function which takes a list of residues to append and does the addition all at once.
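The difference between the two approaches can be sketched with a toy model in Python. ToyConformation and its methods are hypothetical stand-ins, not the actual Conformation or AtomTree interfaces; the counter just makes the number of whole-structure update passes visible.

```python
# Toy model only; class and method names are hypothetical, not Rosetta's.
class ToyConformation:
    def __init__(self):
        self.residues = []
        self.numbering = []
        self.update_calls = 0

    def _update_numbering(self):
        # Stand-in for an expensive pass over the whole structure.
        self.update_calls += 1
        self.numbering = list(range(1, len(self.residues) + 1))

    def append_residue(self, residue):
        # Old pattern: one full update per appended residue.
        self.residues.append(residue)
        self._update_numbering()

    def append_residues(self, residues):
        # Batched pattern: add everything, then update once at the end.
        self.residues.extend(residues)
        self._update_numbering()
```

Appending N residues one at a time triggers N update passes in this model, while the batched call triggers one and ends with the same numbering.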
A quick test shows we do save ~10-15% of the time with this approach, and as far as I can tell we don't lose anything by the rearrangement.
Merge pull request #5659 from RosettaCommons/roccomoretti/speed_up_scoring
Speed up PDB loading by improving the ResidueTypeFinder.
Rosetta 3.13 is about 30% slower than Rosetta 3.12 when doing a plain rescoring of a large number of PDBs. I tracked that down to PDB file loading, specifically extra time in the ResidueTypeFinder, in large part due to patching.
I was able to rearrange some things to reduce wasted effort and make things more efficient. In my hands, we're now about twice as fast as Rosetta 3.13 when doing plain re-scoring.
It still looks like the ResidueTypeFinder is a substantial factor in the runtime (as measured by perf), but not necessarily in an easily fixable way.
Merge pull request #6401 from RosettaCommons/roccomoretti/fix_transform
Fix multi-repeat sampling in Transform mover.
An edit to the Transform mover meant that the best_ligand (which is eventually output) was being reset across repeats, rather than being accumulated across all repeats. This meant that having a repeat setting other than 1 (which is thankfully the default) was meaningless, as all the n-1 repeats before the last one didn't affect the output structure.
A minor adjustment means that we can continue to accumulate the best_ligand across all the repeats.
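The shape of the bug can be sketched in a few lines of Python. This is a simplified model, not the Transform mover code: the function names are hypothetical, scores stand in for ligand poses (lower is better), and placing the reset inside the repeat loop is an assumption about how the reset happened.

```python
# Simplified model; function names and the score-only "ligand" are hypothetical.
def best_over_repeats_buggy(repeat_scores):
    """repeat_scores: one list of candidate scores per repeat."""
    best = None
    for scores in repeat_scores:
        best = None  # Bug: the running best is discarded at the start of each repeat.
        for score in scores:
            if best is None or score < best:
                best = score
    return best


def best_over_repeats_fixed(repeat_scores):
    """Same loop, but the running best is initialized once and kept across repeats."""
    best = None
    for scores in repeat_scores:
        for score in scores:
            if best is None or score < best:
                best = score
    return best
```

For repeats scoring [[-5.0, -9.0], [-3.0, -4.0]], the buggy version returns -4.0 (only the last repeat counts) and the fixed version returns -9.0.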
Merge pull request #6390 from RosettaCommons/roccomoretti/iterative_kinematic_clone
Convert kinematics::tree::Atom::clone() to use a non-recursive algorithm
kinematics::tree::Atom::clone() is currently recursive, which looks to potentially have issues with stack size limits on certain machines for large proteins. (Well, possibly. That's the best tea-leaf reading we have for some issues we see on Foldit with M1 Macs.)
It looks rather straightforward to hold the "todo" list on the heap, and handle the parent/child reassignment non-recursively. This should avoid ultra-deep program stacks, and does not seem to affect program runtime.
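The general technique, sketched in Python on a generic tree rather than on the actual kinematics::tree::Atom classes (Node and clone_iterative are hypothetical names): pending work lives in an ordinary list on the heap, so tree depth no longer translates into call-stack depth.

```python
# Generic tree sketch; Node and clone_iterative are hypothetical, not Rosetta classes.
class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)


def clone_iterative(root):
    """Deep-copy a tree without recursion, preserving child order."""
    new_root = Node(root.name)
    # Each entry pairs an original node with its already-created copy.
    todo = [(root, new_root)]
    while todo:
        old, new = todo.pop()
        for old_child in old.children:
            new_child = Node(old_child.name)
            new.add_child(new_child)  # Sets the parent link on the copy.
            todo.append((old_child, new_child))
    return new_root
```

A chain several thousand nodes deep, which would exceed Python's default recursion limit with a naive recursive clone, copies without trouble here because the loop never nests calls.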