Revisions №637

branch: benchmark 「№637」
Committed by: Rocco Moretti
GitHub commit link: 「be92069389921d63」「№6352」
Difference from previous tested commit: code diff
Commit date: 2023-06-12 14:30:06

Merge pull request #6352 from RosettaCommons/roccomoretti/quick_restyping

Speed PDB loading by adding a Quick-and-Dirty ResidueTyping option.

One of the major contributors to PDB loading time is figuring out which ResidueTypes to use. PR #5659 fixes this somewhat, but it still contributes non-trivially.

For most PDBs (e.g. simple all-protein ones), figuring out the ResidueTypes is straightforward. As such, I've implemented an alternative ResidueTyping scheme which can be enabled with the new command line option -fast_restyping (and the corresponding option on StructFileReaderOptions).

The way it works is to assume that the three-letter code in the PDB is equivalent to the full type name. This should work for the canonical amino acids and -extra_res_fa ligands. To support more inputs, there are some epicycles added. The primary one is a fix-up for terminus patching. There's also some special casing for HIS/HIS_D calling, as well as D-aa/DNA/RNA/VRT. We also use the HETNAM specification if that's helpful, as well as falling back to the chemical components dictionaries for almost everything else.

The HETNAM records go part of the way towards round-tripping (that is, being able to read any Rosetta-outputted PDB properly with the flag on), but are insufficient. Adding full ResidueType name annotations in the output would be necessary for full support, and that's a potential future direction if this flag seems useful for people.

This approach is far from complete. In particular, most patching which happens due to the presence of atom names is missed (deliberately so). This is particularly an issue with carbohydrate-containing residues. The option is definitely not recommended for general use, though if you have "simple" PDBs (non-modified proteins, mostly), it should hopefully work for you. Caveat emptor, though.
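
As a rough illustration of the name-based lookup described above, here is a minimal Python sketch. It is not the Rosetta implementation (which is C++ and is not shown in this message): the function name, its parameters, and the `known_types` set are hypothetical, and the HIS/HIS_D rule (call HIS_D when HD1 is present and HE2 is absent) is an assumption about how such a check could work. The terminus variant suffixes follow Rosetta's usual patched-name style, but only the two full-terminus protein patches are modelled.

```python
def quick_residue_type_name(code, atom_names=(), known_types=frozenset(),
                            n_term=False, c_term=False):
    """Guess a full residue type name from a PDB three-letter code.

    Hypothetical helper: returns None when the code is not a known type
    name, so a caller could fall back to a slower, more thorough typing.
    """
    name = code.strip()

    # Assumed HIS/HIS_D special case: delta-protonated histidine carries
    # HD1 but not HE2.
    if name == "HIS":
        atoms = {a.strip() for a in atom_names}
        if "HD1" in atoms and "HE2" not in atoms:
            name = "HIS_D"

    # Core assumption of the scheme: the three-letter code is the type name.
    if name not in known_types:
        return None

    # Terminus fix-up: append the variant suffix for chain ends.
    if n_term:
        name += ":NtermProteinFull"
    if c_term:
        name += ":CtermProteinFull"
    return name
```

For example, with `known_types = {"ALA", "HIS", "HIS_D"}`, a first-residue alanine would come back as `ALA:NtermProteinFull`, while an unrecognized code returns `None`. Anything that depends on atom-name-driven patching (such as carbohydrates) is deliberately outside what a lookup like this can handle.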
In my test set, the time needed for ResidueType loading (which takes ~33% of the total runtime with current master) is reduced by a factor of 10, and most of the remaining portion of that is actually CCD residue type loading or ResidueTypeFinder time.
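
To put those figures in terms of overall loading time, the following back-of-the-envelope calculation applies Amdahl's law. It assumes the ~33% share and the factor of 10 quoted above, and that the rest of the runtime is unchanged; the function name is just for illustration.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: total speedup when `fraction` of the runtime
    is made `factor` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# ResidueType loading is ~33% of runtime and becomes ~10x faster:
# new runtime = 0.67 + 0.033 = 0.703 of the original.
print(round(overall_speedup(0.33, 10), 2))  # 1.42
```

Under those assumptions, total runtime drops by roughly 30%, and even an infinitely fast typing step would cap the overall speedup at about 1.5x.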

Summary

...