Recipe: print protein 2.9.1
Created by LociOiling

Profile

Name: print protein 2.9.1
ID: 103314
Created on: Wed, 03/18/2020 - 12:00
Updated on: Thu, 04/16/2020 - 20:33
Description:

Update of "print protein lua2 V0" by marie_s. Version 2.9 makes the recipe chain-aware and makes many small changes. Version 2.9.1 is a quick fix for proline at the N terminal.



Comments

LociOiling's picture
User offline. Last seen 4 hours 3 min ago. Offline
Joined: 12/27/2012
Groups: Beta Folders
now with chains

Version 2.9 includes detection of chains, which may be helpful in the coronavirus (CoV) puzzles and other complex puzzles.

The recipes AA Edit 2.0 and SS Edit 2.0 also detect chains using similar logic.

In the coronavirus puzzles to date, segments 1 to 117 are the CoV spike protein, which is identified as chain A in the chain-aware recipes. The remaining segments are the designable binder, identified as chain B.

The sequence information is now reported by chain, using letters, A, B, C, and so on.

Foldit still does everything by segment number, so segment 117 is the last segment of chain A in the CoV puzzles, and segment 118 is the first segment of chain B.

In scientific sources, each chain is numbered separately, and the term "residue" is used instead of segment. In the CoV puzzles, chain A has residues 1-117, then chain B, the binder, starts again at residue 1.
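
The segment-versus-residue numbering above can be sketched in a few lines of Python. This is only an illustration, not the recipe's Lua code: the function name `segment_to_residue` and the `chain_starts` list are hypothetical, and the chain boundaries are taken from the CoV example (chain A starts at segment 1, chain B at segment 118).

```python
from bisect import bisect_right
from string import ascii_uppercase


def segment_to_residue(segment, chain_starts):
    """Map a Foldit segment number to (chain letter, residue number).

    chain_starts is a sorted list holding the first segment of each chain,
    for example [1, 118] for the CoV puzzles described above.
    """
    if segment < chain_starts[0]:
        raise ValueError("segment precedes the first chain")
    # Index of the last chain whose first segment is <= segment
    index = bisect_right(chain_starts, segment) - 1
    chain = ascii_uppercase[index]
    residue = segment - chain_starts[index] + 1
    return chain, residue


print(segment_to_residue(117, [1, 118]))  # ('A', 117)
print(segment_to_residue(118, [1, 118]))  # ('B', 1)
print(segment_to_residue(196, [1, 118]))  # ('B', 79)
```

For the first chain, the two numbers come out the same, which is why a single ruler is enough there.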

In print protein, a chain may have two sets of "rulers" providing numbering, identifying segment versus residue numbering.

For the first chain, the numbers are the same, so there's just one ruler above the sequence as before. It's now labelled "segment/residue" at the end.

For the second and subsequent chains, the ruler above the sequence starts at 1, and is labelled "residue". Then there's a second ruler underneath the sequence, labelled "segment", giving the Foldit segment number.

For example, the primary sequence of the B chain in puzzle 1811 might look like this:

chain B, type = protein, start = 118, end = 196, length = 79

primary structure sequence (single-letter amino acid codes for searching PDB)

         1         2         3         4         5         6         7        7
1234567890123456789012345678901234567890123456789012345678901234567890123456789  residue 1-79

dqdelkkemderlkewkkkfeelirkgtrqietivdkilhevwdvvyqyvvkddernkedlkkvekllkfvkevydrqk

8901234567890123456789012345678901234567890123456789012345678901234567890123456  segment 118-196
1 2         3         4         5         6         7         8         9     9
1 1         1         1         1         1         1         1         1     1
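
As a rough illustration of how the units line of each ruler could be produced, here is a small Python sketch. The helper name `units_ruler` is made up for this example; the recipe itself is written in Lua and may build its rulers differently. Only the units digits are generated here, not the tens or hundreds lines.

```python
def units_ruler(start, length):
    """Return the units-digit ruler line for numbers start..start+length-1."""
    return "".join(str((start + i) % 10) for i in range(length))


# Chain B from the example above: residues 1-79, segments 118-196
residue_ruler = units_ruler(1, 79)
segment_ruler = units_ruler(118, 79)

print(residue_ruler + "  residue 1-79")
print(segment_ruler + "  segment 118-196")
```

The two lines differ only in their starting number, so the same helper serves both the residue ruler above the sequence and the segment ruler below it.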

The primary and secondary structure are now reported by chain, but some of the other reports are still "global", covering all segments at once.

In addition to chain detection, print protein has some new electron density comparisons, as suggested by jeff101 long ago. The density reports now compare helixes versus non-helixes and sheets versus non-sheets. Not all of jeff101's suggested features were implemented; stay tuned on that.

There are also some changes in the main spreadsheet output. For puzzles with more than one chain, the chain id and residue number are reported in new columns.

The "backbone locked" and "sidechain locked" indicators are now two separate columns, where previously they shared one column.

There are also two new columns: "Subtotal", and "Unknown". "Subtotal" reports the total of all subscores for each segment. "Unknown" is the difference between the segment score (current.GetSegmentEnergyScore) and the total of all subscores.

In the CoV puzzle, there are locked segments with a non-zero segment score, but all-zero subscores. The "Unknown" column highlights these segments.
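
The arithmetic behind the two new columns is simple enough to sketch in Python. The function name and the subscore names and values below are illustrative assumptions, not taken from the recipe.

```python
def subtotal_and_unknown(segment_score, subscores):
    """Return (subtotal, unknown) for one segment.

    segment_score: the overall segment score reported by Foldit
    subscores: dict of subscore name -> value (names here are illustrative)
    """
    subtotal = sum(subscores.values())
    unknown = segment_score - subtotal
    return subtotal, unknown


# A locked segment with a non-zero score but all-zero subscores (made-up values)
locked = {"backbone": 0.0, "clashing": 0.0, "packing": 0.0}
print(subtotal_and_unknown(12.5, locked))  # (0.0, 12.5)
```

In a case like this, the whole segment score ends up in the "Unknown" column, which is what makes such segments stand out in the spreadsheet.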

Chains and new columns aside, there are many other minor format changes, intended to make the scriptlog output more user-friendly. There is now a "line length" setting under the "More" button. The line length determines the length of lines printed with a ruler, such as the primary and secondary structure. (Longer lines are also printed without a ruler, for the sake of cut-and-paste.)

There are also a number of internal changes, such as using named members in some internal tables, and converting to method-style function calls in protNfo and other spots.

(edit: corrected what's new in the density reports)

full documentation

Version 2.9 of "print protein" now includes detecting chains. Primary and secondary structure, along with several related items, are now reported separately by chain. Chains have both residue and segment numbering, making for easier comparisons with external sources.

print protein overview

This version of "print protein" is based on the classic "print protein lua2 V0" by marie_s.

The recipe displays detailed scoring information, including each segment's score and subscores. The subscores include categories like backbone, clashing, packing, hiding, and ideality.

The recipe also reports the protein's primary structure (amino acid sequence) and secondary structure (helixes, sheets, and loops).

The recipe offers copy-and-paste reporting for most of its key outputs. Complete output is also available in the recipe's scriptlog.

Thanks to spvincent, Timo van der Laan, and HerobrinesArmy for code and ideas. Thanks to brgreening for helping to illuminate the mystery of the total score.

protein information

At the start, the recipe gathers all available information from Foldit. This can take a while on large puzzles.

The recipe reports the overall segment count, and gives the details of any ligands found.

The recipe also looks for chains. Most puzzles contain only a single protein chain, but some have multiple chains of proteins or even chains of DNA or RNA. For reporting purposes, each type of chain and also each ligand is listed separately.

scoring information

The recipe detects active subscores using the logic found in "Tvdl enhanced DRW".

In some cases, this logic may suppress certain subscores, such as disulfides, when they have a low total value across all segments. The recipe reports the active subscores in the main dialog and the scriptlog.

The recipe calculates the "filter bonus" by toggling filters off and on, and then checks the total score. In theory, the total score is 8000 points plus the total of all segment subscores, plus the filter bonus. There is usually a discrepancy, which is reported as "dark" score.
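
The score bookkeeping described above might be sketched as follows in Python. The function names are hypothetical, and two details are assumptions: that the filter bonus is the score with filters on minus the score with filters off, and that the "dark" score is the actual total minus the expected total (the sign convention is not stated here).

```python
BASE_SCORE = 8000  # starting points, per the description above


def filter_bonus(score_filters_on, score_filters_off):
    """Assumed definition: what the filters add to (or subtract from) the score."""
    return score_filters_on - score_filters_off


def dark_score(total_score, segment_subscore_total, bonus):
    """Discrepancy between the actual total and the theoretical total."""
    expected = BASE_SCORE + segment_subscore_total + bonus
    return total_score - expected


# Made-up numbers for illustration
bonus = filter_bonus(9500.0, 9425.0)     # 75.0
print(dark_score(9500.0, 1400.0, bonus))  # 25.0
```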

The recipe also reports on each individual filter, including whether it's in the "satisfied" condition, and also whether it awards a bonus, and the actual bonus/penalty if so.

The recipe also reports the Rosetta energy score (scoreboard.GetScore), normally a negative number. The recipe converts the Rosetta score to a Foldit score using the formula "FolditScore = 10 * ( 800 - RosettaScore )". Again, there's normally a discrepancy between this converted score and the current score reported by the Foldit client.
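
The conversion formula can be checked with a one-line Python function. The function name is made up for this sketch; the formula itself is the one quoted above.

```python
def rosetta_to_foldit(rosetta_score):
    """Convert a Rosetta energy score to a Foldit score: 10 * (800 - Rosetta)."""
    return 10 * (800 - rosetta_score)


print(rosetta_to_foldit(-150.0))  # 9500.0
print(rosetta_to_foldit(0))       # 8000
```

A Rosetta score of zero maps to 8000, matching the 8000-point base mentioned earlier, and a more negative Rosetta score gives a higher Foldit score.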

sequence information

The recipe reports the primary sequence as a string of single-letter codes, and the secondary structure as a string with "H" for helix, "E" for sheet, "L" for loop, and "M" for molecule, indicating a ligand.

The recipe also uses single-letter codes for RNA or DNA bases.

Complex puzzles may include one or more ligands, multiple protein chains, and even DNA or RNA. Print protein treats each of these items as a separate "chain", reporting its sequence information separately. The recipe assigns chains the identifiers A, B, C, and so on, similar to the scheme found in the PDB.

In the scriptlog, the sequence and secondary structure information is reported both as single strings, and as fixed-length lines with rulers. The single strings are for copy-and-paste into other tools, while the rulers make it easier to find a specific segment. In complex cases, there may be two rulers, with the ruler above giving one-based "residue" numbers, specific to a chain, while the ruler below gives "segment" numbers, which are continuous across all chains.

The recipe also makes the primary sequence and secondary structure available in a copy-and-paste dialog. Each chain has its own set of copy-and-paste fields.

The recipe issues warning messages to the scriptlog if a non-standard amino acid code or secondary structure code is found. The code "x" is substituted for a non-standard amino acid code. Ligands are represented by code "x".

The recipe also reports hydrophobicity as a string with "i" if hydrophobic, and "e" if not hydrophobic.

Locked segments are reported as single-character codes - "U" for unlocked, and "L" for locked. There are separate strings for locked backbone and locked sidechains. The same information is also presented in other sections.

modifiable sections

The recipe reports on modifiable sections, including locked and unlocked sections, zero-score sections, and mutable sections. The "mutable segments" report is now optional.

The modifiable sections are reported in Lua table format. Although readable by humans, this table-formatted output can be copied into a recipe if desired. The format is compatible with the "segment set and list" module found in TvdL Enhanced DRW and related recipes. Specifically, it uses the "set" format, where each entry in the table gives a starting and ending segment. (Each entry is in itself a table....)
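
To make the "set" format more concrete, here is a Python sketch that collapses a list of segment numbers into start/end ranges and prints them as a Lua-style table of tables. Both function names are hypothetical, and the exact spacing of the printed table is an assumption; the recipe's real output may be laid out differently.

```python
def segments_to_ranges(segments):
    """Collapse segment numbers into [start, end] ranges of consecutive segments."""
    ranges = []
    for seg in sorted(set(segments)):
        if ranges and seg == ranges[-1][1] + 1:
            ranges[-1][1] = seg          # extend the current range
        else:
            ranges.append([seg, seg])    # start a new range
    return ranges


def to_lua_table(ranges):
    """Format ranges as a Lua table whose entries are {start, end} tables."""
    return "{ " + ", ".join("{ %d, %d }" % (s, e) for s, e in ranges) + " }"


unlocked = [1, 2, 3, 7, 8, 12]  # made-up segment numbers
ranges = segments_to_ranges(unlocked)
print(ranges)                # [[1, 3], [7, 8], [12, 12]]
print(to_lua_table(ranges))  # { { 1, 3 }, { 7, 8 }, { 12, 12 } }
```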

Some puzzles have locked sections with movable sidechains or locked sections that are mutable. Some recipes incorrectly assume that "locked" means not modifiable in any way.

main dialog and segment subscore report

The recipe displays a main dialog before the segment subscore report is produced. Along with reporting other information, the dialog lets you select which subscores are to be included in the report. The mini contact table and detailed mutable reports are also optionally available, as are density analysis reports for Electron Density puzzles.

The main dialog has a "more" button, which displays less frequently used options. The hydropathy index (a fixed value based on the AA code), atom count, and rotamer count can optionally be included, along with the long names and abbreviations for the amino acid or RNA/DNA base. You can select the delimiter character, with the tab character as the default. The number of decimal places reported is also adjustable. The "line length" setting in the "more" dialog controls the maximum length of the sequence information, or anything reported with a ruler.

The segment subscore report is available in a cut-and-paste dialog, or in the scriptlog. The report includes a "Subtotal" column, with the total of the subscores for each segment, and an "Unknown" column, showing the difference between the segment's score and the total of its subscores. The subscore report also includes a total line reflecting the column totals for each of the scoring components.

density analysis

For Electron Density puzzles, the recipe offers density analysis, which appears as a default option on puzzles with a density component.

The first section of density analysis looks at density by amino acid type. Some amino acids outscore others. For example, tyrosine might average a density subscore near 50, but glycine might have average density under 20. The "density by AA" section lists each amino acid found in the puzzle, the number of segments with that AA, the total density score of those segments, and the mean density for that AA. It also lists the worst density score and the corresponding segment number, and best density score and segment number.
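
The "density by AA" statistics could be computed along these lines. This Python sketch uses a hypothetical function name and made-up density values; it assumes the sequence is a string of single-letter codes and that segments are numbered from 1.

```python
from collections import defaultdict


def density_by_aa(sequence, density):
    """Summarize density subscores per amino acid type.

    sequence: string of single-letter AA codes, one per segment
    density: list of density subscores, one per segment
    Returns a dict: AA -> count, total, mean, worst and best as (score, segment).
    """
    groups = defaultdict(list)
    for seg, (aa, score) in enumerate(zip(sequence, density), start=1):
        groups[aa].append((score, seg))
    report = {}
    for aa, entries in sorted(groups.items()):
        total = sum(score for score, _ in entries)
        report[aa] = {
            "count": len(entries),
            "total": total,
            "mean": total / len(entries),
            "worst": min(entries),  # lowest score and its segment
            "best": max(entries),   # highest score and its segment
        }
    return report


stats = density_by_aa("ygyg", [50.0, 20.0, 40.0, 10.0])
print(stats["y"]["mean"], stats["y"]["best"])   # 45.0 (50.0, 1)
print(stats["g"]["mean"], stats["g"]["worst"])  # 15.0 (10.0, 4)
```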

The next sections are similar, but show the density component for "aromatics" (rings) versus non-aromatics, aliphatics versus non-aliphatics, hydrophobics versus hydrophilics, helix versus non-helix, and sheet versus non-sheet.

Aromatic AAs typically have a much higher density score than non-aromatics. Aliphatics typically score lower than non-aliphatics. Hydrophobics and hydrophilics are close, with hydrophobics typically scoring a bit better.

The aromatics are "f" phenylalanine, "h" histidine, "w" tryptophan, and "y" tyrosine.

For this recipe, aliphatics are "v" valine, "l" leucine, and "i" isoleucine. (Not included: "g" glycine and "a" alanine.)

The first six sections of the density analysis are output in spreadsheet-ready format, similar to the main segment report.

The final section is the "density deviation" report. For each segment, this section shows a "+" if the density subscore is higher than the average for that AA, a "-" if lower, and an "=" if the density subscore is close to the mean.

The density deviation report looks something like this:

"density deviation (above/below mean density by AA)"
1234567890123456789012345678901234567890123456
-++-++-+++-+=+-+---+++=+++++++++++-+----+-=---

The density deviation section is intended to provide a quick indication of which sections are scoring best in terms of density.
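
A density deviation string like the one above could be generated roughly as follows. This is a Python sketch under stated assumptions: the function name is invented, and the `tolerance` used to decide what counts as "close to the mean" is a guess, since the actual threshold is not given here.

```python
def density_deviation(sequence, density, tolerance=1.0):
    """Return a string of '+', '-', '=' marks, one per segment.

    Each segment's density is compared with the mean density of all
    segments having the same amino acid. tolerance is an assumed cutoff
    for "close to the mean".
    """
    totals, counts = {}, {}
    for aa, score in zip(sequence, density):
        totals[aa] = totals.get(aa, 0.0) + score
        counts[aa] = counts.get(aa, 0) + 1
    marks = []
    for aa, score in zip(sequence, density):
        diff = score - totals[aa] / counts[aa]
        if abs(diff) <= tolerance:
            marks.append("=")
        elif diff > 0:
            marks.append("+")
        else:
            marks.append("-")
    return "".join(marks)


print(density_deviation("ygyga", [50.0, 20.0, 40.0, 10.0, 30.0]))  # ++--=
```

An amino acid that occurs only once always gets "=", because its score is its own mean.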

The density analysis items are available for copy-and-paste, and can also be retrieved from the scriptlog.

cut-and-paste dialogs

When you click "OK" in the main dialog, the segment subscore report and other selected reports are produced. The cut-and-paste dialog then appears, with text boxes for the subscore report, and the primary and secondary structures of the protein. These fields can be copied and pasted into a spreadsheet or another tool.

Version 2.9.1 quick fix

Version 2.9.1 fixes chain detection when proline is at the N terminal. In this case, the atom count is 18, instead of 15 when proline is mid-chain, an increase of 3. Usually the atom count increases by 2 at the N terminal.
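
The atom-count check might look something like this Python sketch. It is not the recipe's actual logic, which isn't shown here: the function names are hypothetical, and only proline's mid-chain count of 15 comes from the note above. Mid-chain counts for other amino acids would have to be supplied by the caller.

```python
def n_terminal_extra_atoms(aa):
    """Extra atoms expected at the N terminal: 3 for proline, otherwise 2."""
    return 3 if aa == "p" else 2


def looks_like_n_terminal(aa, atom_count, mid_chain_counts):
    """True if atom_count matches the mid-chain count plus the N-terminal extras.

    mid_chain_counts: dict of AA code -> atom count when mid-chain
    (caller-supplied; only the proline value below comes from the text).
    """
    return atom_count == mid_chain_counts[aa] + n_terminal_extra_atoms(aa)


mid_chain_counts = {"p": 15}
print(looks_like_n_terminal("p", 18, mid_chain_counts))  # True
print(looks_like_n_terminal("p", 17, mid_chain_counts))  # False
```

Treating proline with the usual increase of 2 would miss the start of a chain, which is the kind of case the quick fix addresses.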

Developed by: UW Center for Game Science, UW Institute for Protein Design, Northeastern University, Vanderbilt University Meiler Lab, UC Davis
Supported by: DARPA, NSF, NIH, HHMI, Amazon, Microsoft, Adobe, RosettaCommons