About SS predictions

Started by BitSpawn

BitSpawn Lv 1


I am writing a C++ program (which I hope to turn into a script later) to predict the secondary structure (SS) of a sequence.
I am using the NR database (2 GB), which I have converted into two dictionaries of segments grouped by their SS: H or E (size 5 MB).

The program builds a sorted b-tree to search for sections of the sequence, and it allows a configurable number n of mutations.
My reason for writing this program is that we use more sheets than there are in real proteins, so I wanted to do three things:

  • add Foldit designed proteins
  • increase/decrease the weight of the sheets/helices
  • write it in Lua with options for players
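
To make the idea concrete, here is a minimal sketch of the search and the weighting, not my actual code. It assumes a plain linear scan instead of the b-tree, and the names `Segment`, `mismatches`, `predict`, `helixWeight` and `sheetWeight` are made up for this example. Every dictionary segment that matches a window of the sequence with at most `maxMismatches` differences votes for its own SS on the residues it covers, and the two weights scale the H and E votes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical dictionary entry: a segment and its secondary-structure label.
struct Segment {
    std::string residues;
    char ss;  // 'H' (helix) or 'E' (sheet)
};

// Count the positions where two equal-length strings differ.
std::size_t mismatches(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++count;
    }
    return count;
}

// Return one label per residue: 'H', 'E', or '-' when there is no
// evidence or the weighted votes are tied.
std::string predict(const std::string& seq, const std::vector<Segment>& dict,
                    std::size_t maxMismatches,
                    double helixWeight, double sheetWeight) {
    std::vector<double> h(seq.size(), 0.0), e(seq.size(), 0.0);
    for (const Segment& seg : dict) {
        const std::size_t len = seg.residues.size();
        if (len == 0 || len > seq.size()) continue;
        // Slide the segment over every window of the same length.
        for (std::size_t start = 0; start + len <= seq.size(); ++start) {
            if (mismatches(seq.substr(start, len), seg.residues) > maxMismatches) {
                continue;
            }
            std::vector<double>& votes = (seg.ss == 'H') ? h : e;
            const double w = (seg.ss == 'H') ? helixWeight : sheetWeight;
            for (std::size_t i = start; i < start + len; ++i) votes[i] += w;
        }
    }
    std::string out(seq.size(), '-');
    for (std::size_t i = 0; i < seq.size(); ++i) {
        if (h[i] > e[i]) out[i] = 'H';
        else if (e[i] > h[i]) out[i] = 'E';
    }
    return out;
}
```

Raising `helixWeight` relative to `sheetWeight` is one simple way to push the prediction away from sheets; a real version would replace the linear scan with the tree lookup.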

I have finished a first approximation, and the results aren't too bad. For example, with this real seq:

the real SS is:

the Yaspin (NN DSSP-trained with NR) prediction is:

my prediction is:

Not too bad for the first version.

But now I have a problem. If I check a non-real protein made from pieces of a real protein that contain only sheets, for example:


Question 1: when I wrote the sequence, I expected the result to be all sheets.
Is there any chance that the Yaspin prediction is mistaken because the seq is absurd?
Without a real protein I do not know how to validate it.

Question 2: Is there a seq/SS db with Foldit designs?

Question 3: I think one of the problems in my program is that I allow any variation. For example, if I search for AVVLTMSA with 1 variation,
then AVVLTXSA is valid for me (X = any). But I guess not all mutations are valid.
Is knowing which mutations are possible a solvable problem?
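
To show what I mean by restricting the variations, here is a rough sketch. The residue classes in `residueClass` are only an illustrative assumption of mine (a coarse grouping by side-chain type), not a validated substitution model; substitution matrices such as BLOSUM62 are the usual way to score how plausible a replacement is. The function names are also made up for this example. A mismatch is accepted only when both residues fall in the same class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical coarse residue classes. Returns 0 for residues that are
// treated as not interchangeable here (C, G, P, X and anything unknown).
int residueClass(char aa) {
    static const std::string groups[] = {"AVLIM", "FWY", "STNQ", "KRH", "DE"};
    for (int g = 0; g < 5; ++g) {
        if (groups[g].find(aa) != std::string::npos) return g + 1;
    }
    return 0;
}

// True if `segment` equals `query` except for at most `maxMutations`
// positions, and every differing position swaps residues of the same class.
bool matchesWithConservativeMutations(const std::string& query,
                                      const std::string& segment,
                                      std::size_t maxMutations) {
    assert(query.size() == segment.size());
    std::size_t used = 0;
    for (std::size_t i = 0; i < query.size(); ++i) {
        if (query[i] == segment[i]) continue;
        const int qc = residueClass(query[i]);
        if (qc == 0 || qc != residueClass(segment[i])) return false;
        if (++used > maxMutations) return false;
    }
    return true;
}
```

With this rule, AVVLTMSA still matches AVVLTLSA with 1 variation (M and L are in the same class), but it no longer matches AVVLTXSA or AVVLTDSA.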