About SS predictions

Started by BitSpawn

BitSpawn Lv 1

Hi

I am writing a C++ program (which I hope to turn into a script) to predict the secondary structure (SS) of a sequence.
I am using the NR database (2 GB). I have converted it into two dictionaries of segments grouped by their SS, H or E (size 5 MB).

The program builds a sorted B-tree to search for sections of the sequence, and it allows a configurable number n of mutations.
The reason for writing this program is that we use more sheets than there are in real proteins, so I wanted to do three things:

  • add Foldit designed proteins
  • increase/decrease the weight of the sheets/helices
  • write it in Lua with options for players
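
To make the search step concrete, here is a minimal sketch of the matching idea, not my actual code. It assumes the two dictionaries are plain lists scanned linearly (instead of the sorted B-tree), that a mutation is a single-residue substitution with no insertions or deletions, and that the weights are simple multipliers on the number of hits. The names matchesWithMutations, weightedHits and classifyWindow are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True when two equal-length segments differ in at most maxMutations
// positions (Hamming distance), stopping early once the budget is exceeded.
bool matchesWithMutations(const std::string& query, const std::string& entry,
                          std::size_t maxMutations) {
    if (query.size() != entry.size()) return false;
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < query.size(); ++i) {
        if (query[i] != entry[i] && ++mismatches > maxMutations) return false;
    }
    return true;
}

// Hypothetical scoring: number of dictionary hits times a player-set weight.
double weightedHits(const std::string& query,
                    const std::vector<std::string>& dictionary,
                    std::size_t maxMutations, double weight) {
    std::size_t hits = 0;
    for (const std::string& entry : dictionary) {
        if (matchesWithMutations(query, entry, maxMutations)) ++hits;
    }
    return weight * static_cast<double>(hits);
}

// Returns 'H', 'E' or '-' for one window by comparing the two weighted
// scores. A tie between non-zero scores goes to 'H' (arbitrary choice).
char classifyWindow(const std::string& window,
                    const std::vector<std::string>& helixDict,
                    const std::vector<std::string>& sheetDict,
                    std::size_t maxMutations,
                    double helixWeight, double sheetWeight) {
    assert(helixWeight >= 0.0 && sheetWeight >= 0.0);
    const double h = weightedHits(window, helixDict, maxMutations, helixWeight);
    const double e = weightedHits(window, sheetDict, maxMutations, sheetWeight);
    if (h == 0.0 && e == 0.0) return '-';
    return h >= e ? 'H' : 'E';
}
```

Raising sheetWeight relative to helixWeight pushes ambiguous windows toward E, which is the kind of knob I would like to expose to players in the Lua version.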

I have finished a first approximation, and the results aren't too bad. For example, with this real sequence:
AMIEIKDKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQEVYEMNFGEKPEGDITQVNEKTIPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDIARIVREK
KPKVVFMENVKNFASHDNGNTLEVVKNTMNELDYSFHAKVLNALDYGIPQKRERIYMICFRNDLNIQNFQFPKPFELNTFVKDLLLPDSEVEHLVIDRKDLVMTNQEIEQTT
PKTVRLGIVGKGGQGERIYSTRGIAITLSAYGGGIFAKTGGYLVNGKTRKLHPRECARVMGYPDSYKVHPSTSQAYKQFGNSVVINVLQYIAYNIGSSLNFKPYY

the real SS is:
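-------------EEEEE-----HHHHHHHH---EEEEE----HHHHHHHHHHH--------------------EEEE-----------------------HHHHHHHHHHH
---EEEEEEE---------HHHHHHHHHHHH-----EEEEEE----------EEEEEEEE----------------------------------EE-----EE---------
----EEEE--------EEEE--------------------EEEE--EEE---HHHHHHH------------HHHHHHHHHH---HHHHHHHHHHHHHHHH-----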

the Yaspin (NN DSSP-trained with NR) prediction is:
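------------EEEEEEE--HHHHHHHHHH---EEEEEE---HHHHHHHHH--------EEEEE------EEEEEEE-----------------------HHHHHHHHHHH
---EEEEE--HHHHH---HHHHHHHHHHHHH---EEEEEE--HHH------EEEEEEEEE----------------------------HHHHHHHHHH--------------
------EEEE-----EEEE---------------------------------HHHHHHHH-----EEE---HHHHHHHH-----HHHHHHHHHHHHHH-------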

my prediction is:
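-----------------------HHHHHHHHH----------HHHHHHHHHHH-----------------------------------------------HHHHHHHHHHHH
--EEEEEEE-HHHHHHHHHHHHHHHHHHHH--EEEEEEEEE----------EEEEEEEEE----------------------EEEEEEEEEEEEEE----------------
---EEEEEEE-HHHHHHH----------------------EEEEEEE-----HHHHHHHH-----------HHHHHHHHHH---HHHHHHHHHHHH---------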

Not too bad for the first version.
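
To put a number on "not too bad", the per-residue agreement between a prediction and the real SS (often called the Q3 score when the three states are H, E and other) could be computed with something like the sketch below. It assumes both strings use the same alphabet ('H', 'E', '-') and are aligned position by position; ssAgreement is just a name I picked for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Fraction of positions where the predicted state equals the observed one.
// Both strings must have the same length; an empty pair scores 0.
double ssAgreement(const std::string& observed, const std::string& predicted) {
    assert(observed.size() == predicted.size());
    if (observed.empty()) return 0.0;
    std::size_t same = 0;
    for (std::size_t i = 0; i < observed.size(); ++i) {
        if (observed[i] == predicted[i]) ++same;
    }
    return static_cast<double>(same) / static_cast<double>(observed.size());
}
```

For example, ssAgreement("HHHH--EE", "HHH---EE") gives 0.875, because 7 of the 8 positions agree.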

But now I have a problem. If I check a non-real protein made only from sheet pieces of a real protein, for example:
AAAATLAAAAVIGAAAAVVLTMSAAACVVLAAAAFESAAAFSFAAAFSVAAAFTPHAAAFVEIAAAGIVSYAAAGKVYAAAGVFAAAIA

Yaspin says: -HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH--HHHHHHHHHHH--HHHHHH--HHHHHHHHHHHHH--
And the program: HHHH--EEEEEEEEEEEEEEEEEEEEEEEEEEEE--EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE---EEEEEEEE-

Question 1: when I wrote the sequence, I expected the result to be sheets.
Is there any chance that the Yaspin prediction is wrong because the sequence is absurd?
Without a real protein I do not know how to validate it.

Question 2: Is there a sequence/SS database of Foldit designs?

Question 3: I think one of the problems in my program is that I allow any variation. For example, if I search for AVVLTMSA with 1 variation,
then AVVLTXSA is valid for me (X = any residue). But I guess not all mutations are valid.
Is knowing which mutations are possible a solvable problem?
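
To show what I mean by restricting the mutations, here is a minimal sketch where a mismatch is accepted only if both residues belong to the same group. The grouping is only my assumption for illustration (a rough split by side-chain type), not a validated substitution model, and residueClass and matchesConservatively are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative residue classes (an assumption, not a validated model).
// Returns the class index, or -1 for a letter that is in no class.
int residueClass(char aa) {
    static const std::vector<std::string> classes = {
        "AVLIM",  // aliphatic / hydrophobic
        "FWY",    // aromatic
        "STNQ",   // polar, uncharged
        "KRH",    // positively charged
        "DE",     // negatively charged
        "CGP"     // special cases
    };
    for (std::size_t c = 0; c < classes.size(); ++c) {
        if (classes[c].find(aa) != std::string::npos) return static_cast<int>(c);
    }
    return -1;
}

// Like a mismatch-tolerant comparison, but a mismatch is accepted only when
// both residues fall in the same class, and at most maxMutations times.
bool matchesConservatively(const std::string& query, const std::string& entry,
                           std::size_t maxMutations) {
    if (query.size() != entry.size()) return false;
    std::size_t mutations = 0;
    for (std::size_t i = 0; i < query.size(); ++i) {
        if (query[i] == entry[i]) continue;
        const int a = residueClass(query[i]);
        const int b = residueClass(entry[i]);
        if (a < 0 || a != b) return false;  // change across classes: rejected
        if (++mutations > maxMutations) return false;
    }
    return true;
}
```

With this rule, AVVLTMSA matches AVVLTLSA with 1 variation (M and L are in the same group) but not AVVLTDSA, whereas my current program accepts both.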

thanks
BitSpawn