Joined: 08/09/2010

Hi

I am writing a C++ program (which I hope to turn into a script later) to find the secondary structure (SS) of a sequence.
I am using the NR database (2 GB), which I have converted into two dictionaries of segments grouped by their SS: H or E (size 5 MB).

The program builds a sorted B-tree to search for sections of the sequence, and it allows a configurable number n of mutations.
The reason for writing this program is that we use more sheets than real proteins have, so I wanted to do three things:
- add Foldit-designed proteins
- increase or decrease the weight of sheets and helices
- write it in Lua with options for players
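
The search described above can be sketched roughly as follows. This is only a minimal Python illustration, not the C++ program itself: it replaces the B-tree with a plain scan, and the toy segment lists, the `weights` parameter and the function names are all made up for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_ss(seq, segments, max_mut=1, weights=None):
    """Label each residue H, E or '-' by voting over matching segments.

    segments: dict mapping 'H' / 'E' to a list of known segments (toy data).
    max_mut:  mismatches allowed per segment (the n mentioned above).
    weights:  optional vote weight per class, e.g. {'H': 1.0, 'E': 0.5}.
    """
    weights = weights or {"H": 1.0, "E": 1.0}
    votes = [{"H": 0.0, "E": 0.0} for _ in seq]
    for ss, segs in segments.items():
        for seg in segs:
            k = len(seg)
            # Slide the segment along the sequence (plain scan, no B-tree).
            for i in range(len(seq) - k + 1):
                if hamming(seq[i:i + k], seg) <= max_mut:
                    for j in range(i, i + k):
                        votes[j][ss] += weights[ss]
    out = []
    for v in votes:
        if v["H"] == 0 and v["E"] == 0:
            out.append("-")  # no segment covers this residue
        else:
            out.append("H" if v["H"] >= v["E"] else "E")
    return "".join(out)


# Hypothetical dictionaries, far smaller than the real 5 MB ones.
toy_segments = {"H": ["AVVLT"], "E": ["GIVSY"]}
print(predict_ss("KKAVVLTKKGIVSYKK", toy_segments, max_mut=0))
```

Lowering the weight for E in `weights` makes sheets lose the vote wherever a helix segment also matches, which is the kind of option I would like to expose to players.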

I have finished a first approximation, and the results aren't too bad. For example, with this real seq:
AMIEIKDKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQEVYEMNFGEKPEGDITQVNEKTIPDHDILCAGFPCQAFSISGKQKGFEDSRGTLFFDIARIVREK
KPKVVFMENVKNFASHDNGNTLEVVKNTMNELDYSFHAKVLNALDYGIPQKRERIYMICFRNDLNIQNFQFPKPFELNTFVKDLLLPDSEVEHLVIDRKDLVMTNQEIEQTT
PKTVRLGIVGKGGQGERIYSTRGIAITLSAYGGGIFAKTGGYLVNGKTRKLHPRECARVMGYPDSYKVHPSTSQAYKQFGNSVVINVLQYIAYNIGSSLNFKPYY

the real SS is:
-------------EEEEE-----HHHHHHHH---EEEEE----HHHHHHHHHHH--------------------EEEE-----------------------HHHHHHHHHHH
---EEEEEEE---------HHHHHHHHHHHH-----EEEEEE----------EEEEEEEE----------------------------------EE-----EE---------
----EEEE--------EEEE--------------------EEEE--EEE---HHHHHHH------------HHHHHHHHHH---HHHHHHHHHHHHHHHH-----

the Yaspin prediction (an NN trained on DSSP with NR) is:
------------EEEEEEE--HHHHHHHHHH---EEEEEE---HHHHHHHHH--------EEEEE------EEEEEEE-----------------------HHHHHHHHHHH
---EEEEE--HHHHH---HHHHHHHHHHHHH---EEEEEE--HHH------EEEEEEEEE----------------------------HHHHHHHHHH--------------
------EEEE-----EEEE---------------------------------HHHHHHHH-----EEE---HHHHHHHH-----HHHHHHHHHHHHHH-------

my prediction is:
-----------------------HHHHHHHHH----------HHHHHHHHHHH-----------------------------------------------HHHHHHHHHHHH
--EEEEEEE-HHHHHHHHHHHHHHHHHHHH--EEEEEEEEE----------EEEEEEEEE----------------------EEEEEEEEEEEEEE----------------
---EEEEEEE-HHHHHHH----------------------EEEEEEE-----HHHHHHHH-----------HHHHHHHHHH---HHHHHHHHHHHH---------

Not too bad for the first version.
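
To put a number on "not too bad", the per-residue agreement between a prediction and the real SS (usually called Q3 for the three states H, E and -) can be computed. A minimal Python sketch follows; the function names are mine, and the strings in the usage lines are toy data, not the sequences above.

```python
def q3_accuracy(real, predicted):
    """Fraction of residues whose predicted state equals the real state."""
    if len(real) != len(predicted):
        raise ValueError("real and predicted SS strings must have equal length")
    if not real:
        raise ValueError("empty SS string")
    return sum(r == p for r, p in zip(real, predicted)) / len(real)


def state_recall(real, predicted, state):
    """Fraction of residues really in `state` that were predicted as `state`.

    Returns None when the real SS contains no residue in that state.
    """
    if len(real) != len(predicted):
        raise ValueError("real and predicted SS strings must have equal length")
    positions = [i for i, r in enumerate(real) if r == state]
    if not positions:
        return None
    return sum(predicted[i] == state for i in positions) / len(positions)


# Toy strings only; paste the real and predicted SS here to compare them.
real_ss = "--HHHH--EE"
pred_ss = "--HHH---EE"
print(q3_accuracy(real_ss, pred_ss))
print(state_recall(real_ss, pred_ss, "H"))
```

Looking at `state_recall` for E and H separately would show whether a predictor misses sheets more often than helices.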

But now I have a problem. If I check an artificial protein built only from sheet pieces of a real protein, for example:
AAAATLAAAAVIGAAAAVVLTMSAAACVVLAAAAFESAAAFSFAAAFSVAAAFTPHAAAFVEIAAAGIVSYAAAGKVYAAAGVFAAAIA

Yaspin says: -HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH--HHHHHHHHHHH--HHHHHH--HHHHHHHHHHHHH--
And the program: HHHH--EEEEEEEEEEEEEEEEEEEEEEEEEEEE--EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE---EEEEEEEE-

Question 1: when I wrote the sequence, I expected the result to be sheets.
Is there any chance that the Yaspin prediction is wrong because the seq is absurd?
Without a real protein I do not know how to validate it.

Question 2: Is there a seq/SS db with Foldit designs?

Question 3: I think one of the problems in my program is that I allow any variation. For example, if I search for AVVLTMSA with 1 variation,
then AVVLTXSA is a valid match for me (X = any residue). But I guess not all mutations are valid.
Is knowing which mutations are possible a solvable problem?
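
To make the question concrete, here is a rough Python sketch of one possible restriction: a variation only counts as valid when both residues belong to the same group of similar amino acids. The groups below are my own illustrative assumption, not a validated table; a substitution matrix such as BLOSUM62 would probably be a better source for deciding which changes are acceptable.

```python
# Illustrative residue groups (an assumption for this sketch, not a standard table).
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def matches_conservatively(query, segment, max_mut=1):
    """True if segment equals query up to max_mut same-group substitutions.

    Any substitution between different groups (or involving an unknown
    letter such as X) rejects the match, however few mutations there are.
    """
    if len(query) != len(segment):
        return False
    mutations = 0
    for q, s in zip(query, segment):
        if q == s:
            continue
        if q not in GROUP_OF or GROUP_OF[q] != GROUP_OF.get(s):
            return False
        mutations += 1
    return mutations <= max_mut


print(matches_conservatively("AVVLTMSA", "AVVLTLSA"))  # M and L share a group
print(matches_conservatively("AVVLTMSA", "AVVLTKSA"))  # M and K do not
```

With this check, AVVLTXSA no longer matches for every X, only for residues grouped with M.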

thanks
BitSpawn

Joined: 08/09/2010
Source and binary for Linux

You can download the first version for testing from here:
http://www.argoslabs.com/foldit/buscador_linux_src.tgz
