Tlaloc
Joined: 08/04/2008
Groups: Mojo Risin'

I am starting this thread as a place to discuss the disaster recovery from the server crash. I want this thread to be constructive criticism, not a place to gripe about stuff. What can the Foldit guys take away that will improve their recovery attempts if this ever happens again? For stuff that still isn't working, what can they do differently, policy-wise, from what they are doing now?

So to start:

The main thing we need is more information. A three-sentence blurb on the blog about server difficulties isn't enough. There should be a little more talk about what went wrong and an ETA for when it will be fixed. This needs to be the first thing people see on the web site, even more important than the announcement about the Nature paper. It needs to be updated as new information comes in.

Joined: 09/18/2009
Groups: SETI.Germany
Right...

I was wondering about that, too.
I even asked on Facebook what was going on, but didn't get an answer.

Joined: 12/07/2007
Groups: Contenders
I think a brief note on the Rosetta@home forums would be useful next time such an unfortunate event occurs.

mimi
Joined: 11/17/2008
Groups: Contenders
Please, please

Don't just leave everyone wondering what is happening. When problems occur, communicate by whatever means are possible to tell us what the situation is, and update that at least daily even if there isn't any progress.
Even if it is all bad news, it will make the users (players) happier if they have an idea of what the problem is and, if possible, what is being done and when it might be done by.

When it became apparent that there was a serious outage this time, I tried looking in lots of places to see if there was any news (the Baker Lab site, Rosetta@home, etc.) and found nothing.
Especially given the circumstances, it would have been a good idea for potential new players to have been able to find a notice telling them there was a problem and asking them to try again later.

Now we have a skeleton service back up, but with various bits missing and no information as to when they are likely to be available. It's not the lack of facilities that makes me upset, it's the not knowing.

Joined: 06/17/2010
Facebook or Twitter. Just use it to post msgs :]

zoran
Joined: 11/10/2007
Groups: Window Group
what happened and better notification mechanisms

Hi Folks,

You are absolutely right. We messed up in the frantic rush to recover the server. We didn't think of Facebook as an alternate channel. We will surely do that in the future. Here is a brief description of what happened:

Two days before the Nature paper came out, something went wrong with the AC in the machine room hosting the Foldit servers. The temperature reached the threshold at which the power automatically shuts off in the entire room. This has happened before, followed by a quick restart of the entire setup. Unfortunately, this time the sudden loss of power completely trashed our RAID file server. During that time most of the key personnel were out of town (I was in Norway, Seth in Japan, Firas in Turkey, etc.). Some of us tried to help remotely, but this of course is a very different thing.

Unlike our database server and the web servers, the file server doesn't have a shadow copy of the filesystem, so when it went down everything went down. We then tried to copy all the information from the filesystem RAID disks. The first attempt took way too long, so we went to plan B, which worked. At that point we decided to resurrect the copy of the server on our development machine. This copy had some functionality missing, but because of the big press wave, we decided that it was better to have a partly working server than no server at all. The next day the main servers were ready, at which point we switched back. As a result of all the copying, some parts of the system didn't have the right permissions set, which still left some parts of the system not functional. At the same time, the database server was at its limit for the queries it can possibly serve, because we had an 80-fold increase in daily registrations due to the press wave. The DB server had to be restarted several times.

From what we can tell, it seems that all the functionality is back. We will use Facebook as an update mechanism in the future. Longer term, I have decided to move the entire server structure to the cloud, which will make it completely robust to failure of any individual machine.

Our apologies to all of you who were frustrated with the lack of full functionality.

Zoran

Developed by: UW Center for Game Science, UW Institute for Protein Design, Northeastern University, Vanderbilt University Meiler Lab, UC Davis
Supported by: DARPA, NSF, NIH, HHMI, Amazon, Microsoft, Adobe, RosettaCommons