Update on technical difficulties
Dear Foldit users,
As most of you know, the game has had only partial functionality for a number of days. Here is some background on what happened. Two days before the Nature paper came out, the cooling system in the machine room hosting the Foldit servers malfunctioned. The temperature reached the threshold at which power to the entire room shuts off automatically. This has happened before, and in the past we recovered quickly by starting all our servers back up. Unfortunately, this time the sudden loss of power completely trashed our RAID file server. Unlike for our database server and web servers, we don't have a shadow copy of the filesystem, so when it went down, everything went down. At the time, most of the key personnel were out of town (I was in Norway, Seth in Japan, Firas in Turkey, etc.), and helping remotely proved very challenging.
We then tried to copy all the information off the filesystem RAID disks. The second attempt at copying this massive set of files worked. At that point we decided to resurrect the copy of the server on our development machine. This copy had some functionality missing, but because of the big press wave, we decided that it was better to have a partly working server than no server at all. The next day the main servers were ready, at which point we switched back. As a result of all the copying, some parts of the system didn't have the right permissions set, which left them not functional. At the same time, the database server was at its limit for the number of queries it could serve, because the press wave brought an 80-fold increase in daily registrations. We've had press waves before, but we've never seen this magnitude of interest. The DB server query queue had to be restarted several times to keep it from completely bogging down. This, of course, created many in-game and web portal timeouts.
From what we can tell, all the functionality is now back.
We've learned several things from this perfect storm. In the future we will use Facebook as an update mechanism in case our web servers are not functional. We'll also keep our development server ready to take over as the main server, with limited functionality, as a temporary solution. Longer term, as soon as we get funds to revamp our server structure, I have decided to move the entire server structure to the cloud (most likely Amazon services), which will make it completely robust to the failure of any individual machine. Furthermore, we will be able to easily scale up the entire infrastructure any time we need to increase our capacity.
Our apologies to all of you who were frustrated by the lack of full functionality over the past days. It is still possible that partial or full downtime will happen, but we'll keep you up to date on the blog (and on Facebook if the whole server is down).
Feel free to respond to this post if you notice any functionality still missing.
Zoran
(Posted by zoran | Tue, 08/10/2010 - 09:46)