The forum and website outages were caused by runaway web server processes. We took a systematic approach to resolving them and found several different problems, all of which combined to cause excessive load on the server. The main problems we found were related to user content (UC) and the way that the web server talked to the database. Our web guru rewrote a fair amount of the user content flow, which cured most of the problems caused by UC. As a precautionary measure, we're going to change some of the backend infrastructure so that UC is handled on completely different servers in the future.
The database communication issues took a little longer to resolve. Without going into too much detail, when we experienced burst traffic to the website, the web server CPU utilization would go through the roof. This happened most commonly when a game server was taken offline. We tracked this down to rate limiting that the database was doing. To fix this, we changed the way the web server connects to the database: instead of opening a separate connection for every request, requests now share a set of connections (known as connection pooling) using FastCGI. We immediately saw measurable performance improvements, and when we took the cluster down, the web server stayed up.
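For readers unfamiliar with connection pooling, the difference can be illustrated with a minimal sketch. This is not FLS's code: `FakeConnection`, `ConnectionPool`, and `simulate_burst` are hypothetical names, and the fake connection only counts how many times a connection is opened rather than talking to a real database. It assumes a fixed-size pool that makes requests wait for an idle connection instead of opening a new one.

```python
import queue
import threading
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for a database connection; counts how many are opened."""
    opened = 0
    _lock = threading.Lock()

    def __init__(self):
        with FakeConnection._lock:
            FakeConnection.opened += 1

    def query(self, sql):
        return f"result of {sql}"


class ConnectionPool:
    """Fixed-size pool: connections are opened once and then shared by requests."""

    def __init__(self, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(FakeConnection())

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back for the next request


def handle_request_unpooled(sql):
    # One new connection per request: a burst of N requests opens N connections.
    return FakeConnection().query(sql)


def handle_request_pooled(pool, sql):
    # Borrow a shared connection: a burst never opens more than the pool size.
    with pool.connection() as conn:
        return conn.query(sql)


def simulate_burst(n_requests, pool_size):
    """Return (connections opened without pooling, connections opened with pooling)."""
    FakeConnection.opened = 0
    for _ in range(n_requests):
        handle_request_unpooled("SELECT 1")
    unpooled = FakeConnection.opened

    FakeConnection.opened = 0
    pool = ConnectionPool(pool_size)
    threads = [
        threading.Thread(target=handle_request_pooled, args=(pool, "SELECT 1"))
        for _ in range(n_requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return unpooled, FakeConnection.opened


if __name__ == "__main__":
    print(simulate_burst(n_requests=50, pool_size=4))
```

In this sketch, a burst of requests opens one connection each in the unpooled case, while the pooled case opens only as many connections as the pool holds, which is the kind of behavior that keeps a burst from running into connection rate limits on the database side.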
At this point we believe we have all of the major website performance issues resolved, but we continue to optimize and improve the web infrastructure.
On SOE outages:
We have had a few unannounced SOE maintenance periods. We are working with SOE Operations to make sure we are notified in advance of planned outages that will affect us, and that we are promptly notified of any unanticipated downtime. We are the first major partner to go live with SOE Platform Publishing, and as a result, we are working through the inevitable process issues together. We are confident that we will continue to improve our communication. As with everything we do, FLS will be as proactive as possible when communicating with our community.
On maintenance announcements:
We believe very strongly in being as proactive as possible when notifying our community of outages and maintenance. Part of our standard process when performing planned maintenance and dealing with unplanned downtime is to be as open and candid with our community as we can. This may lead to an abundance of maintenance notifications, but we feel that it is better to provide our customers and community with more information instead of simply acknowledging downtime when it occurs. We are also reaching the end of our closed beta, which is a time for us to do some final overhauling of our infrastructure prior to launch. This will naturally lead to additional outages as we prepare for what we hope will be a tremendously successful launch.
So there you have it. We at Coldfront would like to thank Gray Noten for taking time out of his clearly busy schedule to answer this question.