We apologize again for the issues you experienced this morning. Here is an explanation of what happened and the steps we are taking to resolve this matter permanently.
Please remember that we are currently in a domino sequence of events that all started last week, when our servers experienced a DoS cyber-attack. To mitigate the attack, we implemented Cloudflare, which helps filter out unwanted traffic. When Cloudflare was implemented, it initially caused conflicts with our security certificates. Once that was resolved, we all started to notice some slowness in the software. This was the result of the extra filtering and rules now being applied to all traffic reaching our servers.
Because we were not satisfied with the assistance we received from our current hosting provider, Rackspace, during the attacks, we decided to move our infrastructure to AWS. In fact, this move was already in the works from a purely infrastructure and technology standpoint, but when Rackspace's service response failed to meet our expectations, we decided to accelerate our plans.
Initially, we thought we could wait until this weekend to address the slowness and felt much of it would be resolved with the move to AWS. However, as the complaints grew this week, we decided to address it last night with a simple task: doubling the memory on the cluster of servers whose performance was most impacted. This is something we have done numerous times in the past with no issue whatsoever, so we felt confident it would help the situation.
Unfortunately, during that work, the Rackspace technician accidentally replaced a key configuration file on these servers with the old, default configuration file. This file contains many settings, including the maximum number of simultaneous connections allowed to the database. The default value is 30, but our production value is in the thousands, the maximum allowed.
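To illustrate the kind of mismatch involved, here is a minimal sketch in a MySQL-style configuration file. This is a hypothetical example only: the specific database product, file, and production number are not named above, so the names and values here are assumptions for illustration.

```ini
[mysqld]
# Default value shipped in the old configuration file:
# only 30 simultaneous client connections are permitted.
max_connections = 30

# Production tuning raises this to the server maximum
# (in the thousands), e.g.:
# max_connections = 10000
```

With the default in place, the 31st concurrent connection is simply refused, which is why the problem only surfaced under real user load.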
So when our team tested everything last night against these databases, everything looked fine, because it was a single person testing, not 31 connections, let alone the hundreds of users assigned to each server. This morning, when everyone started hitting the servers, we naturally exceeded 30 connections and the problem was revealed.
While we were troubleshooting, we believed we could pull the data directly from the affected databases, but we learned that the changes in the configuration file also blocked our direct connections to the server.
The real issue came when we called Rackspace: by that point, the pre-update configuration file had been overwritten and was gone. We had to speak with several support representatives over the course of two hours before the matter was finally resolved. This whole experience with Rackspace is unacceptable, and it only solidifies our planned move to AWS.
As of now, we are still on schedule to move the entire infrastructure to AWS this weekend. This will require some planned downtime over the weekend.
Given the experience of the past week, coupled with a major infrastructure move, we are planning to generate reports of everyone's calendar (the service call report), so that if something should happen over the weekend, everyone has their calls and is in a position to take care of their customers. I would also recommend printing these reports out internally on Friday for next week.
There are some things we can do better as an organization, and we are going to start implementing them tonight.
First, our support team was not online when our east coast clients first started experiencing the problem. Had they been online and escalated the problem to the dev team at that point, it would have been resolved sooner.
Therefore, starting tomorrow, June 3rd, we are shifting our support hours to begin at 5am Central Time, three hours earlier than normal. Our HelpDesk support will be online monitoring and responding to tickets at that time, making our official support hours 5am to 5pm Central Time.
Second, we are going to run a nightly report containing all of the service calls for each client, so that if the server is ever unavailable in the future, we can immediately send everyone their calls for the day. Once the move to AWS is completed, we will set up this report to send directly to each company. In the meantime, it will be generated internally so that we have it and can distribute it as needed.
Again, I am terribly sorry for the inconvenience, confusion, and frustration you have felt over the past week, especially today. We have been working around the clock ever since the first cyber-attack last week, and we won't quit until things are back to normal.