We sincerely apologize for tonight's downtime, which lasted almost two full hours, much longer than any previous unplanned outage. As per the RolePlayGateway Twitter account (thankfully we had one, so we could announce updates while the server was down: follow us now if you haven't already), it was completely unexpected, and we have decided to apologize to the community with a fully detailed explanation of what happened, why it happened, and what we've done to prevent it from happening in the future.
Warning: Complex and mind-numbing code-fu and server leetness condensed into simple, readable sentences with silly similes and morbid metaphors ahead. Read at your own risk.
Enter the RolePlayGateway server, a complex quad-core beast with several interacting components. As many of you are aware, we have previously had a lot of issues with the egosearch (or "View Your Posts") feature, including but not limited to blank pages and empty results. This feature is one of the most heavily used tools on the site, as it allows quick and easy access to all of the topics in which you've posted.
However, this tool is not without its flaws - because it runs a very extensive set of queries on the server, it often puts a heavy load on our SQL database. The SQL database is where every single bit of RolePlayGateway's data is stored, and when posts are made on the forum, tiny little robots scan through this database and update and retrieve all of the information necessary to serve your request. Users with a large number of small posts may notice more problems than users with a smaller number of bigger posts, because the overhead of finding them is based on the total number of posts. Of course, when many users are on the site at the same time, there are literally hundreds of these little robots (called "pointers") running around and updating the database.
The problem arises when one robot is updating a chunk of data that another robot is waiting for. To make sure no one loses any data, only one robot can access any given part of the database at a time. This amount of time is usually very, very small (during optimal performance, it should take no longer than 0.001s), but when a robot needs to check hundreds of other parts of the database before updating another, and there are several robots waiting for those parts, the result is a veritable traffic jam. This traffic jam, if not resolved quickly, often cascades into much longer wait times (during our worst times, up to around 5.000s) for each query.
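For the numerically inclined, the traffic jam can be sketched with a toy model in Python. This is an illustration only, not the site's actual code: the function name `queue_wait_ms`, the idea that every robot shows up at the same moment, and the 1 ms hold time per robot are all assumptions made up for the example.

```python
def queue_wait_ms(hold_times_ms):
    """Return how long each robot waits (in ms) before its turn.

    Toy model (an assumption, not a real database): all robots arrive
    at once and take turns, each holding the same part of the database
    for its own hold time.
    """
    waits = []
    elapsed = 0
    for hold in hold_times_ms:
        waits.append(elapsed)   # this robot waited for everyone ahead of it
        elapsed += hold         # then it holds the data for `hold` ms
    return waits


# Hypothetical numbers: 5000 robots, each needing 1 ms.
waits = queue_wait_ms([1] * 5000)
print(waits[0], waits[-1])
```

In this toy model the first robot waits 0 ms and the last one waits 4999 ms, so a pile of very quick jobs can still add up to a multi-second wait for whoever is at the back of the line. The real causes are messier, but the arithmetic is the same.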
To resolve this, we decided to eliminate a lot of our robots in favor of one big robot that keeps its own copy of the database for search queries only. To do this, we implemented a new search engine on the back end of our site called Sphinx. This search engine plugs directly into our site's server and does the same work that the little robots used to do in one big chunk, but in a separate building, if you will. This frees up the hallways in the database for more little robots to do other work, which ultimately makes the whole site run much faster. There is one small drawback, in that occasionally your searches will come back with no results, but this is an issue we are working on and hope to have an update for you in the future.
When Sphinx communicates with the server, it does so over a so-called "tube", or a port. All communication between our HTTP server (Apache) and Sphinx is sent through this port, and this is exactly what frees up other tubes for the rest of the work around the site. Apache is who your browser talks to (be it Google Chrome, Firefox, or Internet Explorer) when it wants to visit RolePlayGateway: Apache listens to you, tells the robots to do work, then sends you exactly what you requested within a matter of milliseconds. If all of those little SQL robots get their job done very quickly, you can usually get your page back in under 0.100s. However, it is important that this tube (the port between Sphinx and Apache) is created in the proper order, or disastrous situations arise.
Such was the case tonight. Under certain conditions (currently, once it has handled 4000 requests), Apache needs to restart itself so it can make sure it has a clean work area and that it can quickly handle everything that the wonderful users of RolePlayGateway ask of it. Unfortunately, Apache isn't the smartest worker in our roleplaying factory of awesomesauce, and it occasionally forgets that it needs to communicate with Sphinx before starting up. When it tries to restart without telling Sphinx, it will go into a "Stopped" state (or flat out crash), and will simply fail to restart. When this happens, the robots running the database finish their work and sit idle, because Apache isn't telling them what to do.
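One way to picture the "proper order" problem is a check that the tube to Sphinx is actually open before the web server is started. The Python below is a minimal sketch under stated assumptions, not what runs on our server: the helper names `port_is_open` and `safe_to_start_web_server` are hypothetical, and the host and port (127.0.0.1, 9312) are placeholders rather than our real settings.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, unreachable, and so on.
        return False


def safe_to_start_web_server(sphinx_host="127.0.0.1", sphinx_port=9312):
    """Hypothetical pre-start check: only start once Sphinx is reachable.

    The default host and port are assumptions for illustration.
    """
    return port_is_open(sphinx_host, sphinx_port)


if __name__ == "__main__":
    if safe_to_start_web_server():
        print("Sphinx is listening; OK to start the web server.")
    else:
        print("Sphinx is not reachable; hold off on starting.")
```

A start-up script could run a check like this and refuse to proceed (or retry) until the search engine answers, so the two never come up in the wrong order.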
What we did to fix it...
This particular problem has happened three times in the past couple of weeks, but since we have such an amazing group of RolePlayGateway Administrators, it was usually caught and fixed in a matter of minutes. Unfortunately, we missed it this time. We've updated our monit configuration to apply the fix automatically if it detects that the server has been down for more than a few minutes. (Monit is like another big robot that monitors things on a server.)
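Monit has its own configuration language, so the following is not our monit setup. It is only a Python sketch of the general idea behind this kind of watchdog: check the service at regular intervals, and trigger a restart after several failed checks in a row. The function name `restarts_needed` and the threshold `failures_before_restart=3` are hypothetical.

```python
def restarts_needed(check_results, failures_before_restart=3):
    """Return the indices of the checks at which a restart would fire.

    `check_results` is a sequence of booleans, one per periodic health
    check (True = server answered, False = server down). A restart is
    triggered after `failures_before_restart` consecutive failures, and
    the failure counter starts over afterwards.
    """
    restarts = []
    consecutive_failures = 0
    for index, healthy in enumerate(check_results):
        if healthy:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failures_before_restart:
            restarts.append(index)
            consecutive_failures = 0
    return restarts


# Up, then down for three checks in a row, then back up.
print(restarts_needed([True, False, False, False, True]))
```

Waiting for a few failures in a row, rather than reacting to the first one, keeps the watchdog from restarting a server that merely hiccuped.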
And so, long story short - we're very sorry about the inconvenience, and we've worked hard to make sure this problem doesn't happen again. If it does, ping us on Twitter or shoot us an email (we're admin [at] roleplaygateway (dot) com), and we'll get the problem resolved as quickly as possible!
Now, back to your regularly scheduled roleplay...