First of all, while mouser states "we don't know" he really means "he doesn't know"
I tried to explain, but oh well...
The problem is this:
Recently we have been implementing a lot of caching in order to speed things up. The database currently does quite some heavy query caching, and on the apache side we have besides the usual stuff, APC for php opcode caching (which the forum software, smf, can use the API for). All of this, of course comes at a cost of memory.
We had maxclients set quite high (200), which makes things run pretty fast with prefork, since there's always processes around to serve new clients. The problem is that the average apache process size has grown to around 30MB, so worst case scenario is 20*30MB=5.8GB, which is more memory than the server has.
In fact, after the aggressive MySQL caching, there is only 780 MB free for apache to play around with.
This is why I have decreased MaxClients to 20, which will prevent us from running out of memory, at the cost of some slowdown.
So really, all we have to do at this moment is probably just make the MySQL caching a bit less crazy, to make some more room for Apache.
As f0dder points out sort of kind of, the fact that Apache with prefork is process based, it makes memory consumption a bit hard to predict (since the memory size of a process changes by what script is running, moreover, if you use keepalive, it grows with the number of requests a client makes over the same connection to the same client thus making things even more unpredictable).
I wouldn't go as far as switching httpd, but we could switch MPM. There is an experimental event MPM for apache, but I'm not sure how I feel about using an experimental MPM on a production server.
To make matters worse, packet loss at softlayer has been really hurting us.
This is their latest excuse:
SoftLayer Engineers are aware of the sporadic packetloss and/or connectivity to FCR01.SEA01. Currently engineers are working on resolving this issue; however, there is a chance a reboot to the router will be required. In the event this needs to happen, a notice will be posted here along with any other information gathered.
-- Update --
Service has been restored to customers behind FCR01.SEA01. During the process of working on the router, the issue manifested itself into 100% CPU resulting in upstream and downstream links going down along with routing protocols. Engineers were able to stabilize the router without a reload, and are currently monitoring it to determine if the fix is permanent.
-- UPDATE --
Engineers have determined this is not a permanent fix for the FCR01.SEA01 issue. During the course of troubleshooting this issue with Cisco, it has been determined that the best course of action is to upgrade the router to the latest IOS version. This will be happening at approximately 01:30 CDT.
--UPDATE--
Engineers are continuing to work with Cisco TAC on this issue. At this point, the router has been restored to service at approximately 03:20 CDT. Some customers may continue to experience intermittent packetloss behind FCR01.SEA01.
Working for a hosting provider myself, I understand that stuff happens, but this packet loss stuff with them has been going on ever since we moved to their seattle data center
every day.
Also, one would think that softlayer is big enough to have a replacement router ready they can just put in (or have a proper redundant network that can route around it for that matter).