Topic: Cody got into the electrical cables again .. Server went down on 12/8/11  (Read 26147 times)

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,914
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
Actually it was the entire hosting company (Softlayer) network where the DC server lives.. They lost power for a while.
Sorry for the downtime, and a huge thanks to Gothic for bringing everything back up cleanly.  :up:
« Last Edit: December 07, 2011, 10:24 PM by mouser »

tomos

  • Charter Member
  • Joined in 2006
  • ***
  • Posts: 11,964
    • View Profile
    • Donate to Member
Re: Cody got into the electrical cables again .. Server went down on 12/1
« Reply #1 on: December 07, 2011, 10:17 PM »
Thought there was some funny business going on,
welcome back ;-)


EDIT/ do you mean 1st December by 12/1?
(Site was down an hour ago or so - don't know for how long - but this often happens around this time due to backup - usually get a message though)
Tom
« Last Edit: December 07, 2011, 10:26 PM by tomos »

tomos

  • Charter Member
  • Joined in 2006
  • ***
  • Posts: 11,964
    • View Profile
    • Donate to Member

you got your edit in there just before mine...
Tom

rgdot

  • Supporting Member
  • Joined in 2009
  • **
  • Posts: 2,193
    • View Profile
    • Donate to Member
Cody did a bit more damage, it seems. Why is this post, for example, being timestamped 11-something PM? It's 6:09 AM Eastern right now.

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,914
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
thanks for pointing that out rgdot.. that seems to happen every time the server is rebooted.  should be fixed now.

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 13,291
  • Tell me something you don't know...
    • View Profile
    • Renegade Minds
    • Donate to Member
It took a long time.

I don't know WTF is up with the Seattle data center... I'm in Texas, and never have any problems there. None.

Isn't this like the second time they've had that same problem?
Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker

rgdot

  • Supporting Member
  • Joined in 2009
  • **
  • Posts: 2,193
    • View Profile
    • Donate to Member
Thanks mouser and Gothic

mahesh2k

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 1,426
    • View Profile
    • Donate to Member
I thought cody took my Alien vs Cody image seriously  :P

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 7,544
  • @Slartibartfarst
    • View Profile
    • Read more about this member.
    • Donate to Member
Actually it was the entire hosting company (Softlayer) network where the DC server lives.. They lost power for a while.

Is it normal for hosting companies to have power outages?
If so, then maybe I am missing something here, because I don't understand that.
Whenever I have been involved in setting up data centres, my training has required me to ensure that the risk mitigation plan always insists on there being at least four basic built-in redundancies (i.e., quite apart from computer system redundancies):
  • dual/backup air conditioning systems.
  • dual telecomms links (using two different telco supplier networks).
  • onsite single or dual backup diesel power generators - which automatically kick in when the power dies/fluctuates.
  • interim UPS (batteries) for server systems. (This supply allows sufficient time for the generators to get up to full capacity after they automatically kick in - a rough sizing sketch follows this list.)
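
A rough back-of-the-envelope version of that last point, purely as an illustration: every figure below (battery capacity, critical load, generator start-up time) is a made-up example, not a number from any real facility, but it shows the kind of margin check the design has to pass.
Code:
def ups_runtime_minutes(battery_kwh: float, load_kw: float, derate: float = 0.8) -> float:
    # Estimated minutes the UPS batteries can carry the load, with a derating
    # factor for battery age and inverter losses.
    return (battery_kwh * derate) / load_kw * 60

ups_capacity_kwh = 40.0     # hypothetical usable battery energy
critical_load_kw = 120.0    # hypothetical critical IT load
generator_start_min = 2.0   # hypothetical time for generators to reach full capacity

runtime = ups_runtime_minutes(ups_capacity_kwh, critical_load_kw)
margin = runtime - generator_start_min
print(f"UPS runtime at critical load: {runtime:.1f} min")
print(f"Margin over generator start-up: {margin:.1f} min")
if margin <= 0:
    print("Under-sized: the load would drop before the generators take over.")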

If your hosting contract expressly excludes these things, then presumably you would receive service at a significant discount.

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,914
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
the crazy thing is we are paying for an expensive hosting company (softlayer) specifically because they are supposed to be one of the most reliable companies with the best redundancies, etc.

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 7,544
  • @Slartibartfarst
    • View Profile
    • Read more about this member.
    • Donate to Member
the crazy thing is we are paying for an expensive hosting company (softlayer) specifically because they are supposed to be one of the most reliable companies with the best redundancies, etc.
Well then, I would recommend (if you don't mind, and I'm not trying to teach you to suck eggs), from l-o-n-g experience of IT service contracts on both sides (customer/supplier), that you carefully scrutinise your contract and/or SLA (Service Level Agreement) for conditions and in particular penalty clauses in the event of deteriorated service or loss of service.

If there are any penalty clause provisions, then I think (from memory) you could legally claim either of:
(a) actual or reasonable notional consequential costs, or loss of revenue/profit arising from the outage.
or:
(b) punitive damages ("And don't do it again!" type damages)
- but not both.

If there isn't any penalty clause, then you may have unwittingly signed a contract with no teeth for the customer in the event of an outage such as this. A contract for "All care and no responsibility".

If you get no financial recompense for the outage, and if you believe that you are:
... paying for an expensive hosting company (softlayer) specifically because they are supposed to be one of the most reliable companies with the best redundancies, etc.
- then I'd suggest that you may have been paying out money under false pretences for as long as you have been using that supplier, and should swap suppliers because of that fact alone, and ask for a full/partial refund.

If you told your account manager/rep. that you were considering this, then it might be interesting to see what sort of response that gets.
  • A favourable (to you) response would probably indicate that they are interested in holding onto your business.
  • An unfavourable response would probably give the lie to any notions or expectations of "customer care and QOS" that you might have held regarding this supplier.

As to the outage itself, if it really shouldn't have happened because the supplier - to your knowledge - had all the appropriate redundancies/backups in place, then - by definition - that could mean that there was a process failure somewhere.
In my experience the main thing that usually gets in the way of a good service manager providing his services to meet an SLA is a pencil-head (usually an accountant). So be on the lookout for that as a possibility. They may have been cost-cutting and hoping to get away with it.

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 7,544
  • @Slartibartfarst
    • View Profile
    • Read more about this member.
    • Donate to Member
I was very curious as to what Softlayer had by way of the things I listed above:
  • dual/backup air conditioning systems.
  • dual telecomms links (using two different telco supplier networks).
  • onsite single or dual backup diesel power generators - which automatically kick in when the power dies/fluctuates.
  • interim UPS (batteries) for server systems. (This supply allows sufficient time for the generators to get up to full capacity after they automatically kick in.)
So I got onto the main website and clicked on "Chat with a real person" and type-chatted with one "Austin P".
He pointed me to: Data Centres

From that, it looks like a really well set-up and professional outfit. They seem to have all the usual power and battery backups sorted.

So, what was their explanation for the outage that hit you?
And did it hit all their customers similarly, from that data centre?
Enquiring minds need to know.
Customers hit by the outage would expect to be told.

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,914
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
And did it hit all their customers similarly, from that data centre?

yeah it took down the whole data center in seattle as far as i know.

here's what they said to us.. though i would take these things with a grain of salt, as my experience is that hosting companies stretch the truth a bit to paint things rosier than they were:

Around 02:54 UTC on 08-DEC-2011, there was a disruption to utility power to our SEA01 datacenter facility. The backup generator / UPS system subsequently experienced trouble backing up the critical load, and at 03:16 UTC on 08-DEC-2011 a section of our datacenter went offline due to power loss. This power outage took our site core routers offline, which caused a total network disruption to servers in the SEA01 facility. Additionally, some servers lost power.

Power was restored for the network equipment at 03:32 UTC on 08-DEC-2011, at which time servers in pod 02 in SEA01 (fcr02.sea01 & bcr02.sea01) came back online, as well as back-end for pod 01 (bcr01.sea01). The front-end router for pod 01 (fcr01.sea01) did not recover properly after power was restored, and finally at 03:50 UTC on 08-DEC-2011 all network services were back online.

Our facilities and system administration team are currently working on restoring service to any servers or CCIs (cloud instances) that did not automatically recover after power resumed.

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 7,544
  • @Slartibartfarst
    • View Profile
    • Read more about this member.
    • Donate to Member
Looks like someone probably screwed up big time - and it will probably be human error, because automated power UPS/generators work just fine otherwise.
The backup generator / UPS system subsequently experienced trouble backing up the critical load, and at 03:16 UTC on 08-DEC-2011 a section of our datacenter went offline due to power loss.

This tells you what happened, but not why.
Would be interesting to know what they determine the cause to be.

anandcoral

  • Honorary Member
  • Joined in 2009
  • **
  • Posts: 783
    • View Profile
    • Free Portable Apps
    • Donate to Member
I had sweat on my forehead when I could not get through in the morning session. Though I do not remember how long I kept refreshing the "downforall.." page with the DC link, it did look like more than an hour.

I cannot say how relieved I felt when it said DC was running.

One suggestion: while DC was down, we were all asking just one question - "what happened?". Can we have a mirror/separate site, with just the official information on DC's current status? Mouser or a volunteer could keep it updated once a day in normal cases and more frequently during downtime (hope it does not happen again). No more than 1 or 2 pages and no comment features, just read only.

Since both would reside on separate places/servers, one can check the other if he/she has a problem getting through the first.
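
As a minimal sketch of the idea (the status-file name and the schedule are just placeholders, not anything that exists today), the second server could run something like this periodically and serve the resulting page read-only:
Code:
import datetime
import urllib.request

MAIN_SITE = "https://www.donationcoder.com/"   # the site being watched
STATUS_FILE = "status.html"                    # placeholder: read-only page the mirror serves

def check_main_site() -> str:
    # Report UP with the HTTP status code, or DOWN with whatever error we hit.
    try:
        with urllib.request.urlopen(MAIN_SITE, timeout=10) as resp:
            return f"UP (HTTP {resp.status})"
    except Exception as exc:
        return f"DOWN ({exc})"

stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M UTC")
with open(STATUS_FILE, "w") as fh:
    fh.write(f"<p>DonationCoder main site: {check_main_site()} - last checked {stamp}</p>\n")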

What do you think about it?

Regards,

Anand

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,914
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
It's a very good idea.  We need to have another small forum somewhere so we can meet when donationcoder main site goes offline, to provide information, etc.

Meanwhile if you install an irc chat program (lots of free ones), you can always find us chatting any hour of the night on the efnet network, on channel #donationcoder (that's where you go if you hit the chat button at the top of the page).
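
For anyone without an IRC client handy, here is a bare-bones sketch of what joining that channel involves at the protocol level. The server name irc.efnet.org is an assumption (any EFnet server should do) and the nick is a placeholder; a real client handles all of this for you.
Code:
import socket

HOST, PORT = "irc.efnet.org", 6667   # assumed EFnet host, standard plain-text port
NICK = "dc_visitor"                  # placeholder nick - pick your own
CHANNEL = "#donationcoder"           # the channel mentioned above

sock = socket.create_connection((HOST, PORT))
sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :DC visitor\r\n".encode())

buf = b""
while True:
    data = sock.recv(4096)
    if not data:                      # server closed the connection
        break
    buf += data
    *lines, buf = buf.split(b"\r\n")  # keep any partial line for the next read
    for raw in lines:
        line = raw.decode(errors="replace")
        print(line)
        if line.startswith("PING"):   # answer keep-alives or the server drops us
            sock.sendall(line.replace("PING", "PONG", 1).encode() + b"\r\n")
        elif " 001 " in line:         # numeric 001 = welcome; safe to join now
            sock.sendall(f"JOIN {CHANNEL}\r\n".encode())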

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 13,291
  • Tell me something you don't know...
    • View Profile
    • Renegade Minds
    • Donate to Member
mouser, isn't this the second time that has happened?

(I'm with Softlayer as well, but in the Texas data centers, and I've NEVER had any problems.)
Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 13,291
  • Tell me something you don't know...
    • View Profile
    • Renegade Minds
    • Donate to Member
It's a very good idea.  We need to have another small forum somewhere so we can meet when donationcoder main site goes offline, to provide information, etc.

Meanwhile if you install an irc chat program (lots of free ones), you can always find us chatting any hour of the night on the efnet network, on channel #donationcoder (that's where you go if you hit the chat button at the top of the page).

Is the cloud actually at the point where you can run a real application in the cloud? Like a forum? A data driven application? I mean like a decentralized solution that you can still run off of traditional DNS, and not an uber-massive DDNS redundant server beast. Just a simple little solution that's decentralized and will let small sites (e.g. 1 server) run?

Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker

IainB

  • Supporting Member
  • Joined in 2008
  • **
  • Posts: 7,544
  • @Slartibartfarst
    • View Profile
    • Read more about this member.
    • Donate to Member
Is the cloud actually at the point where you can run a real application in the cloud? Like a forum? A data driven application? I mean like a decentralized solution that you can still run off of traditional DNS, and not an uber-massive DDNS redundant server beast. Just a simple little solution that's decentralized and will let small sites (e.g. 1 server) run?
Well, you could migrate DCF to Google groups, I suppose...
That's distributed and backed up all over the place, I gather. Not sure if that means that it is in the "Cloud" though.

JavaJones

  • Review 2.0 Designer
  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 2,739
    • View Profile
    • Donate to Member
Seriously, is down time enough of an issue that we need a *second forum* to "discuss DC stuff while the main forum is down"? Woah. That's... crazy in my opinion. Just get a reliable hosting service! The IRC channel is enough for those "in the know" (who are the only people who would know about and use an alternate forum when the main one is down anyway). I seriously don't think resources and time should be wasted on a secondary forum.

Anyway, has anyone ever heard of a data center power outage that *went well*, i.e. according to plan? Something like "At 3PM Pacific Time on December 11th, our San Jose data center lost power. Our UPS units kept all systems online while our backup generators kicked in and there was no interruption of service. Our backup generators powered all systems for 9 hours until the utility company could restore power. Thank you for your patronage." End of story. I think I've seen that maybe *once*, yet 10s of times I've seen "Our backup systems got overloaded and failed, then things went down for x amount of time. Sorry!" What good are backup systems that fail themselves?

What about switching to a different Softlayer data center?

- Oshyan

Renegade

  • Charter Member
  • Joined in 2005
  • ***
  • Posts: 13,291
  • Tell me something you don't know...
    • View Profile
    • Renegade Minds
    • Donate to Member
Did it go down again? I noticed more downtime.
Slow Down Music - Where I commit thought crimes...

Freedom is the right to be wrong, not the right to do wrong. - John Diefenbaker

Ath

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 3,629
    • View Profile
    • Donate to Member
All I saw was that today the forum backup had shifted forward (as in: later than usual) by 1 hour, but that could be on purpose, or have to do with the server time setting issue that appeared after the server was re-powered, as rgdot pointed out earlier in this thread.

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,914
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
It did go down again last night.. not softlayer but our server.. it started the backup process and cpu usage climbed to the point where the server was unreachable, and it never came back down until the server had to be hard rebooted..

40hz

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 11,859
    • View Profile
    • Donate to Member
Was that a virtual machine, or the hardware server itself, that maxed out?

mouser

  • First Author
  • Administrator
  • Joined in 2005
  • *****
  • Posts: 40,914
    • View Profile
    • Mouser's Software Zone on DonationCoder.com
    • Read more about this member.
    • Donate to Member
the vmware virtual machine running donationcoder.com; we could see the cpu load go off the chart right as it started performing backups and it just never came down on its own and we couldn't get into the vmware console for it.  very strange.