One of the services my company offers is SaaS for companies that need a plug-in to help with their logistics operations. We provide a small network of servers hosting several web services with various APIs. Naturally, this means we have to offer support, and we are committed to fairly aggressive SLAs.
To help with this, I decided to implement a continual monitoring system to alert us to any systems falling over in our little farm. Being on a limited budget and an inveterate code tinkerer, I set up a little project to see if I could rustle something up. As far as I am aware, this is a novel approach to real-time systems monitoring: a sort of "poor man's Tivoli", if you like.
I have set up each machine in our cluster to monitor two others, an arrangement that has worked well for us. If we lose a node, at most two other servers alert us to the fact. Even with multiple node failures, the number of alert messages stays manageable instead of swamping us.
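One way to arrive at "each machine monitors two others" is to arrange the nodes in a ring and have each one poll its next two neighbours. The sketch below assumes exactly that; the ring layout, the `NodeCount` value and the `PartnerOf` and `WatcherCount` names are illustrative and not taken from our actual configuration.

```pascal
program RingPartners;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  NodeCount = 5;  // hypothetical cluster size; needs to be at least 3

// Index of the partner that node Index polls (Offset is 1 or 2).
function PartnerOf(Index, Offset: Integer): Integer;
begin
  Result := (Index + Offset) mod NodeCount;
end;

// Number of nodes whose partner list contains Target.
function WatcherCount(Target: Integer): Integer;
var
  Node, Offset: Integer;
begin
  Result := 0;
  for Node := 0 to NodeCount - 1 do
    for Offset := 1 to 2 do
      if PartnerOf(Node, Offset) = Target then
        Inc(Result);
end;

var
  I: Integer;
begin
  // Every node polls two partners and is itself watched by two nodes.
  for I := 0 to NodeCount - 1 do
    WriteLn(Format('node %d polls %d and %d, and is watched by %d nodes',
      [I, PartnerOf(I, 1), PartnerOf(I, 2), WatcherCount(I)]));
end.
```

With this layout a failed node is reported by exactly two watchers, so the number of alerts grows with the number of failed nodes rather than with the size of the cluster.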
So what happens is that every minute, each machine asks its partner machines for their status. This is done by a small program I wrote, installed as a Windows service. Every 60 seconds it wakes up and makes a simple call to each web service that it knows about. Depending on the result of this call, the software logs the outcome and takes further action if required.
In our case, the web services respond to the equivalent of a ping request by returning a brief status message.
One of three things can happen:
- no response: panic in the streets!
- bad response: check as soon as possible
- good response: back to sipping coffee
This diagram exemplifies the steps taken.
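As a rough sketch of the three outcomes listed above, the check can be reduced to a small classification function. The `'OK'` status text, the `TReplyKind` type and the function names are assumptions made for illustration; the real status message our services return is not shown in this article.

```pascal
program ReplyDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TReplyKind = (rkNoResponse, rkBadResponse, rkGoodResponse);

const
  ExpectedStatus = 'OK';  // assumed healthy reply, not the real message

// Responded: did the call complete at all? Body: the text it returned.
function ClassifyReply(Responded: Boolean; const Body: string): TReplyKind;
begin
  if not Responded then
    Result := rkNoResponse
  else if SameText(Trim(Body), ExpectedStatus) then
    Result := rkGoodResponse
  else
    Result := rkBadResponse;
end;

// Map each outcome to the action described in the list above.
function Describe(Kind: TReplyKind): string;
begin
  case Kind of
    rkNoResponse:  Result := 'no response: alert immediately';
    rkBadResponse: Result := 'bad response: check as soon as possible';
  else
    Result := 'good response: nothing to do';
  end;
end;

begin
  WriteLn(Describe(ClassifyReply(False, '')));
  WriteLn(Describe(ClassifyReply(True, 'ERROR: queue stalled')));
  WriteLn(Describe(ClassifyReply(True, 'ok')));
end.
```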
I decided to keep it very simple: if we bump into an issue, the event is logged so we can trace back when things began to go wrong. But also, if there is an important state change, several of us on the support team are instantly notified by a Skype instant message right to our desktops. Our monitoring app can also dial out via Skype to whoever is on call.
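The "log every issue, but only notify on a state change" rule might look something like the following. This is a simplified sketch rather than the code from the monitoring service: `Notify` stands in for the real instant-message call, and the `TNodeState` type and `RecordPoll` name are made up for the example.

```pascal
program StateChangeDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TNodeState = (nsUnknown, nsUp, nsDown);

var
  LastState: TNodeState = nsUnknown;  // remembered between polls
  AlertsSent: Integer = 0;

// Stand-in for the real IM call; here it only prints and counts.
procedure Notify(const Text: string);
begin
  Inc(AlertsSent);
  WriteLn('IM:  ', Text);
end;

// Log problems and transitions; notify only when the node goes down
// or comes back up after having been down.
procedure RecordPoll(NewState: TNodeState);
begin
  if (NewState = nsDown) or (NewState <> LastState) then
    WriteLn('LOG: state code ', Ord(NewState));
  if NewState <> LastState then
  begin
    if NewState = nsDown then
      Notify('WARNING: service DOWN')
    else if LastState = nsDown then
      Notify('INFORMATION: service UP');
    LastState := NewState;
  end;
end;

begin
  RecordPoll(nsUp);    // first poll: logged, no alert
  RecordPoll(nsDown);  // transition: DOWN alert
  RecordPoll(nsDown);  // still down: logged, no repeat alert
  RecordPoll(nsUp);    // recovery: UP alert
  WriteLn('alerts sent: ', AlertsSent);
end.
```

Keeping the last known state per node is what stops a node that stays down from generating a fresh message every 60 seconds.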
And this is the part that I believe is fairly unusual. I used the Skype API so that the servers can send messages to anyone who is currently interested in their status, without those people having to be logged in to the boxes or even have any kind of account on them. This means we can safely include end-users if they are interested in collating their own uptime stats.
So after installing Skype on our servers, all I had to do was write a little bit of code. This was almost pathetically easy: the entire monitoring application is perhaps 300 lines of Delphi code. Skype provides an API with some easy examples to show you how to hook in. Every 60 seconds the program wakes up and first pings each node on its list. If the ping succeeds, it then makes a status enquiry to the web services themselves.
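The ping-then-enquire order described above can be sketched as a two-stage check. The probes are passed in as plain functions so the example stays self-contained; the `TProbe` signature, the stub probes and the `node-a` host name are hypothetical, and a real service would call an ICMP ping and the web service itself at those points.

```pascal
program TwoStageDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TCheckResult = (crNodeUnreachable, crServiceDown, crHealthy);
  // Hypothetical probe signature: returns True when the probe succeeds.
  TProbe = function(const Host: string): Boolean;

const
  ResultNames: array[TCheckResult] of string =
    ('node unreachable', 'service down', 'healthy');

// Ping first; only ask the web service for its status if the ping worked.
function CheckNode(const Host: string; Ping, Status: TProbe): TCheckResult;
begin
  if not Ping(Host) then
    Result := crNodeUnreachable
  else if not Status(Host) then
    Result := crServiceDown
  else
    Result := crHealthy;
end;

// Stub probes standing in for the real network calls.
function ProbeOk(const Host: string): Boolean;
begin
  Result := True;
end;

function ProbeFails(const Host: string): Boolean;
begin
  Result := False;
end;

begin
  WriteLn(ResultNames[CheckNode('node-a', ProbeFails, ProbeOk)]);
  WriteLn(ResultNames[CheckNode('node-a', ProbeOk, ProbeFails)]);
  WriteLn(ResultNames[CheckNode('node-a', ProbeOk, ProbeOk)]);
end.
```

Separating the two stages lets the alert say whether the whole box has vanished or only the service on it has stopped answering.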
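Code fragment from the app's OnCreate handler (in effect, its constructor):
    try
      Skype := TSkype.Create(Self);
      Skype.OnMessageStatus := SkypeMessageStatus;
      Skype.Attach(8, False);  // protocol version 8, do not wait for the attach to complete
      SendMessage('Bxxxxx Monitor started');
    except
      on e: Exception do begin  // handler body not shown here
      end;
    end;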
Code fragment showing how easy it is to send an IM:
    try
      if Skype.AttachmentStatus <> 0 then  // 0 means already attached
        Skype.Attach(8, False);
      Skype.SendMessage(SkypeContacts, message);
    except
      on e: Exception do begin  // handler body not shown here
      end;
    end;
... and to handle incoming messages:
    procedure TMainForm.SkypeMessageStatus(Sender: TObject;
      const pMessage: IChatMessage;
      Status: TChatMessageStatus);
    begin
      case Status of
        cmsReceived: Logger.Log('Recv ' + pMessage.FromHandle + ': ' + pMessage.Body);
        cmsSent: Logger.Log('Sent ' + pMessage.Chat.DialogPartner + ': ' + pMessage.Body); // handle multiple partners
      end;
    end;
As you can see, what's really nice is that I can now write entries into any server's log file from my desktop, simply by sending an IM from my Skype account to the destination server! We use this facility to annotate various activities, and it provides a useful audit trail.
Sample from our log file:
- 9/25/2010 1:36:22 PM-I-Sent Timxxxxx: WARNING: Bxxxxx DOWN on node xxx.xxx.xxx.xxx
- 9/25/2010 1:36:22 PM-I-Recv Timxxxxx: Ok, I got this one
- 9/25/2010 1:36:25 PM-I-Sent Txxxxx: INFORMATION: Bxxxxx UP on node xxx.xxx.xxx.xxx
- 9/25/2010 1:36:27 PM-I-Recv Timxxxxx: Ping timeout. I blame the ISP ;)
Of course, at this point you are saying to yourself, "How do you keep Skype running when you are not even logged into the servers?" It's a good question. To keep Skype running, and have it survive reboots without a re-login, I set it up as a Windows service. To achieve this, I used the excellent XYNTService program.
In conclusion, I believe this approach is very useful for small and budget-conscious businesses that need some automated way to keep an eye on the health of their services. I've been using this technique for nine months now and it has proven to be very reliable: so far we have not missed a single occurrence of downtime.
Summary of advantages
- It is a lightweight, free solution
- It has proven to be very stable
- You get an instant message and/or a call immediately any problem is detected
- You and your team can subscribe to or unsubscribe from any of the servers' contact lists in order to switch yourselves in and out of support (we use block/unblock)
- It requires minimal development and configuration
- The paradigm is scalable
Summary of disadvantages
- You have to have Skype running. Personally I rely on Skype a great deal, and encourage our userbase to contact us via Skype for low-priority support.
- If Skype goes down, as it famously did in early January 2011, you will have to find an alternative way to check your systems
- I would not call it an Enterprise solution
If you're at all interested in more information just give me a shout. I would be happy to show you an instance of the system in action on a test server, and/or help you set up something similar for yourself.