Optimise Service Monitoring Alert Generation

Support, Questions and Discussions on MonitorWare Agent

Moderator: alorbach

Postby mgeschke on Tue Nov 16, 2004 1:46 pm

Hi
In line with organising our diskspace alerts, I would like to implement a more efficient Service Monitoring routine.

The following are my requirements:
- The NT Service Monitor is set to execute every 1 minute. It checks for any failed services.
- An Alert (Event Log Message) should be generated only the second time a specific service fails (this allows service restarts without notifying support staff unnecessarily).
- After the first Alert has been sent, subsequent Alerts should be generated only at 30-minute intervals.
- If a service recovers, the whole cycle should start again.

At first I thought I could implement the above using Status variables, similar to what Rainer suggested for handling disk space Alerts, but here I ran into a problem: I cannot increment the value of a status variable by one to reflect the number of repetitions for which a particular service has already failed (in the above example I would test for the status variable values of 1 and 30). Similarly, the minimum wait time option cannot be used, as it is filter-specific and I need to manage the state for an unknown number of services.

Have I expressed myself clearly? Any help would be greatly appreciated.

Regards
Mark
mgeschke
 

Postby alorbach on Mon Nov 22, 2004 6:10 pm

Hi,

This is a very complex task to accomplish with MWAgent.
Two weeks ago I created a configuration which does the following:

- NT Service Monitor
When a Service is not running, an error email is generated and a status variable is set. After that, no more emails are generated. If the Service is started again, another email is sent saying that the Service is running again, and the process starts from the beginning.

I think using this as a template for your task could be helpful.
But it will of course need some more modifications to work in a way you want it to.

best regards,
Andre Lorbach
alorbach
Site Admin
 
Posts: 812
Joined: Thu Feb 13, 2003 11:55 am

Postby Guest on Tue Nov 23, 2004 7:04 am

Hi

If I had to guess, I would assume that you are using Status Variables in a way similar to what Adiscon recommended for implementing Disk Space Alerts. I thought about this; however, the requirement from the monitoring staff is that, when a service becomes unavailable, they are periodically reminded that the service is unavailable (there might be many reasons why somebody "forgets" to react to the first alert and needs subsequent alerts to "remind" him).

Looking forward to a possible solution.
Regards
Mark
Guest
 

Postby alorbach on Tue Nov 23, 2004 6:46 pm

Hi,

we needed a few modifications but I was able to create a Ruleset that does exactly (maybe even more) what you wanted.

But it will need the latest MWAgent RB version to run (at least build 262), see http://forum.adiscon.com/topic,684.html

The setup contains new samples now. If you are going to test it, I recommend trying it on a test machine first.

There is a sample called NTServiceMonitor Alerter & Recoverer now. It contains a registry file which you can import, or a mwx (RuleSet Only) file which you can attach to your configuration.

This RuleSet can be used as a general NTServiceMonitor Handler.
No matter which Service fails, the Status will be handled separately for each one.

So you can for example have 50 Services installed on your machine, all will use the same RuleSet without further modifications.
For more details see the comments within the Rules and don't forget to configure the SendEmail Actions to your needs ;).

regards,
Andre

Continued Problems with Service Monitoring

Postby mgeschke on Thu Dec 02, 2004 7:46 am

Hello again
Thanks for trying to implement this functionality and providing a sample RuleSet.

There are, however, some problems in deploying the ruleset and I had to make some changes:
1. In the InitStatusVar Rule, the %SourceProc% is incorrectly set to "TestOnly"
2. The action that is executed to start the service should read net send "%SourceProc%" (with the variable in quotes) to correctly start services that have a space in their name.

Even with the above changes, however, I was still not able to correctly report on two services that are simultaneously down and I have created a new, simplified RuleSet to find the bug. In concept, the new ruleset is very simple to understand:
a) It creates and sets a Status variable (%SourceProc%) to true if a Service is running
b) It creates and sets a Status variable (%SourceProc%) to false if a service is not running
c) For a service where the Status variable is set to false, it executes the ServiceDownReminder Rule which (apart from writing to the event log) does exactly what the similarly named Rule does in the Sample.

The following problems were found (I will send my ruleset by email):
a) StatusVariables do not appear to be correctly initialised with new values as new services are tested. On writing the %SourceProc% to the Event Log, it appears as if only the first %SourceProc% is ever used...
b) As far as I understand, the "Minimum Wait Time" option applies to the entire Rule and not to a specific "Input". This would then have the effect that the Rule only fires for the first service notification and then "sleeps" for the specified amount of time (irrespective of any other service state changes).

The reason why I first posted this request was exactly because I experienced problems with the b) condition above. Am I perhaps not understanding the capabilities of the Rule engine correctly?

Regards
Mark
mgeschke
 

Postby alorbach on Thu Dec 02, 2004 12:43 pm

First of all, the "InitStatusVar" Rule was there for a reason, and it also set the Status value to "TestOnly" for a reason. Without that rule, the whole RuleSet won't work, and if you changed it, that explains why you ran into all those problems.

I had added comments to all rules and rulesets; please take a look at them as well.

The basic ruleset works for sure, I have it already running here - with success.

So please go back to the original state of the RuleSet which is installed with the current RB, change only the Email Server, sender and recipient, and let it run for some tests.

regards,
Andre

Postby alorbach on Thu Dec 02, 2004 3:01 pm

Ok after some more review, I found out the XML Import/Export Function in the Client has a little bug, which breaks the functionality of the whole RuleSet.

The first Rule "InitStatusVar" has one special Global Filter Condition Option set: "Treat not found filters as TRUE". This small but important setting initializes the needed dynamic status variables. Unfortunately, the XML Export is missing this Option. Just enable this option and the whole RuleSet will work. Alternatively, import the registry file, which is available as well.
I apologize that I didn't see this before and accused you of making a mistake. :roll:
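
One plausible reading of why this option matters, sketched in Python. This is a hypothetical model, not MonitorWare Agent code: the `status_vars` dictionary, the function names and the placeholder values "UNSET" and "running" are all invented for illustration. The idea is that a filter on a status variable that does not exist yet cannot match unless "not found" counts as TRUE, so without the option the initialising rule never fires for a new service.

```python
# Hypothetical model only; names and values are assumptions, not product internals.
status_vars = {}  # dynamic status variables, keyed by service name


def filter_matches(name, expected, treat_not_found_as_true):
    """Evaluate a filter of the form 'status variable <name> equals <expected>'."""
    if name not in status_vars:
        # The variable was never set, so the outcome depends on the option.
        return treat_not_found_as_true
    return status_vars[name] == expected


def run_init_rule(service, treat_not_found_as_true):
    """Stand-in for an initialising rule: create the variable on first sight."""
    if filter_matches(service, "UNSET", treat_not_found_as_true):
        status_vars[service] = "running"
        return True
    return False
```

With the option off, `run_init_rule("Print Spooler", False)` never creates the variable; with it on, the first call creates it and later calls leave it alone.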

Regarding the "Minimum Wait Time" option: it becomes active after the FIRST time a Rule has been triggered. This means you will get ONE notification, and then another after 30 minutes, provided messages are still being generated and the filters still evaluate to true.

Postby mgeschke on Thu Dec 02, 2004 3:18 pm

The suggestion to enable the "Treat not found filters as TRUE" does cause the ruleset now to trigger as expected for the first service that fails. However, if another service fails (and cannot be restarted) while this first one is still in the "failed" condition (and is between the first "ServiceDownReminder" and the second one), the following happens:
a) The "RecoverFailed" Rule executes the alert action with the requested information, except that it again uses the name of the first service that failed, i.e. I get notified another time that the first service has failed!
b) The "ServiceDownReminder" is never executed for the second service, i.e. I only get to receive "ServiceDown" reminders for the first service (and this is because it comes before the second service in the alphabet). If I understand correctly (and hence all my questions around this issue), the "Minimum Wait Time" option becomes available after the first time a Rule has been triggered, but it would then be unavailable for any other possible actions pertaining to other Services that might need attention.

Why is it sometimes so difficult to express yourself clearly and unambiguously for all to understand? I hope I have done a better job this time around :(
mgeschke
 

Postby Guest on Thu Dec 02, 2004 3:24 pm

mgeschke wrote: b) The "ServiceDownReminder" is never executed for the second service, i.e. I only get to receive "ServiceDown" reminders for the first service (and this is because it comes before the second service in the alphabet). If I understand correctly (and hence all my questions around this issue), the "Minimum Wait Time" option becomes available after the first time a Rule has been triggered, but it would then be unavailable for any other possible actions pertaining to other Services that might need attention.


I have not yet been able to look at the rule set, but I think Mark is right here, Andre. If you use the minimum wait time, the *rule* is disabled during that period. So it does not help to do some more checks inside it. Anyhow, I will try to look at the set ASAP, but I thought I should point this out.
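
The difference between a rule-wide wait and a per-service wait can be illustrated with a small Python simulation. This is only a sketch of the behaviour described in this thread, not of the actual rule engine: the class names, the `simulate` helper, the service names and the event times are invented for illustration.

```python
class RuleWideWait:
    """One timer for the whole rule, as 'Minimum Wait Time' is described above."""

    def __init__(self, wait_minutes):
        self.wait = wait_minutes
        self.last_fired = None

    def should_fire(self, service, now):
        # The service name is ignored: any firing blocks every service.
        if self.last_fired is not None and now - self.last_fired < self.wait:
            return False
        self.last_fired = now
        return True


class PerServiceWait:
    """One timer per service, which is what the reminder actually needs."""

    def __init__(self, wait_minutes):
        self.wait = wait_minutes
        self.last_fired = {}

    def should_fire(self, service, now):
        last = self.last_fired.get(service)
        if last is not None and now - last < self.wait:
            return False
        self.last_fired[service] = now
        return True


def simulate(rule, events):
    """Return the (minute, service) pairs for which the rule fires."""
    return [(minute, svc) for minute, svc in events if rule.should_fire(svc, minute)]


# Two services that are down at the same time, checked at different minutes.
events = [(0, "Alerter"), (5, "Messenger"), (30, "Alerter"), (35, "Messenger")]
rule_wide = simulate(RuleWideWait(30), events)
per_service = simulate(PerServiceWait(30), events)
```

In this simulation the rule-wide timer only ever reports "Alerter", because each of its firings restarts the wait before "Messenger" gets a turn, while the per-service timer reports all four events.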

Rainer
Guest
 

Postby alorbach on Thu Dec 02, 2004 4:27 pm

First of all, I haven't paid much attention to the documentation of the RuleSet; I just created something that works in the first place and updated the RB Setup with it ASAP, so that you could test it.

The RuleSet itself works on a per-status-variable basis. As you can see, I am using a dynamic status variable everywhere (which is the Service name).
So it doesn't matter if 1, 2 or 3 services are down; each one is handled individually.

But you are right, there is a problem with the Service reminder. It will only work for the first failed service; we are using the RuleEngine here in a way it wasn't designed for. So this is actually an issue we have to live with, for now. It goes onto the todo list, but not at the top.

Apart from that, everything else will work as expected: you will get an email when a service is down and when it has recovered.

Postby rgerhards on Thu Dec 02, 2004 5:38 pm

Just to keep everyone updated: I've had a look at the ruleset and we had a discussion. As of now, there is no way around this trouble. However, we will implement an action that can be used to increment/decrement a status variable. With that, the situation can be addressed (thanks Mark for the suggestion).

Rainer
rgerhards
Site Admin
 
Posts: 1067
Joined: Thu Feb 13, 2003 11:57 am

Postby alorbach on Fri Dec 03, 2004 6:16 pm

FYI: The RB has been updated, including a complete new NTService Control Sample: http://forum.adiscon.com/topic,684.html

The Service is now started by the new "Control NT Service" Action, and the reminder is solved using a new Action called "Compute Status Variable". This makes sure the reminder will work for each service individually.

Postby mgeschke on Fri Dec 03, 2004 8:58 pm

Hi
Thank you very much for the new functionality (Increment Status Variable). This will definitely help to make sure that reminders will work for every service.

I have implemented the modified ruleset and this time the results were much better.
What worked:
- Individual Services triggered the rules
- The Status variables could be incremented
What did not work properly:
- The Notifications (writing to the event log) produced the same service information for every single service that triggered a notification.

Since the functionality in the sample RuleSet is not entirely the functionality that I am looking for, I have decided to simplify the RuleSet to make troubleshooting easier and simultaneously help me to get to my desired functionality. In essence, the Ruleset does the following (I will send the RuleSet directly via email to alorbach from Adiscon):
- Initialise the service status variables to 0
- Test if the service is in an ok state; if yes, terminate rule processing
- Test if the service is in an error state; if yes, increment the previously initialised status variable
- If the status variable is set to 2, write a notification to the Event Log.
- If the status is set to 29, reset the status variable to 0
- Exit Rule Processing

The above rule has the following desired outcome
- To minimise processing requirements: since this RuleSet is directly connected to a ServiceMonitor, all tests whether this is a service or not have been removed
- Only one Status variable is used per service. Correct me if I am wrong on this, but I do not see the need to have two for every service.
- Any restarts of the service have been removed. I only want to be notified.
- The notification does not happen on the first failure (Status=1), but on the second failure. This prevents unnecessary notifications for quick service restarts (as is needed, for example, when a service is upgraded)
- Subsequent notifications after the first notification, happen every 30 times the rule is executed. If the initial ServiceMonitor is set to 1 minute, then this would be every 30 minutes.
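
The counter logic described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the actual RuleSet: `make_monitor`, `check` and the constant names are invented for illustration, and the reset on recovery is taken from the requirement at the top of the thread ("if a service recovers, the whole cycle should start again") rather than from the rule list itself.

```python
from collections import defaultdict

NOTIFY_AT = 2   # notify on the second consecutive failed check
RESET_AT = 29   # reset the counter at this value, as in the rule list above


def make_monitor():
    """Return a check(service, is_running) function with one counter per service."""
    counters = defaultdict(int)  # every service starts at 0

    def check(service, is_running):
        """Process one Service Monitor result; return True if a notification is due."""
        if is_running:
            # Assumption: a recovered service restarts its cycle.
            counters[service] = 0
            return False
        counters[service] += 1
        notify = counters[service] == NOTIFY_AT
        if counters[service] == RESET_AT:
            counters[service] = 0
        return notify

    return check


# A service that stays down for 62 consecutive one-minute checks.
check = make_monitor()
fired_on = [n for n in range(1, 63) if check("Print Spooler", False)]
```

Read literally, with the increment performed before both tests, this model notifies on checks 2, 31 and 60, i.e. every 29 checks. Resetting at 30 instead of 29 would give an exact 30-check spacing in this model.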

The Problem experienced:
- Everything works perfectly except for the content of the messages written to the Event Log. This is the same for every service failure.

Troubleshooting suggestion:
- Install the previously mentioned RuleSet and connect to a ServiceMonitor that executes every few seconds.
- Run MonitorWare
- Stop one service and observe the message that is written to the Event Log; so far everything should be running ok.
- Stop another service and now observe that exactly the same message as the previous one is written to the Event Log. This happens for every additional service that is stopped.

I hope this helps to troubleshoot this last remaining problem.
Regards
Mark
mgeschke
 

Postby alorbach on Mon Dec 06, 2004 11:44 am

Hi,

This seems to be a bug in the Write NTEvent Action.
I will analyze it further; it will be fixed in the next RB version.

regards,
Andre

Postby alorbach on Thu Dec 09, 2004 5:42 pm

Hi,

check out the new RB Version ;)
http://forum.adiscon.com/topic,684.html

regards
Andre
