Monday 13 February 2012

Tuning Heartbeat alerts in SCOM 2007 R2

What is Heartbeat and how does it work?

Once a SCOM agent is deployed on a host, e.g. a Windows server, it establishes a connection with the SCOM management server (MS) and sends it a Heartbeat packet at the specified interval. The purpose of this communication is to let the SCOM MS know that the agent is alive and working, and that the Health service is up and running on the agent's side. It does not report on the health of the server itself or on the network link status. If no Heartbeat packet has been received within the time frame defined by "Number of missed Heartbeats allowed" x "Heartbeat interval (seconds)", an alert is generated to indicate that there is a problem with the agent:

Alert: Health Service Heartbeat Failure

Following this, the SCOM root management server (RMS) issues a diagnostic ping to check whether the server itself is available and responsive. The ping is a single ICMP packet, without the statistics that a regular ping with ping.exe calculates. If it fails, a second alert is generated:

Alert: Failed to Connect to Computer

This alert means that the server failed to respond to a ping, due to either network or software/hardware issues.
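The two-stage check described above can be sketched as a small decision function. This is only an illustration of the logic, not SCOM code: the function name alerts_for and its two boolean inputs are hypothetical, and only the alert names come from the text.

```python
from typing import List

# Alert names as they appear in the SCOM console (taken from the text above).
HEARTBEAT_ALERT = "Health Service Heartbeat Failure"
PING_ALERT = "Failed to Connect to Computer"


def alerts_for(heartbeat_window_expired: bool, ping_answered: bool) -> List[str]:
    """Return the alerts expected for one agent (hypothetical helper).

    heartbeat_window_expired: no Heartbeat arrived within the allowed window.
    ping_answered: the follow-up diagnostic ping got a reply.
    """
    if not heartbeat_window_expired:
        # Heartbeats are arriving, so no diagnostic ping is issued at all.
        return []
    alerts = [HEARTBEAT_ALERT]
    if not ping_answered:
        # The server itself is unreachable, not just the agent.
        alerts.append(PING_ALERT)
    return alerts


if __name__ == "__main__":
    print(alerts_for(True, True))   # agent problem, server still answers pings
    print(alerts_for(True, False))  # server unreachable
```

The case worth noting is the first one: a server whose agent is silent but which still answers the ping raises only the Heartbeat alert.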


Tuning Heartbeat

By default, the Heartbeat check runs at 60-second intervals and the SCOM MS tolerates 3 missed heartbeats. If the 4th one is missed as well, SCOM generates an alert.

The number of missed heartbeats can be overridden at the management server level, and the heartbeat interval can be overridden at the agent level.

Heartbeat monitoring can be disabled for all agents or only for specific agents, for example agents:

• That connect to the network intermittently.
• That connect to the network over poor connections or use dial-up connections.
• That run on systems that are frequently restarted.

If there are intermittent communication issues between the SCOM server(s) and agents that do not affect the end users, alerting can be suppressed by decreasing the interval to something like 15 seconds and increasing the number of missed heartbeats allowed to something like 16. This way, the total allowed time-out stays close to the default (new: 4 min 15 s vs. original: 4 min), but communication is attempted more frequently, which might reset the heartbeat failure counter before an alert is generated.
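The arithmetic behind those two totals can be sketched as follows. It assumes, as described earlier in this post, that the alert fires on the first heartbeat missed after the tolerated ones, so the window is interval x (allowed + 1). The function names are made up for illustration.

```python
def seconds_until_alert(interval_s: int, missed_allowed: int) -> int:
    """Time without heartbeats before an alert is raised.

    Assumption: the alert fires when the heartbeat *after* the tolerated
    misses is also missed, i.e. on miss number (missed_allowed + 1).
    """
    return interval_s * (missed_allowed + 1)


def as_min_sec(seconds: int) -> str:
    """Format a number of seconds as 'M min S s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} s"


if __name__ == "__main__":
    default = seconds_until_alert(interval_s=60, missed_allowed=3)
    tuned = seconds_until_alert(interval_s=15, missed_allowed=16)
    print("default:", default, "s =", as_min_sec(default))
    print("tuned:  ", tuned, "s =", as_min_sec(tuned))
```

With the tuned values the agent gets 17 chances to get a heartbeat through in roughly the same window, instead of 4.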

For more critical servers, the Heartbeat interval can be set to 10 seconds or so, which would result in more aggressive monitoring.

If these alerts generate a lot of noise after business hours, and alert priority and/or severity is used for alert filtering, it is also possible to create an override that changes the following values for the monitor Health Service Heartbeat Failure, so that its alerts are no longer marked as critical and no notifications are sent to the after-hours support staff.

Alert Priority: from High to Medium
Alert Severity: from Critical to Warning

To create the override, go to Authoring > Monitors and search for Health Service Heartbeat Failure. Expand it and, under Health Service Watcher (Agent) > Entity Health > Availability, right-click Health Service Heartbeat Failure and select Overrides > Override the Monitor > For all objects of class: Health Service Watcher (Agent).
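To illustrate why changing these two values quiets the after-hours notifications, here is a minimal sketch of a filter of the kind mentioned above. The rule itself (notify only on High priority and Critical severity) is an assumption made for the example, as are the Alert class and the pages_after_hours function; a real notification subscription may use different criteria.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    priority: str  # "Low", "Medium" or "High"
    severity: str  # "Information", "Warning" or "Critical"


def pages_after_hours(alert: Alert) -> bool:
    """Hypothetical filter: notify only on High priority AND Critical severity."""
    return alert.priority == "High" and alert.severity == "Critical"


if __name__ == "__main__":
    before = Alert("Health Service Heartbeat Failure", "High", "Critical")
    after = Alert("Health Service Heartbeat Failure", "Medium", "Warning")
    print("before override:", pages_after_hours(before))
    print("after override: ", pages_after_hours(after))
```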

Note that this might cause certain issues, as described here, when a server gets stuck loading Windows but still responds to pings: the diagnostic ping succeeds, so the downgraded Heartbeat alert is the only one raised. To address this, agent status should be checked regularly in the monitoring console, since agents that have trouble communicating with the SCOM server turn grey.


More info:
Heartbeat and Heartbeat Failure Settings in Operations Manager 2007
Health Service Heartbeat Failure, Diagnostics and Recoveries
