Thursday 22 November 2012

SCOM 2007 R2: Monitoring and restarting MS SQL backup jobs


As the monitor looks for failed SQL backup job events in Windows application log, it’s a good idea to selectively target the servers and minimise the impact on the server infrastructure. There are 2 ways to achieve this:

1. Create a Windows Simple Event Detection monitor, disable it, and create an override to enable monitoring for a specific group only.

Why targeting a computer group fails?
SCOM 2007: Target a rule or monitor to a computer group

2. Use Authoring Console to create a new class that can be used for discovering servers.

How to monitor a service with unique names across multiple computers using a wildcard?
SCOM: Monitor a custom Services on Windows server 2003

Even though the second option (create a new class) seems a better and cleaner way to go, it’s much easier to manage the membership of a group than to retarget a class (reconfigure discovery). However, I wasn’t able to get a simple recovery task to work when targeting a group through an override (the 1st option), so I proceeded with the 2nd option and created a new class.

I simulated network connectivity issues to find out what happens with an SQL backup job running on SQL 2005 when the network goes down. The following events were logged in the Windows application log:  

Event Type: Error
Event Source: SQL Server service name 
Event ID: 18210
Date: 8/11/2012
Time: 1:25:06 PM 

Event Type: Error
Event Source: SQL Server service name 
Event ID: 3041
Date: 8/11/2012
Time: 1:25:06 PM 

Event Type: Error
Event Source: SQL Server service name
Event ID: 3633
Date: 8/11/2012
Time: 1:25:27 PM 

Event Type: Error
Event Source: SQLISPackage
Event ID: 12291
Date: 8/11/2012
Time: 1:25:27 PM  

Event Type: Warning
Event Source: SQL Server Agent service name
 Event ID: 208
Date: 8/11/2012
Time: 1:25:27 PM
Description: SQL Server Scheduled Job 'Job name' …

The events are always logged in this order, so the last one logged was the warning with ID 208. I configured a monitor to look for the warning event.

If the job completes successfully, the following 2 info events are logged:

Event Type: Information
Event Source: SQL Server service name
Event ID: 18264
Description: Database backed up. Database: ...

Event Type: Information
Event Source: SQLISPackage
Event ID: 12289
Description: Package "job name" finished successfully.

These could be used to close the alert.

Create a new class and discovery, save it as a new MP and import in SCOM. Multiple discoveries can be configured in case a small number of servers needs to be discovered. Otherwise, a more generic filtering approach can be used as explained in

How to monitor a service with unique names across multiple computers using a wildcard

To target several servers by their computer names, simply create a discovery for each server and use FiltereRegistryDiscoveryProvider and the registry key “ComputerName” to discover the specific servers. Set the discovery frequency to not less than 4 hours (24 hours would probably be okay).

Then create a new monitor. I used timer reset (2 mins), but Event ID 12289 “Package "job name" finished successfully “ can be used as well.

The recovery task will run even if an alert previously raised hasn’t been resolved. I tested this by setting timer reset to 15 minutes and then manually running a backup job and simulating network outage several times within 5 minutes. The task restarted the job every time and almost instantly (as soon as the NIC was back).

Store the new monitor in the same MP used for storing the new class, in order for the class to appear in the target list.

Monitor target: your new class
Event Log: Application
Event Expression:
   Event ID: Equals => 208
   Event Source: Equals => SQL Server service name
   EventDescription: Matches wildcard => Job or Maintenance Plan name

As SQL MP alerts on failed backup jobs using the rule “Database Backup Failed To Complete” which reports on Event ID 3041, it’s probably not necessary to turn alerting on for the new monitor.

Create a new “run script” recovery task and use the VB script below. Modify the server name and the job name. Use only the first part of the job name (e.g. without ‘.subplan_1’).

I tested OSQL.exe with a “run command” task, but it didn’t work. The tool worked without problems when executed manually or through a script locally on the DB server. It’s probably possible to configure it using a batch file and a scheduled task, but I haven’t tried that.

Different ways to execute a SQL Agent job
On Error Goto 0: Main() 
Sub Main() 
   Set objSQL = CreateObject("SQLDMO.SQLServer") 
   ' Leave as trusted connection 
   objSQL.LoginSecure = True 
   ' Change to match the name of your SQL server 
   objSQL.Connect "your server name" 
   Set objJob = objSQL.JobServer 
   For each job in objJob.Jobs 
      if instr(1,job.Name,"your job name") > 0 then 
         ' msgbox job.Name 
         job.Start  
         ' msgbox "Job Started"
      end if 
   Next 
End Sub

No comments:

Post a Comment