Channel: Kevin Holman's System Center Blog

29106 event on RMS – Index was out of range. Wait. What?


Was working with a customer on this one – figured it might help others.

Saw a lot of these VERY SPECIFIC 29106 events on the RMS, specifically with the text: 

System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.

 

Here is the full event:

Event Type:      Warning
Event Source:    OpsMgr Config Service
Event Category:  None
Event ID:        29106
Date:            11/10/2009
Time:            12:43:24 PM
User:            N/A
Computer:        AGENTNAME
Description:
The request to synchronize state for OpsMgr Health Service identified by "3688d65d-a16c-2be6-7e84-5faf8a9cffe0" failed due to the following exception "System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index

What we found was – that we could look up these health service ID’s – by pasting them in the following SQL query:

select * from MTV_HealthService
where BaseManagedEntityId = '3688d65d-a16c-2be6-7e84-5faf8a9cffe0'

This would give us the name of the agent.
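If you only want the agent name, the lookup can be narrowed to the display name column. This is a minimal sketch against the OperationsManager database, assuming the MTV_HealthService view exposes a DisplayName column:

```sql
-- Resolve the health service GUID from a 29106 event to an agent name.
-- Run against the OperationsManager database; the DisplayName column is
-- assumed to be exposed by the MTV_HealthService view.
select DisplayName
from MTV_HealthService
where BaseManagedEntityId = '3688d65d-a16c-2be6-7e84-5faf8a9cffe0'
```

Paste each GUID from your 29106 events into the WHERE clause to build a list of the affected agents.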

In the console, under Agent Managed – we found all of these agents were in “Unmonitored” state – on the agents themselves, they were stuck.  They looked like they got installed, but could not get config.  We deleted them from agent managed, waited a few minutes, and let them show back up in Pending Management.  Approved them – then they were able to come back in and work properly.  These looked for the most part like orphaned machines, and several were computers that were renamed, or old DC’s that were demoted.


The new and improved guide on HealthService Restarts. Aka – agents bouncing their own HealthService


I have written many articles in the past on HealthService restarts.  A HealthService restart is when the agent breaches a pre-set threshold of Memory use, or handle count use, and OpsMgr bounces the agent HealthService to try and correct the condition.

The Past:

Here are a few of the previous articles:

http://blogs.technet.com/kevinholman/archive/2009/03/26/are-your-agents-restarting-every-10-minutes-are-you-sure.aspx

http://blogs.technet.com/kevinholman/archive/2009/06/22/health-service-and-monitoringhost-thresholds-in-r2-how-this-has-changed-and-what-you-should-know.aspx

 

Generally – this is a good thing.  We expect the agent to consume a limited amount of system resources, and if this is ever breached, we assume something is wrong, so we bounce the agent.  The problem is that if an agent NEEDS more resources to do its job – it can get stuck in a bouncing loop every 10-12 minutes, which means there is very little monitoring of that agent going on.  It also can harm the OpsMgr environment, because if this is happening on a large scale, we flood the OpsMgr database with state change events.  You will also see the agent consume a LOT of CPU resources during the startup cycle – because each monitor has to initialize its state at startup, and all discoveries without a specific synch time will run at startup.

 

However, sometimes it is NORMAL for the agent to consume additional resources.  (within reason)

The limits at OpsMgr 2007 RTM were set to 100MB of private bytes, and 2000 handles.  This was enough for the majority of agents out there.  Not all though, especially since the release of the Server 2008 OS, and the use of 64-bit operating systems.  Many server roles require additional memory, because they run very large discovery scripts, or discover a very large instance space.  Like DNS servers, because they discover and monitor so many DNS zones.  DHCP servers, because they discover and monitor so many scopes.  Domain controllers, because they can potentially run a lot of monitoring scripts and discover many AD objects.  SQL servers, because they discover and monitor multiple DB engines, and databases.  Exchange 2007 servers, etc…

 

What’s new:

At the time of this writing, two new management pack updates have been released.  One for SP1, and one for R2.  EVERY customer should be running these MP updates.  I consider them critical to a healthy environment:

R2 MP Update version 6.1.7533.0

SP1 MP Update version 6.0.6709.0

What these MP updates do – is synchronize both versions of OpsMgr to work exactly the same – and bump up the resource threshold levels to a more typical amount.  So FIRST – get these imported if you don't have them.  Yes, now.  This alone will solve the majority of HealthService restarts in the wild.  These set the Private Bytes threshold to 300MB (up from 100MB), and the Handle Count threshold to 6000 (up from 2000) for all agents.  This is a MUCH better default setting than we had previously.

 

How can I make it better?

I’m glad you asked!  Well, there are two things you can do, to enhance your monitoring of this very serious condition. 

  1. Add alerting to a HealthService restart so you can detect this condition while it still exists.
  2. Override these monitors to higher thresholds for specific agents/groups.

Go to the Monitoring pane, Discovered Inventory, and change target type to “Agent”. 

Select any agent present – and open Health Explorer.

Expand Performance > Health Service Performance > Health Service State.

image

 

This is an aggregate rollup monitor.  If you look at the properties of this top level monitor – you will see the recovery script to bounce the HealthService is on THIS monitor…. it will run in response to ANY of the 4 monitors below it which might turn Unhealthy.

 

image

 

So – we DON'T want to set this monitor to also create the alerts.  Because – this monitor can only tell us that “something” was beyond the threshold.  We actually need to set up alerting on EACH of the 4 monitors below it – so we will know if it is a problem with the HealthService or MonitoringHost, and whether it is memory (private bytes) or handle count.

First thing – inspect the overrides on each monitor, to make sure you haven't already adjusted this in the past.  ANY specific overrides LESS than the new defaults of 300MB and 6000 handles should be deleted.  (The Exchange MP has a sealed override of 5000 handles and this is fine)

What I like to do – is to add an override, “For all objects of Class”.  Enable “Generates Alert”.  I also ensure that the default value for “Auto-Resolve Alert” is set to False.  It is critical that auto-resolve is not set to True for this monitor, because we would just close the alert on every agent restart and the alert would be worthless.  What this will do – is generate an alert and never close it, anytime this monitor is unhealthy.  I need to know this information so I can be aware of very specific agents that might require a higher value:

image

 

Repeat this for all 4 monitors.

 

One thing to keep in mind – if you ever need to adjust this threshold for specific agents that are still restarting – 600MB of private bytes (double the default) is generally a good setting.  It is rare to need more than this – unless you have a very specific MP or application that guides you to set this higher for a specific group of agents.

Also – be careful overriding this value across the board… because Management Servers also have a “HealthService” and you could inadvertently set this to be too low for them.  Generally – the default settings are very good now – and you should only be changing this for very specific agents, or a very specific group of agents.

Now – you can use these alerts to find any problem agents out there.  I really strongly recommend setting this up for any management group out there.  You NEED to know when agents are restarting on their own.

Tuning tip: Do you have monitors constantly “flip flopping” ?


 

This is something I see in almost all clients when we perform a PFE Health Check.  The customer will have lots of data being inserted into the OpsDB from agents, about monitors that are constantly changing state.  This can have a very negative effect on overall performance of the database – because it can be a lot of data, and the RMS is busy handling the state calculation, and synching this data about the state and any alert changes to the warehouse.

Many times the OpsMgr admin has no idea this is happening, because the alerts appear, and then auto-resolve so fast, you never see them – or don’t see them long enough to detect there is a problem.  I have seen databases where the statechangeevent table was the largest in the database – caused by these issues.

 

Too many state changes are generally caused by one, or both, of two issues:

1.  Badly written monitors that flip flop constantly.  Normally – this happens when you target a multi-instance perf counter incorrectly.  See my POST on this topic for more information.

2.  HealthService restarts.  See my POST on this topic.

 

How can I detect if this is happening in my environment?

 

That is the right question!  For now – you can run a handful of SQL queries, which will show you the most common state changes going on in your environments.  These are listed on my SQL query blog page in the State section:

 

Noisiest monitors in the database: (Note – these will include old state changes – might not be current)

select distinct top 50 count(sce.StateId) as NumStateChanges, m.MonitorName, mt.typename AS TargetClass
from StateChangeEvent sce with (nolock)
join state s with (nolock) on sce.StateId = s.StateId
join monitor m with (nolock) on s.MonitorId = m.MonitorId
join managedtype mt with (nolock) on m.TargetManagedEntityType = mt.ManagedTypeId
where m.IsUnitMonitor = 1
group by m.MonitorName,mt.typename
order by NumStateChanges desc

 

The above query will show us which monitors are flipping the most in the entire database.  This includes recent, and OLD data.  You have to be careful looking at this output – as you might spend a lot of time focusing on a monitor that had a problem long ago.  You see – we will only groom out old state changes for monitors that are CURRENTLY in a HEALTHY state, AT THE TIME that grooming runs.  We will not groom old state change events if the monitor is Disabled (unmonitored), in Maintenance Mode, Warning State, or Critical State.

What?

This means that if you had a major issue with a monitor in the past, and you solved it by disabling the monitor, we will NEVER, EVER groom that junk out.  This doesn't really pose a problem, it just leaves a little database bloat, and messy statechangeevent views in HealthExplorer.  But the real issue for me is – it makes it a bit tougher to only look at the problem monitors NOW. 

To see if you have really old state change data leftover in your database, you can run the following query:

SELECT DATEDIFF(d, MIN(TimeAdded), GETDATE()) AS [Current] FROM statechangeevent

You might find you have a couple YEARS worth of old state data.
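To see how that old data is distributed over time, a per-day count of rows in the same table can help. This is a sketch, reusing the (nolock) hint the queries above already use:

```sql
-- Count state change events inserted per day, newest day first.
-- CONVERT style 102 renders TimeAdded as yyyy.mm.dd for grouping.
select CONVERT(varchar(10), TimeAdded, 102) AS DayAdded,
       count(*) AS NumStateChanges
from StateChangeEvent with (nolock)
group by CONVERT(varchar(10), TimeAdded, 102)
order by DayAdded desc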

So – I have taken the built in grooming stored procedure, and modified the statement to groom out ALL statechange data, and only keep the number of days you have set in the UI.  (The default setting is 7 days).  I like to run this “cleanup” script from time to time, to clear out the old data, and whenever I am troubleshooting current issues with monitor flip-flop.  Here is the SQL query statement:

 

The statement below cleans up old StateChangeEvent data for state changes that are older than the defined grooming period, even for monitors currently in a disabled, warning, or critical state.  (By default, we only groom monitor statechangeevents where the monitor is enabled and healthy at the time of grooming.)

USE [OperationsManager]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
BEGIN

    SET NOCOUNT ON

    DECLARE @Err int
    DECLARE @Ret int
    DECLARE @DaysToKeep tinyint
    DECLARE @GroomingThresholdLocal datetime
    DECLARE @GroomingThresholdUTC datetime
    DECLARE @TimeGroomingRan datetime
    DECLARE @MaxTimeGroomed datetime
    DECLARE @RowCount int
    SET @TimeGroomingRan = getutcdate()

    SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())
    FROM dbo.PartitionAndGroomingSettings
    WHERE ObjectName = 'StateChangeEvent'

    EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
    SET @Err = @@ERROR

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    SET @RowCount = 1  

    -- This is to update the settings table
    -- with the max groomed data
    SELECT @MaxTimeGroomed = MAX(TimeGenerated)
    FROM dbo.StateChangeEvent
    WHERE TimeGenerated < @GroomingThresholdUTC

    IF @MaxTimeGroomed IS NULL
        GOTO Success_Exit

    -- Instead of the FK DELETE CASCADE handling the deletion of the rows from
    -- the MJS table, do it explicitly. Performance is much better this way.
    DELETE MJS
    FROM dbo.MonitoringJobStatus MJS
    JOIN dbo.StateChangeEvent SCE
        ON SCE.StateChangeEventId = MJS.StateChangeEventId
    JOIN dbo.State S WITH(NOLOCK)
        ON SCE.[StateId] = S.[StateId]
    WHERE SCE.TimeGenerated < @GroomingThresholdUTC
    AND S.[HealthState] in (0,1,2,3)

    SELECT @Err = @@ERROR
    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    WHILE (@RowCount > 0)
    BEGIN
        -- Delete StateChangeEvents that are older than @GroomingThresholdUTC
        -- We are doing this in chunks in separate transactions on
        -- purpose: to avoid the transaction log to grow too large.
        DELETE TOP (10000) SCE
        FROM dbo.StateChangeEvent SCE
        JOIN dbo.State S WITH(NOLOCK)
            ON SCE.[StateId] = S.[StateId]
        WHERE TimeGenerated < @GroomingThresholdUTC
        AND S.[HealthState] in (0,1,2,3)

        SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

        IF (@Err <> 0)
        BEGIN
            GOTO Error_Exit
        END
    END   

    UPDATE dbo.PartitionAndGroomingSettings
    SET GroomingRunTime = @TimeGroomingRan,
        DataGroomedMaxTime = @MaxTimeGroomed
    WHERE ObjectName = 'StateChangeEvent'

    SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END 
Success_Exit:
Error_Exit:   
END

 

Once this is cleaned up – you can re-run the DATEDIFF query – and verify that you only have the same number of days as set in your UI retention setting for database grooming.

Now – you can run the “Most common state changes” query – and identify which monitors are causing the problem.
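That query is not reproduced in this post; a reasonable sketch is the “noisiest monitors” query above with a time filter added, so only recent state changes (here, the last 7 days) are counted:

```sql
-- Noisiest monitors, restricted to state changes from the last 7 days.
select distinct top 50 count(sce.StateId) as NumStateChanges,
       m.MonitorName, mt.typename AS TargetClass
from StateChangeEvent sce with (nolock)
join state s with (nolock) on sce.StateId = s.StateId
join monitor m with (nolock) on s.MonitorId = m.MonitorId
join managedtype mt with (nolock) on m.TargetManagedEntityType = mt.ManagedTypeId
where m.IsUnitMonitor = 1
  -- only look at recent activity
  and sce.TimeGenerated > dateadd(dd, -7, getutcdate())
group by m.MonitorName, mt.typename
order by NumStateChanges desc
```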

 

Look for monitors at the top with MUCH higher numbers than all others.  This will be “monitor flip flop” and you should use Health Explorer to find that monitor on a few instances – and figure out why it is changing state so much in the past few days.  Common conditions for this one are badly written monitors that target a single instance object, but monitor a multi-instance perf counter.  You can read more on that HERE.  Also – just poor overall tuning can cause this – or poorly written custom script based monitors.

If you see a LOT of similar monitors at the top, with very similar state change counts, this is often indicative of HealthService restarts.  The Health service will submit new state change data every time it starts up.  So if the agent is bouncing every 10 minutes, that is a new state change for ALL monitors on that agent, every 10 minutes.  You can read more about this condition at THIS blog post.

How to get your agents back to “Remotely Manageable” in OpsMgr 2007 R2


 

You may notice that there are actions you might want to take on an agent, that are grayed out and not available in the console.

These actions might include:

  • Change Primary Management Server
  • Repair
  • Uninstall

See the below image for an example:

image

 

This is caused by a flag in the database, which has marked that particular agent as “Not Remotely Manageable”… or “IsManuallyInstalled”.

 

In order to better see this setting in the UI – you need to personalize the “Agent Managed” view.  Right click in the header bar at the top of the “Agent Managed” view, near where it says “Health State” and choose “Personalize View”

image

 

In the view options – add the “Remotely Manageable” column:

 

image

 

Now – you can sort by this column and easily find the agents that you have no control over in the console:

 

image

 

***Another thing to note – if the “Remotely Manageable” flag is set to “No”… we will NOT put those agents into “Pending Management” for a hotfix (when a SCOM hotfix that should also be delivered to agents is applied to a management server).  This is by design.

 

 

Now…. the question is – WHY are there systems with this flag set to NO?

These MIGHT be unavailable to you for a very good reason….  Basically – for ANY agent that was manually installed – and you ever had to “Approve” the agent – we will set Remotely Manageable to No, by design.  The thought process behind this, is that if an agent is manually installed…. we assume it is that way for a reason, and we don't want to *break* anything by controlling it from the UI moving forward.

Here are some examples of manually installed agents that should NOT be controlled in the UI:

  • AD integrated agents.  If you are using Active Directory integration to assign an agent to specific management servers – you don't want to ever break this by changing the management server manually, or running a repair, as this will break AD integration.
  • Agents behind a firewall, that cannot be repaired… or that only have ports opened to specific management servers.  If you had multiple management servers, and only allowed a specific agent access to one of them in a firewall – if you manually changed the MS you could orphan the agent.

Now– for most customers I work with – the two issues above don't apply.  If they do – then DON’T change the Remotely Manageable flag!

 

However – for many customers, the issues above do not apply…. and they end up with a large number of agents that get this flag inadvertently set to “No”.  They do not desire this behavior.  Here is what can happen to set this flag to No… and when this will be undesirable:

  • Sometimes you will be troubleshooting a (previously) push installed agent – but will delete the agent under “Agent Managed”… and let it re-detect, and then approve it.  SCOM will now treat that agent as manually installed and flag it as such in the database/console.
  • Sometimes you will have a troublesome agent that will not push deploy for some reason, so you manually install/approve a handful of those.
  • Sometimes you are having issues getting an agent to work, and in the troubleshooting process, you manually uninstall/reinstall/approve the agent as a quick fix.

In these cases…. we really need a way to “force” this Remotely Manageable flag back to “Yes” when we understand the issue, and know why it got flagged as “No”….. but desire future ability to repair, uninstall, change MS, and put into pending actions for a hotfix down the road.

 

Unfortunately– the only way to do that today is via a database edit.  However, it is relatively simple to do.

 

Below are the queries to better understand this, and modify your agents.  Remember – DON’T do this IF you are using AD integration or have agents that you require to be left alone.

 

Here is a query just to SEE Agents which are set as manually installed:

select bme.DisplayName from MT_HealthService mths
INNER JOIN BaseManagedEntity bme on bme.BaseManagedEntityId = mths.BaseManagedEntityId
where IsManuallyInstalled = 1

 

Here is a query that will set ALL agents back to Remotely Manageable:

UPDATE MT_HealthService
SET IsManuallyInstalled=0
WHERE IsManuallyInstalled=1

 

Now – the above query will set ALL agents back to “Remotely Manageable = Yes” in the console.  If you want to control it agent by agent – you need to specify it by name here:

UPDATE MT_HealthService
SET IsManuallyInstalled=0
WHERE IsManuallyInstalled=1
AND BaseManagedEntityId IN
(select BaseManagedEntityID from BaseManagedEntity
where BaseManagedTypeId = 'AB4C891F-3359-3FB6-0704-075FBFE36710'
AND DisplayName = 'agentname.domain.com')
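Before running either UPDATE, it can be worth previewing the exact row that will change. This is a sketch using the same tables as above (the agent name is a placeholder):

```sql
-- Preview the flag for one agent before flipping it.
select bme.DisplayName, mths.IsManuallyInstalled
from MT_HealthService mths
INNER JOIN BaseManagedEntity bme on bme.BaseManagedEntityId = mths.BaseManagedEntityId
where bme.DisplayName = 'agentname.domain.com'
```

If IsManuallyInstalled comes back as 1, the UPDATE will flip it to 0, and the console will show the agent as Remotely Manageable again.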

 

So – I change my agents back to Remotely Manageable…. and refresh my console Agent Managed View…. and voila!  I can now modify the MS, repair, uninstall, etc:

 

image

System Center Universe is coming – January 19th!


 

REGISTER NOW HERE:  http://www.systemcenteruniverse.com/

image

 

Read Cameron Fuller’s blog post on this here:  http://blogs.catapultsystems.com/cfuller/archive/2015/12/17/scuniverse-returns-to-dallas-tx-and-the-world-on-january-19th-2016/

 

 

SCU is an awesome day of sessions covering Microsoft System Center, Windows Server, and Azure technologies from top speakers including Microsoft experts and MVP’s in the field.

There are two tracks depending on your interests – Cloud and Datacenter Management, and Enterprise Client Management.

The sponsors for 2016 include:

  • Catapult Systems
  • Microsoft
  • Veeam
  • Adaptiva
  • Secunia
  • Heat Software
  • MPx Alliance
  • Squared Up
  • Cireson

If you cannot attend in person – you can still attend via simulcast!  If you want to attend virtually, there are user group based simulcast locations around the world. Registration is available at: http://www.systemcenteruniverse.com/venue.htm

Simulcast event locations include:

  • Austin, TX
  • Denver, CO
  • Houston, TX
  • Omaha, NE
  • Phoenix, AZ
  • San Antonio, TX
  • Seattle, WA
  • Tampa, FL
  • Amsterdam
  • Germany
  • Vienna
  • And of course our event location in Dallas, TX!

If you want to attend the in-person event, it is available in Dallas, Texas, and registration is available at: https://www.eventbrite.com/e/scu-2016-live-tickets-7970023555

UR8 for SCOM 2012 R2 – Step by Step


 

image

 

NOTE:  I get this question every time we release an update rollup:   ALL SCOM Update Rollups are CUMULATIVE.  This means you do not need to apply them in order, you can always just apply the latest update.  If you have deployed SCOM 2012 R2 and never applied an update rollup – you can go straight to the latest one available.  If you applied an older one (such as UR3) you can always go straight to the latest one!

 

 

KB Article for OpsMgr:  https://support.microsoft.com/en-us/kb/3096382

KB Article for all System Center components:  https://support.microsoft.com/en-us/kb/3096378

Download catalog site:  http://catalog.update.microsoft.com/v7/site/Search.aspx?q=3096382

 

Key fixes:

  • Slow load of alert view when it is opened by an operator
    Sometimes when the operators change between alert views, the views take up to two minutes to load. After this update rollup is installed, the reported performance issue is eradicated. The Alert View load for the Operator role is now almost the same as that for the Admin role user.
  • SCOMpercentageCPUTimeCounter.vbs causes enterprise wide performance issue
    Health Service encountered slow performance every five to six (5-6) minutes in a cyclical manner. This update rollup resolves this issue.
  • System Center Operations Manager Event ID 33333 Message: The statement has been terminated.
    This change filters out "statement has been terminated" warnings that SQL Server throws. These warning messages cannot be acted on. Therefore, they are removed.
  • System Center 2012 R2 Operations Manager: Report event 21404 occurs with error '0x80070057' after Update Rollup 3 or Update Rollup 4 is applied.
    In Update Rollup 3, a design change was made in the agent code that regressed and caused SCOM agent to report error ‘0x80070057’ and MonitoringHost.exe to stop responding/crash in some scenarios. This update rollup rolls back that UR3 change.
  • SDK service crashes because of Callback exceptions from event handlers being NULL
    In a connected management group environment in certain race condition scenarios, the SDK of the local management group crashes if there are issues during the connection to the different management groups. After this update rollup is installed, the SDK of the local management group should no longer crash.
  • Run As Account(s) Expiring Soon — Alert does not raise early enough
    The 14-day warning for the RunAs account expiration was not visible in the SCOM console. Customers received only an Error event in the console three days before the account expiration. After this update rollup is installed, customers will receive a warning in their SCOM console 14 days before the RunAs account expiration, and receive an Error event three (3) days before the RunAs account expiration.
  • Network Device Certification
    As part of Network device certification, we have certified the following additional devices in Operations Manager to make extended monitoring available for them:
    • Cisco ASA5515
    • Cisco ASA5525
    • Cisco ASA5545
    • Cisco IPS 4345
    • Cisco Nexus 3172PQ
    • Cisco ASA5515-IPS
    • Cisco ASA5545-IPS
    • F5 Networks BIG-IP 2000
    • Dell S4048
    • Dell S3048
    • Cisco ASA5515sc
    • Cisco ASA5545sc
  • French translation of APM abbreviation is misleading
    The French translation of “System Center Management APM service” is misleading. APM abbreviation is translated incorrectly in the French version of Microsoft System Center 2012 R2 Operations Manager. APM means “Application Performance Monitoring” but is translated as “Advanced Power Management." This fix corrects the translation.
  • p_HealthServiceRouteForTaskByManagedEntityId does not account for deleted resource pool members in System Center 2012 R2 Operations Manager
    If customers use Resource Pools and take some servers out of the pool, discovery tasks start failing in some scenarios. After this update rollup is installed, these issues are resolved.
  • Exception in the 'Managed Computer' view when you select Properties of a managed server in Operations Manager Console
    In the Operations Manager Server “Managed Computer” view on the Administrator tab, clicking the “Properties” button of a management server causes an error. After this update rollup is installed, a dialog box that contains a “Heart Beat” tab is displayed.
  • Duplicate entries for devices when network discovery runs
    When customers run discovery tasks to discover network devices, duplicate network devices that have alternative MAC addresses are discovered in some scenarios. After this update rollup is installed, customers will not receive any duplicate devices discovered in their environments.
  • Preferred Partner Program in Administration Pane
    This update lets customers view certified System Center Operations Manager partner solutions directly from the console. Customers can obtain an overview of the partner solutions and visit the partner websites to download and install the solutions.
There are no updates for Linux, and there are no updated MP’s for Linux in this update.

 

Lets get started.

From reading the KB article – the order of operations is:

  1. Install the update rollup package on the following server infrastructure:
    • Management servers
    • Gateway servers
    • Web console server role computers
    • Operations console role computers
  2. Apply SQL scripts.
  3. Manually import the management packs.
  4. Update Agents

Now, NORMALLY we would need to add another step – if we are using Xplat monitoring, we need to update the Linux/UNIX MP’s and agents.  However, in UR8 for SCOM 2012 R2, there are no updates for Linux.

 

 

 

1.  Management Servers

image

Since there is no RMS anymore, it doesn’t matter which management server I start with.  There is no need to begin with the server that holds the RMS Emulator role.  I simply make sure I only patch one management server at a time to allow for agent failover without overloading any single management server.

I can apply this update manually via the MSP files, or I can use Windows Update.  I have 3 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

The first thing I do when I download the updates from the catalog, is copy the cab files for my language to a single location:

image

Then extract the contents:

image

Once I have the MSP files, I am ready to start applying the update to each server by role.

***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.

My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

image

This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update usually does not provide any feedback that it had success or failure. 

I got a prompt to restart:

image

I choose yes and allow the server to restart to complete the update.

 

You can check the application log for the MsiInstaller events to show completion:

Log Name:      Application
Source:        MsiInstaller
Event ID:      1036
Level:         Information
Computer:      SCOM01.opsmgr.net
Description:
Windows Installer installed an update. Product Name: System Center Operations Manager 2012 Server. Product Version: 7.1.10226.0. Product Language: 1033. Manufacturer: Microsoft Corporation. Update Name: System Center 2012 R2 Operations Manager UR8 Update Patch. Installation success or error status: 0.

You can also spot check a couple DLL files for the file version attribute. 

image

Next up – run the Web Console update:

image

This runs much faster.   A quick file spot check:

image

Lastly – install the console update (make sure your console is closed):

image

A quick file spot check:

image

 

 

Secondary Management Servers:

image

I now move on to my secondary management servers, applying the server update, then the console update. 

On this next management server, I will use the example of Windows Update as opposed to manually installing the MSP files.  I check online, and make sure that I have configured Windows Update to give me updates for additional products: 

Apparently when I tried this – the catalog was broken – because none of the system center stuff was showing up in Windows Updates.

So….. because of this – I elect to do manual updates like I did above.

I apply these updates, and reboot each management server, until all management servers are updated.

 

 

 

Updating Gateways:

image

I can use Windows Update or manual installation.

image

The update launches a UI and quickly finishes.

Then I will spot check the DLL’s:

image

I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:

image

 

 

 

2. Apply the SQL Scripts

In the path on your management servers, where you installed/extracted the update, there are two SQL script files: 

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

image

First – let’s run the script to update the OperationsManager database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file.  Make sure it is pointing to your OperationsManager database, then execute the script.

You should run this script with each UR, even if you ran it for a previous UR.  The script body can change, so as a best practice, always re-run it.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.  I have had customers state this takes from a few minutes to as long as an hour. In MOST cases – you will need to shut down the SDK, Config, and Monitoring Agent (healthservice) on ALL your management servers in order for this to be able to run with success.

You will see the following (or similar) output:

image47

or

image

IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a production environment, you will almost certainly have to shut down the services (SDK, Config, and HealthService) on your management servers, to break their connection to the databases, to get a successful run.
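If you want to see which connections are still holding the database open before stopping services, the sketch below lists user sessions connected to the OperationsManager database.  This is an assumption-laden example – it relies on the database_id column of sys.dm_exec_sessions (SQL Server 2012 and later) – so verify it in your environment:

```sql
-- Sketch: list user sessions still connected to the OperationsManager DB.
-- The SDK, Config, and HealthService connections from the management
-- servers are the usual blockers of the UR script.
-- Assumes SQL Server 2012+ (database_id on sys.dm_exec_sessions).
SELECT session_id, host_name, program_name, login_name
FROM sys.dm_exec_sessions
WHERE database_id = DB_ID('OperationsManager')
  AND is_user_process = 1;
```

Stop the services on the hosts returned in host_name, confirm the sessions are gone, and then re-run the script.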

Technical tidbit:   Even if you previously ran this script in UR1, UR2, UR3, UR4, UR5, UR6, or UR7, you should run this again for UR8, as the script body can change with updated UR’s.

image

Next, we have a script to run against the warehouse DB.  Do not skip this step under any circumstances.    From:

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

Open a SQL management studio query window, connect it to your OperationsManagerDW database, and then open the script file UR_Datawarehouse.sql.  Make sure it is pointing to your OperationsManagerDW database, then execute the script.

If you see a warning about line endings, choose Yes to continue.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

You will see the following (or similar) output:

image

 

 

 

3. Manually import the management packs

image

There are 26 management packs in this update!

The path for these is on your management server, after you have installed the “Server” update:

\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups

However, the majority of them are Advisor/OMS, and language specific.  Only import the ones you need, and that are correct for your language.  I will remove all the Advisor MP’s for other languages, and I am left with the following:

image

The TFS MP bundles are only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

The Advisor MP’s are only needed if you are using Microsoft Operations Management Suite cloud service, (Previously known as Advisor, and Operation Insights).

However, the Image and Visualization libraries deal with Dashboard updates, and these always need to be updated.

I import all of these shown without issue.

 

 

4.  Update Agents

image43_thumb

Agents should be placed into pending actions by this update (mine worked great) for any agent that was not manually installed (remotely manageable = yes):

image

If your agents are not placed into pending management – this is generally caused by not running the update from an elevated command prompt, or having manually installed agents which will not be placed into pending.

In this case – the agents reporting to a management server that was updated using Windows Update did NOT get placed into pending.  Only the agents reporting to the management server where I manually executed the patch did.

You can approve these – which will result in a success message once complete:

image

Soon you should start to see the PatchList column getting filled in, under the Agents By Version view in the Operations Manager monitoring folder in the console:

image
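If you prefer to query the PatchList directly rather than wait on the console view, a simple query against the OperationsManager database works.  This is a sketch using the same MTV_HealthService view from my earlier posts; the Version and PatchList column names are assumptions from my lab, so verify them in yours:

```sql
-- Sketch: show agent version and applied update rollup list per agent.
-- Assumes Version and PatchList columns on MTV_HealthService (2012 R2).
SELECT DisplayName, Version, PatchList
FROM MTV_HealthService
ORDER BY DisplayName;
```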

 

 

 

5.  Update Unix/Linux MPs and Agents

image

There are no updates for Linux in UR8.  If you are not updating from UR7 directly, please see the UR7 instructions for the Linux MP and agent updates:

http://blogs.technet.com/b/kevinholman/archive/2015/08/17/ur7-for-scom-2012-r2-step-by-step.aspx

 

 

6.  Update the remaining deployed consoles

image

This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, on my SCVMM server, on my personal workstation, on the workstations of all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the matching update version.

 

 

 

Review:

Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

image

Known issues:

See the existing list of known issues documented in the KB article.

1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop all the SCOM services on the management servers, and/or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

——————————————————
(1 row(s) affected)
(1 row(s) affected)
Msg 1205, Level 13, State 56, Line 1
Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Msg 3727, Level 16, State 0, Line 1
Could not drop constraint. See previous errors.
——————————————————–

Writing a service recovery script – Cluster service example


 

I had a customer request the ability to monitor the cluster service on clusters, and ONLY alert when a recovery attempt failed.

This is a fairly standard request for service monitoring when we use recoveries – we generally don’t want an alert to be generated from the Service Monitor, because that will be immediate upon service down detection.  We want the service monitor to detect the service down, then run a recovery, and then if the recovery fails to restore service, generate an alert.

Here is an example of that.

The cluster service monitor is unique in that it already has a built-in recovery.  However, it is too simple for our needs, as it only runs NET START.

image

 

So the first thing we need to do is create an override disabling this built-in recovery:

image

 

Next – override the “Cluster service status” monitor to not generate alerts:

image

 

Now we can add our own script base recovery to the monitor:

image

 

image

 

And paste in a script which I will provide below.  Here is the script:

'==========================================================================
'
' COMMENT: This is a recovery script to recovery the Cluster Service
'
'==========================================================================
Option Explicit
SetLocale("en-us")

Dim StartTime,EndTime,sTime
'Capture script start time
StartTime = Now 'Time that the script starts so that we can see how long it has been watching to see if the service stops again.
Dim strTime
strTime = Time

Dim oAPI
Set oAPI = CreateObject("MOM.ScriptAPI")
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3750,0,"Service Recovery script is starting")

Dim strComputer, strService, strStartMode, strState, objCount, strClusterService
'The script will always be run on the machine that generated the monitor error
strComputer = "."
strClusterService = "ClusSvc"

'Record the current state of each service before recovery in an event
Dim strClusterServicestate
ServiceState(strClusterService)
strClusterServicestate = strState
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3751,0,"Current service state before recovery is: " & strClusterService & " : " & strClusterServicestate)

'Stop script if all services are running
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3752,2,"All services were found to be already running, recovery should not run, ending script")
  Wscript.Quit
End If

'Check to see if a specific event has been logged previously that means this recovery script should NOT run if event is present
'This section optional and not commonly used
Dim dtmStartDate, iCount, colEvents, objWMIService, objEvent
' Const CONVERT_TO_LOCAL_TIME = True
' Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
' dtmStartDate.SetVarDate dateadd("n", -60, now)' CONVERT_TO_LOCAL_TIME
'
' iCount = 0
' Set objWMIService = GetObject("winmgmts:" _
'   & "{impersonationLevel=impersonate,(Security)}!\\" _
'   & strComputer & "\root\cimv2")
' Set colEvents = objWMIService.ExecQuery _
'   ("Select * from Win32_NTLogEvent Where Logfile = 'Application' and " _
'   & "TimeWritten > '" & dtmStartDate & "' and EventCode = 100")
' For Each objEvent In colEvents
'   iCount = iCount+1
' Next
' If iCount => 1 Then
'   EndTime = Now
'   sTime = DateDiff("s", StartTime, EndTime)
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3761,2,"script found event which blocks execution of this recovery. Recovery will not run. Script ending after " & sTime & " seconds")
'   WScript.Quit
' ElseIf iCount < 1 Then
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3762,0,"script did not find any blocking events. Script will continue")
' End If

'At least one service is stopped to cause this recovery, stopping all three services so we can start them in order
'You would only use this section if you had multiple services and they needed to be started in a specific order
' Call oAPI.LogScriptEvent("ServiceRecovery.vbs",3753,0,"At least one service was found not running. Recovery will run. Attempting to stop all services now")
' ServiceStop(strService1)
' ServiceStop(strService2)
' ServiceStop(strService3)

'Check to make sure all services are actually in stopped state
' Optional Wait 15 seconds for slow services to stop
' Wscript.Sleep 15000
ServiceState(strClusterService)
strClusterServicestate = strState

'Stop script if all services are not stopped
If (strClusterServicestate <> "Stopped") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3754,2,"Recovery script found service is not in stopped state. Manual intervention is required, ending script. Current service state is: " & strClusterService & " : " & strClusterServicestate)
  Wscript.Quit
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3755,0,"Recovery script verified all services in stopped state. Continuing.")
End If

'Start services in order.
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3756,0,"Attempting to start all services")
Dim errReturn
'Restart Services and watch to see if the command executed without error
ServiceStart(strClusterService)
Wscript.sleep 5000

'Check service state to ensure all services started
ServiceState(strClusterService)
strClusterServicestate = strState

'Log success or fail of recovery
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3757,0,"All services were successfully started and then found to be running")
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3758,2,"Recovery script failed to start all services. Manual intervention is required. Current service state is: " & strClusterService & " : " & strClusterServicestate)
End If

'Check to see if this recovery script has been run three times in the last 60 minutes for loop detection
Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
dtmStartDate.SetVarDate dateadd("n", -60, now)' CONVERT_TO_LOCAL_TIME
iCount = 0
Set objWMIService = GetObject("winmgmts:" _
  & "{impersonationLevel=impersonate,(Security)}!\\" _
  & strComputer & "\root\cimv2")
Set colEvents = objWMIService.ExecQuery _
  ("Select * from Win32_NTLogEvent Where Logfile = 'Operations Manager' and " _
  & "TimeWritten > '" & dtmStartDate & "' and EventCode = 3750")
For Each objEvent In colEvents
  iCount = iCount+1
Next
If iCount => 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3759,2,"script restarted " & strClusterService & " service 3 or more times in the last hour, script ending after " & sTime & " seconds")
  WScript.Quit
ElseIf iCount < 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3760,0,"script restarted " & strClusterService & " service less than 3 times in the last hour, script ending after " & sTime & " seconds")
End If
Wscript.Quit

'==================================================================================
' Subroutine: ServiceState
' Purpose: Gets the service state and startmode from WMI
'==================================================================================
Sub ServiceState(strService)
  Dim objWMIService, colRunningServices, objService
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colRunningServices = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name = '"& strService & "'")
  For Each objService in colRunningServices
    strState = objService.State
    strStartMode = objService.StartMode
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStart
' Purpose: Starts a service
'==================================================================================
Sub ServiceStart(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StartService()
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStop
' Purpose: Stops a service
'==================================================================================
Sub ServiceStop(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StopService()
  Next
End Sub

 

Here it is inserted into the UI.  I provide a 3 minute timeout for this one:

 

image

 

Here is how it will look once added:

image

 

Now – we need to generate an alert when the script detects that it failed to start the service:

image

 

Provide a name and we will target the same class as the service monitor:

image

 

For the expression – the ID comes from the event generated by the recovery script, and the string search makes sure we are only alerting on a Cluster service recovery; if we reuse the script for other services, we need to be able to distinguish between them:

image

 

 

Let’s test!

If we simply stop the Cluster Service – the recovery kicks in, and we see evidence in the state changes and the event log:

 

image

 

I like REALLY verbose logging in the scripts I write…. more is MUCH better than less, especially when troubleshooting, and recoveries should not run often enough to clog up the logs.

image

image

image

image

 

image

image

 

 

If the recovery fails to start the service – the script detects this – drops a very specific event, and then an alert is generated for the service being down and manual intervention required:

 

image

 

image

 

 

There we have it – we only get alerts if the service is not recoverable.  This makes SCOM more actionable.  If we want a record of this for reporting – we can collect the events for recovery starting, and then report on those events.

You can download this example MP at:

https://gallery.technet.microsoft.com/Cluster-Service-Recovery-270ca2cd

UR9 for SCOM 2012 R2 – Step by Step


 

 

 

image48

 

NOTE:  I get this question every time we release an update rollup:   ALL SCOM Update Rollups are CUMULATIVE.  This means you do not need to apply them in order; you can always just apply the latest update.  If you have deployed SCOM 2012 R2 and never applied an update rollup – you can go straight to the latest one available.  If you applied an older one (such as UR3) you can always go straight to the latest one!

 

 

KB Article for OpsMgr:  https://support.microsoft.com/en-us/kb/3129774

Download catalog site:  http://catalog.update.microsoft.com/v7/site/Search.aspx?q=3129774

 

Key fixes:

  • SharePoint workflows fail with an access violation under APM
    A certain sequence of the events may trigger an access violation in APM code when it tries to read data from the cache during the Application Domain unload. This fix resolves this kind of behavior.
  • Application Pool worker process crashes under APM with heap corruption
    During the Application Domain unload two threads might try to dispose of the same memory block leading to DOUBLE FREE heap corruption. This fix makes sure that memory is disposed of only one time.
  • Some Application Pool worker processes become unresponsive if many applications are started under APM at the same time
    Microsoft Monitoring Agent APM service has a critical section around WMI queries it performs. If a WMI query takes a long time to complete, many worker processes are waiting for the active one to complete the call. Those application pools may become unresponsive, depending on the wait duration. This fix eliminates the need in WMI query and significantly improves the performance of this code path.
  • MOMAgent cannot validate RunAs Account if only RODC is available
    If there's a read-only domain controller (RODC), the MOMAgent cannot validate the RunAs account. This fix resolves this issue.
  • Missing event monitor does not warn within the specified time range in SCOM 2012 R2 the first time after restart
    When you create a monitor for a missed event, the first alert takes twice the amount of time specified in the monitor. This fix resolves the issue, and the alert is generated in the time specified.
  • SCOM cannot verify the User Account / Password expiration date if it is set by using Password Setting object
    Fine grained password policies are stored in a different container from the user object container in Active Directory. This fix resolves the problems in computing resultant set of policy (RSOP) from these containers for a user object.
  • SLO Detail report displays histogram incorrectly
    In some specific scenarios, the representation of the downtime graph is not displayed correctly. This fix resolves this kind of behavior.
  • APM support for IIS 10 and Windows Server 2016
    Support of IIS 10 on Windows Server 2016 is added for the APM feature in System Center 2012 R2 Operations Manager. An additional management pack Microsoft.SystemCenter.Apm.Web.IIS10.mp is required to enable this functionality. This management pack is located in %SystemDrive%\Program Files\System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups alongside its dependencies after the installation of Update Rollup 9.
    Important Note One dependency is not included in Update Rollup 9 and should be downloaded separately:

    Microsoft.Windows.InternetInformationServices.2016.mp

  • APM Agent Modules workflow fail during workflow shutdown with Null Reference Exception
    The Dispose() method of Retry Manager of APM connection workflow is executed two times during the module shutdown. The second try to execute this Dispose() method may cause a Null Reference Exception. This fix makes sure that the Dispose() method can be safely executed one or more times.
  • AEM Data fills up SCOM Operational database and is never groomed out
    If you use SCOM’s Agentless Exception Monitoring to examine application crash data and report on it, the data never grooms out of the SCOM Operational database. The problem with this is that soon the SCOM environment will be overloaded with all the instances and relationships of the applications, error groups, and Windows-based computers, all which are hosted by the management servers. This fix resolves this issue. Additionally, the following management packs must be imported in the following order:
    • Microsoft.SystemCenter.ClientMonitoring.Library.mp
    • Microsoft.SystemCenter.DataWarehouse.Report.Library.mp
    • Microsoft.SystemCenter.ClientMonitoring.Views.Internal.mp
    • Microsoft.SystemCenter.ClientMonitoring.Internal.mp
  • The DownTime report from the Availability report does not handle the Business Hours settings
    In the downtime report, the downtime table was not considering the business hours. This fix resolves this issue and business hours will be shown based on the specified business hour values.
    The updated RDL files are located in the following location:

    %SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Reporting

    To update the RDL file, follow these steps:

    1. Go to http://MachineName/Reports_INSTANCE1/Pages/Folder.aspx, where MachineName is your Reporting Server.
    2. On this page, go to the folder to which you want to add the RDL file. In this case, click Microsoft.SystemCenter.DataWarehouse.Report.Library.
    3. Upload the new RDL files by clicking the upload button at the top. For more information, see https://msdn.microsoft.com/en-us/library/ms157332.aspx.
  • Adding a decimal sign in an SLT Collection Rule SLO in the ENU Console on a non-ENU OS does not work
    You run the System Center 2012 R2 Operations Manager Console in English on a computer that has the language settings configured to use a non-English (United States) language that uses a comma (,) as the decimal sign instead of a period (.). When you try to create Service Level Tracking, and you want to add a Collection Rule SLO, the value you enter as the threshold cannot be configured by using a decimal sign. This fix resolves the issue.
  • SCOM Agent issue while logging Operations Management Suite (OMS) communication failure
    An issue occurs when OMS communication failures are logged. This fix resolves this issue.

 

There are no updates for Linux, and there are no updated MP’s for Linux in this update as of this time.  The most current Linux MP’s are available below in the Linux section.

 

Let’s get started.

From reading the KB article – the order of operations is:

  1. Install the update rollup package on the following server infrastructure:
    • Management servers
    • Gateway servers
    • Web console server role computers
    • Operations console role computers
  2. Apply SQL scripts.
  3. Manually import the management packs.
  4. Update Agents

Now, NORMALLY we would need to add another step – if we are using Xplat monitoring, we would need to update the Linux/Unix MP’s and agents.  However, in UR8 and UR9 for SCOM 2012 R2, there are no updates for Linux.

 

 

 

1.  Management Servers

image

Since there is no RMS anymore, it doesn’t matter which management server I start with.  There is no need to begin with whichever server holds the RMSe role.  I simply make sure I only patch one management server at a time to allow for agent failover without overloading any single management server.

I can apply this update manually via the MSP files, or I can use Windows Update.  I have 3 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

The first thing I do when I download the updates from the catalog is copy the cab files for my language to a single location:

Then extract the contents:

image

Once I have the MSP files, I am ready to start applying the update to each server by role.

***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.

My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

image

This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update usually does not provide any feedback on success or failure.

I got a prompt to restart:

image

I choose yes and allow the server to restart to complete the update.

 

You can check the application log for the MsiInstaller events to show completion:

Log Name:      Application
Source:        MsiInstaller
Date:          1/27/2016 9:37:28 AM
Event ID:      1036
Description:
Windows Installer installed an update. Product Name: System Center Operations Manager 2012 Server. Product Version: 7.1.10226.0. Product Language: 1033. Manufacturer: Microsoft Corporation. Update Name: System Center 2012 R2 Operations Manager UR9 Update Patch. Installation success or error status: 0.

You can also spot check a couple DLL files for the file version attribute. 

image

Next up – run the Web Console update:

image

This runs much faster.   A quick file spot check:

image

Lastly – install the console update (make sure your console is closed):

image

A quick file spot check:

image

 

 

Additional Management Servers:

image

I now move on to my additional management servers, applying the server update, then the console update and web console update where applicable.

On this next management server, I will use the example of Windows Update as opposed to manually installing the MSP files.  I check online, and make sure that I have configured Windows Update to give me updates for additional products: 

image

The applicable updates show up under optional – so I tick the boxes and apply these updates.

After a reboot – go back and verify the update was a success by spot checking some file versions like we did above.

 

 

Updating Gateways:

image

I can use Windows Update or manual installation.

image

The update launches a UI and quickly finishes.

Then I will spot check the DLL’s:

image

I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:

image

 

***NOTE:  You can delete any older UR update files from the \AgentManagement directories.  The UR’s do not clean these up and they provide no purpose for being present any longer.

 

 

 

2. Apply the SQL Scripts

In the path on your management servers, where you installed/extracted the update, there are two SQL script files: 

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

image

First – let’s run the script to update the OperationsManager database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file.  Make sure it is pointing to your OperationsManager database, then execute the script.

You should run this script with each UR, even if you ran it on a previous UR.  The script body can change, so as a best practice, always re-run it.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.  I have had customers state this takes from a few minutes to as long as an hour. In MOST cases – you will need to shut down the SDK, Config, and Monitoring Agent (healthservice) on ALL your management servers in order for this to be able to run with success.

You will see the following (or similar) output:

image47

or

image

IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a production environment, you will almost certainly have to shut down the services (SDK, Config, and HealthService) on your management servers, to break their connection to the databases, to get a successful run.

Technical tidbit:   Even if you previously ran this script in UR1, UR2, UR3, UR4, UR5, UR6, UR7, or UR8, you should run this again for UR9, as the script body can change with updated UR’s.

image

Next, we have a script to run against the warehouse DB.  Do not skip this step under any circumstances.    From:

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

Open a SQL management studio query window, connect it to your OperationsManagerDW database, and then open the script file UR_Datawarehouse.sql.  Make sure it is pointing to your OperationsManagerDW database, then execute the script.

If you see a warning about line endings, choose Yes to continue.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

You will see the following (or similar) output:

image

 

 

 

3. Manually import the management packs

image

There are 55 management packs in this update!   Most of these we don’t need – so read carefully.

The path for these is on your management server, after you have installed the “Server” update:

\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups

However, the majority of them are Advisor/OMS, and language specific.  Only import the ones you need, and that are correct for your language.  I will remove all the MP’s for other languages (keeping only ENU), and I am left with the following:

image

 

What NOT to import:

The Advisor MP’s are only needed if you are using Microsoft Operations Management Suite cloud service, (Previously known as Advisor, and Operation Insights).

The APM MP’s are only needed if you are using the APM feature in SCOM.

Note the APM MP with a red X.  This MP requires the IIS MP’s for Windows Server 2016 which are in Technical Preview at the time of this writing.  Only import this if you are using APM *and* you need to monitor Windows Server 2016.  If so, you will need to download and install the technical preview editions of that MP from https://www.microsoft.com/en-us/download/details.aspx?id=48256

The TFS MP bundle is only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

However, the Image and Visualization libraries deal with Dashboard updates, and these always need to be updated.

I import all of these shown without issue.

 

 

4.  Update Agents

image43_thumb

Agents should be placed into pending actions by this update for any agent that was not manually installed (remotely manageable = yes):  

 

On the management servers where I used Windows Update to patch them, their agents did not show up in this list.  Only agents where I manually patched their management server showed up in this list.  FYI.   The experience is NOT the same when using Windows Update vs. manual installation.  If yours don’t show up – you can try running the update for that management server again – manually.

image

If your agents are not placed into pending management – this is generally caused by not running the update from an elevated command prompt, or having manually installed agents which will not be placed into pending.


I re-ran the server MSP file manually on these management servers, from an elevated command prompt, and they all showed up:

image

 

 

You can approve these – which will result in a success message once complete:

image

Soon you should start to see PatchList getting filled in from the Agents By Version view under Operations Manager monitoring folder in the console:

image
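If you prefer to query the database directly rather than wait on the console view, a query along these lines can show which agents report the new patch level.  This is a sketch only – it assumes the MTV_HealthService view in the OperationsManager database exposes DisplayName, Version, and PatchList columns, so verify against your own instance:

```sql
-- Hedged sketch: list health services and the patch level they report
SELECT DisplayName, Version, PatchList
FROM MTV_HealthService
ORDER BY PatchList, DisplayName
```

Agents still showing an empty or old PatchList after a config refresh are the ones to chase down.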

 

 

 

5.  Update Unix/Linux MPs and Agents

image

There are no updates for Linux in UR9 at the time of this writing.   The current Linux MP’s can be downloaded from:

https://www.microsoft.com/en-us/download/details.aspx?id=29696

7.5.1045.0 is current at this time for SCOM 2012 R2 and these shipped with UR7.  If you are already running 7.5.1045.0 version of the Linux MP’s and agents – no update is necessary.

****Note – take GREAT care when downloading – that you select the correct download for R2.  You must scroll down in the list and select the MSI for 2012 R2:

image

Download the MSI and run it.  It will extract the MP’s to C:\Program Files (x86)\System Center Management Packs\System Center 2012 R2 Management Packs for Unix and Linux\

Update any MP’s you are already using.   These are mine for RHEL, SUSE, and the Universal Linux libraries. 

image

You will likely observe VERY high CPU utilization of your management servers and database server during and immediately following these MP imports.  Give it plenty of time to complete the process of the import and MPB deployments.

Next up – you would upgrade your agents on the Unix/Linux monitored agents.  You can now do this straight from the console:

image

image

You can input credentials or use existing RunAs accounts if those have enough rights to perform this action.

Mine FAILED, with an SSH exception about copying the new agent.  It turns out my files were not updated on the management server – see pic:

image

I had to restart the Healthservice on the management server, and within a few minutes all the new files were there.
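The console wizard can also be scripted.  A hedged sketch using the SCOM 2012 R2 UNIX/Linux cmdlets (Get-SCXAgent / Update-SCXAgent, run from the Operations Manager Shell) might look like the following – it assumes the Run As accounts configured for the resource pool have sufficient rights to perform the upgrade:

```powershell
# Hedged sketch: upgrade all discovered UNIX/Linux agents
# Run from the Operations Manager Shell on a management server
Get-SCXAgent | ForEach-Object {
    Update-SCXAgent -Agent $_
}
```

As with the console method, make sure the new agent files have been deployed to the management servers first, or the SSH copy step will fail the same way mine did.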

Finally:

image

 

 

6.  Update the remaining deployed consoles

image

This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, SCVMM server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the matching update version.

 

 

 

Review:

Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

image

Known issues:

See the existing list of known issues documented in the KB article.

1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop all the SCOM services on the management servers, and/or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

——————————————————
(1 row(s) affected)
(1 row(s) affected)
Msg 1205, Level 13, State 56, Line 1
Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Msg 3727, Level 16, State 0, Line 1
Could not drop constraint. See previous errors.
——————————————————–


Removing / Migrating old Management Servers to new ones


 

This is a common practice for rotating old physical servers coming off lease, or when moving VM based management servers to a new operating system. 

 

There are some generic instructions on TechNet here:  https://technet.microsoft.com/en-us/library/hh456439.aspx   however, these don’t really paint the whole picture of what all should be checked first.  Customers sometimes run into orphaned objects, or management servers they cannot delete because the MS is hosting remote monitoring activities.

Here is a checklist I have put together, the steps are not necessarily enforced in this order… so you can rearrange much of this as you see fit.

 

  • Install new management server(s)
  • Configure any registry modifications in place on existing management servers for the new MS
  • Patch new MS with current UR to bring parity with other management servers in the management group.
  • If you have gateways reporting to old management servers, install certificates from the same trusted publisher on the new MS, and then use PowerShell to change GW to MS assignments.
  • Inspect Resource pools. Make sure old management server is removed from any Resource pools with manual membership, and place new management servers in those resource pools.
  • If you have any 3rd party service installations, ensure they are installed as needed on new MS (connector services, hardware monitoring add-ons).
  • If you have any hard coded script or EXE paths in place for notifications or scheduled tasks, ensure those are moved.
  • If you run the Exchange 2010 Correlation engine – ensure it is moved to a new MS.
  • If you use any URL watcher nodes hard coded to a management server – ensure those are moved to a new MS. (Web Transaction Monitoring)
  • If you have any other watcher nodes – migrate those templates (OLEDB probe, port, etc.)
  • If you have any custom registry keys in place on a MS, to discover it as a custom class for any reason, ensure these are migrated.
  • If you have any special roles, such as the RMS Emulator (RMSe), migrate them.
  • Ensure the new MS will host optional roles such as web console or console roles if required.
  • Migrate any agent assignments in the console or AD integration.
  • Ensure you have BOTH management servers online for a considerable time to allow all agents to get updated config – otherwise you will orphan the agents until they know about the new management server.
  • If you perform UNIX/LINUX monitoring, these should migrate with resource pools. You will need to import and export SCX certs for the new management servers that will take part in the pool.
  • If you use IM notifications, ensure the prerequisites are installed on the new MS.
  • Ensure any new management servers are allowed to send email notifications to your SMTP server if it uses an access list.
  • If you have any network devices, ensure the discovery is moved to another MS for any MS that is being removed.
  • If you are using AEM, ensure this role is reconfigured for any retiring MS.
  • If you are using ACS and the collector role needs to be migrated, perform this and update the forwarders to their new collector.
  • If you have customized heartbeat settings for the management servers, ensure these are consistent.
  • If you have any agentless monitored systems (rare), reassign their proxy agent.
  • If you were running a hardware load balancer for the SDK service connections – remove the old management servers and add new ones.
  • Review event logs on new management servers and ensure there aren't any major health issues.
  • Uninstall old management server gracefully.
  • Delete management server object in console if required post-uninstall.
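For the gateway step above, the reassignment can be scripted.  A hedged sketch using the SCOM 2012 cmdlets (run from the Operations Manager Shell – the server names below are hypothetical placeholders):

```powershell
# Hedged sketch: point a gateway at a new primary management server
$gw    = Get-SCOMGatewayManagementServer -Name "GW01.domain.com"   # hypothetical gateway
$newMS = Get-SCOMManagementServer        -Name "MS02.domain.com"   # hypothetical new MS

Set-SCOMParentManagementServer -Gateway $gw -PrimaryServer $newMS
```

Set-SCOMParentManagementServer also accepts a -FailoverServer parameter, so assign a failover MS at the same time if your gateways use one.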

 

If you have any additional steps you feel should be part of this list – feel free to comment.

Event 18054 errors in the SQL application log – in SCOM 2012 R2 deployments


 

I wrote about this issue for SCOM 2007 here:

http://blogs.technet.com/b/kevinholman/archive/2010/10/26/after-moving-your-operationsmanager-database-you-might-find-event-18054-errors-in-the-sql-server-application-log.aspx

When SCOM is installed – it doesn’t just create the databases on the SQL instance – it also adds messages for the different error scenarios to sys.messages in the master database for the instance.

This is why after moving a database, or restoring a DB backup to a rebuilt SQL server, we might end up missing this data. 

These are important because they give very good detailed data about the error and how to resolve it.  If you see these – you need to update your SQL instance with some scripts.
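Before running the fix scripts, you can verify whether the messages are actually missing.  A quick check against the instance, using the error number from the 18054 event text below:

```sql
-- Hedged check: is the SCOM-installed message present in this instance?
-- 777980007 is the error number quoted in the 18054 event
SELECT COUNT(*) AS MessageCount
FROM sys.messages
WHERE message_id = 777980007;
```

A count of zero confirms the instance is missing the SCOM messages and needs the scripts run.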

Examples of these events on the SQL server:

Log Name:      Application
Source:        MSSQL$I01
Date:          10/23/2010 5:40:14 PM
Event ID:      18054
Task Category: Server
Level:         Error
Keywords:      Classic
User:          OPSMGR\msaa
Computer:      SQLDB1.opsmgr.net
Description:
Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

You might also notice some truncated events in the OpsMgr event log, on your RMS or management servers:

Event Type:    Warning
Event Source:    DataAccessLayer
Event Category:    None
Event ID:    33333
Date:        10/23/2010
Time:        5:40:13 PM
User:        N/A
Computer:    OMMS3
Description:
Data Access Layer rejected retry on SqlError:
Request: p_DiscoverySourceUpsert — (DiscoverySourceId=f0c57af0-927a-335f-1f74-3a3f1f5ca7cd), (DiscoverySourceType=0), (DiscoverySourceObjectId=74fb2fa8-94e5-264d-5f7e-57839f40de0f), (IsSnapshot=True), (TimeGenerated=10/23/2010 10:37:36 PM), (BoundManagedEntityId=3304d59d-5af5-ba80-5ba7-d13a07ed21d4), (IsDiscoveryPackageStale=), (RETURN_VALUE=1)
Class: 16
Number: 18054
Message: Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

Event Type:    Error
Event Source:    Health Service Modules
Event Category:    None
Event ID:    10801
Date:        10/23/2010
Time:        5:40:13 PM
User:        N/A
Computer:    OMMS3
Description:
Discovery data couldn't be inserted to the database. This could have happened because  of one of the following reasons:

     – Discovery data is stale. The discovery data is generated by an MP recently deleted.
     – Database connectivity problems or database running out of space.
     – Discovery data received is not valid.

The following details should help to further diagnose:

DiscoveryId: 74fb2fa8-94e5-264d-5f7e-57839f40de0f
HealthServiceId: bf43c6a9-8f4b-5d6d-5689-4e29d56fed88
Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage..

 

I have created some SQL scripts which are taken from the initial installation files, and you can download them below.  You simply run them in SQL Management studio to get this data back.

These are for SCOM 2012 R2 ONLY!!!!

 

Download link:   https://gallery.technet.microsoft.com/SQL-to-fix-event-18054-c4375367

Alert Lifecycle Management


 

Sometimes – this is almost a dirty word in some companies.  It is applying an ITSM process around monitoring, to ensure alerts are real, actionable, assigned, accountable, and reportable.

In my travels, I see companies with an excellent process around this.  I also see companies with ZERO process. 

My colleague Nathan Gau has a 3-part series on this topic – check it out over here:

 

http://blogs.technet.com/b/nathangau/archive/2016/02/04/the-anatomy-of-a-good-scom-alert-management-process-part-1-why-is-alert-management-necessary.aspx

The impact of moving databases in SCOM


 

I recently had an interesting customer issue.

We were deploying a new management group to do some performance testing of the impact to SCOM performance as we scale up agents.  This particular management group only had the default MP’s from installing SCOM, and the Base OS MP’s.  Nothing more.

When we scaled up to ~2000 agents, we took a checkpoint at performance.  The console was zippy, and the management servers were having no issues.  However – when we analyzed performance on the database, we saw really high CPU.

image

 

Zooming into a smaller time chunk – the CPU was pretty wild:

 

image

 

What we found – was that the customer had moved the SCOM databases to a different server than originally installed to.  When they did this – they did not fully follow the TechNet instructions, to ensure that SQL Broker is enabled and CLR is enabled.

You can check these settings:

SQL Broker:

SELECT is_broker_enabled FROM sys.databases WHERE name='OperationsManager'

CLR:

SELECT * FROM sys.configurations WHERE name = 'clr enabled'

Both should return a value of “1” to show they are enabled.

Changing these values is covered here:  https://technet.microsoft.com/en-ca/library/hh278848.aspx
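The changes themselves are small.  Per the TechNet article above, the statements look roughly like this – note that the Broker change requires exclusive access to the database, so stop the SCOM services (or force sessions off with ROLLBACK IMMEDIATE) first:

```sql
-- Enable SQL Broker on the OperationsManager database
-- (requires exclusive access - stop the SCOM services first)
ALTER DATABASE OperationsManager SET ENABLE_BROKER WITH ROLLBACK IMMEDIATE;

-- Enable CLR for the instance
sp_configure 'clr enabled', 1;
RECONFIGURE;
```

Re-run the two SELECT checks afterward and confirm both return “1” before restarting the SCOM services.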

Always make sure you handle the other changes necessary when moving a database, and don’t forget to add the sysmessages back, documented here:  Event 18054 errors in the SQL application log – in SCOM 2012 R2 deployments

 

After making these changes – the impact was significant, going from 50% avg CPU consumption, to 11%.

 

24 hour snapshot:

image

One hour snapshot:

image

 

Whenever you visit a SCOM customer, or inherit a SCOM environment that you don’t know the full history on, they might not have these settings optimized, and they might not even be aware they are impacted, especially if their agent count is low.  There are other symptoms you’d see, such as regular expressions failing in the logs without CLR enabled, and agent discovery not working without SQL Broker… but these are always good things to inspect when reviewing the health of a deployment.

Windows 10 Client MP’s are available


 

image

 

Download here:    https://www.microsoft.com/en-us/download/details.aspx?id=51189

The client OS MP’s are available when you need to monitor Windows clients in your SCOM management group.  These might be “light” monitoring of desktops and laptops in the organization, or these might be for mission critical roles such as Kiosks and ATM type machines running a Windows client OS.

 

image

The MP’s will upgrade your base client library (it still has a name referencing SCOM 2007, but these are applicable to SCOM 2012) and will import additional MP’s specific to discovering and monitoring Windows 10 clients.

 

image

If you are importing this MP for Windows 10 clients, and you also already monitor Windows 8 clients, make SURE you update your Windows 8 MP’s to the latest version 6.0.7251.0, available here:  https://www.microsoft.com/en-us/download/details.aspx?id=38434    The 6.0.7251.0 MP’s contain a fix to stop discovering a Windows 10 client as a Windows 8 client; otherwise you will get duplicate monitoring and overload your Windows 10 clients unnecessarily.  Make sure you upgrade the Windows 8 MP’s FIRST, before installing the agent on any Windows 10 clients.  If you still have duplicate instances of Windows 8 Computer for a Windows 10 client, delete the agent from Agent Managed in SCOM, then approve it again, and this will clean up the old discovered objects from the Windows 8 client MP’s.

 

Individual workflows are enabled on every client computer, to discover and monitor disks, memory, CPU, etc.  However, the monitors are all set to not generate alerts via overrides.  You have to put clients in a “Business Critical” group in order to see alerts for these clients.  However, the monitors will still show health state for all clients.  Just not alerts.

Same goes for performance collection rules.  There are overrides to enable these (all disabled out of the box) and collect performance data for business critical computers.

The guide also discusses the use of aggregate client monitoring.  These load special workflows that fill the data warehouse with trending reports, and run SQL queries against the warehouse on a regular basis.  Make sure you DON’T import the Aggregate MP’s if you don’t want or need this type of monitoring, as it is optional.

See the MP guide for advanced details on how to configure this MP, and other client OS management packs.

Base OS MP’s have been updated – version 6.0.7303.0


 

***WARNING***  There are some significant issues in this release of the Base OS MP, I do not recommend applying this one until an updated version comes out.

Issues:

  • Cluster Disks on Server 2008R2 clusters are no longer discovered as cluster disks.
  • Cluster Disks on Server 2008 clusters are not discovered as logical disks.
  • Quorum (or small size) disks on clusters that ARE discovered as Cluster disks, do not monitor for free space correctly.
  • Cluster shared volumes are discovered twice, once as a Cluster Shared Volume instance, and once as a Logical Disk instance, with the latter likely caused by enabling mounted disk discovery.
  • On Hyper-V servers, I discover an extra disk, which has no properties:

image

 

 

What was changed?

 

From the guide:

MP used to discover physical CPU, which performance monitor instance name property was not correlated with Windows PerfMon object (expecting instance name in (socket, core) format). That affected related rules and monitors. With this release, MP discovers logical processors, rather than physical, and populates performance monitor instance name in proper format

That was a real problem for anyone trying to monitor individual CPU’s in the past – we actually discovered “sockets” not cores – so this didn’t jive with Perfmon at all.  I look forward to testing this.

Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.mp and Microsoft.Windows.Server.Library.mp scripts code migration to PowerShell in scope of Windows Server 2016 Nano support (relevantly introduced in Windows Server 2016 MP version 10.0.1.0).

It is these changes that likely broke cluster disk discovery.

Updated Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.ClusterSharedVolume.Monitoring.State monitor alert properties and description. The fix resolved property replacement failure warning been generated on monitor alert firing.

Exchange 2013 Addendum MP – for Exchange 2013 and 2016


 

image

 

 

 

The Exchange 2013 MP has been released for some time now.  The current version at this writing is 15.0.666.19 which you can get HERE

This MP can be used to discover and monitor Exchange Server 2013 and 2016.

 

 

 

 

However, one of the things I always disliked about this MP is that it does not use a seed class discovery.  Instead, it runs a PowerShell script every 4 hours on EVERY machine in your management group, looking for Exchange servers.  The problem is that this doesn’t follow best practices – as a general rule, we should NOT run scripts on all servers unless truly necessary.  Another issue – many customers have servers running 2003 and 2008 that DON’T have PowerShell installed!  You will see nuisance events like the following:

 

Event Type:    Error
Event Source:    Health Service Modules
Event Category:    None
Event ID:    21400
Date:        3/2/2016
Time:        3:29:26 AM
User:        N/A
Computer:    WINS2003X64
Description:
Failed to create process due to error '0x80070003 : The system cannot find the path specified.
', this workflow will be unloaded.
Command executed:    "C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe" -PSConsoleFile "bin\exshell.psc1" -Command "& '"C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Monitoring Host Temporary Files 85\26558\MicrosoftExchangeDiscovery.ps1"'" 0 '{3E7D658E-FA5E-924E-334E-97C84E068C4A}' '{B21B34F9-2817-4800-73BD-012E79609F7E}' 'wins2003x64.dmz.corp' 'wins2003x64' 'Default-First-Site-Name' 'dmz.corp' '' '' '0' 'false'
Working Directory:    C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\Monitoring Host Temporary Files 85\26558\
One or more workflows were affected by this. 
Workflow name: Microsoft.Exchange.15.Server.DiscoveryRule
Instance name: wins2003x64.dmz.corp
Instance ID: {B21B34F9-2817-4800-73BD-012E79609F7E}
Management group: OMMG1

 

 

So, I have created an addendum MP which should resolve this.  My MP creates a class and discovery, looking for “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\Setup\MsiInstallPath” in the registry.  If it finds the registry path, SCOM will add it as an instance of my seed class.

image
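To sanity-check which servers would match the seed discovery, you can test the same registry value locally on a candidate server.  A hedged PowerShell one-liner mirroring the discovery’s registry check:

```powershell
# Returns True on servers where the Exchange 2013/2016 seed discovery would match
$null -ne (Get-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\v15\Setup' `
    -Name MsiInstallPath -ErrorAction SilentlyContinue)
```

Any server returning True should end up as an instance of the seed class after the discovery interval elapses.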

 

Then, I created a group of Windows Computer objects that “contain” an instance of the seed class. 

image

 

Next, I added an override to disable the main script discovery in the Exchange 2013 MP.

Finally, I added an override to enable this same discovery for my custom group.  This has the effect that the Exchange discovery script ONLY runs on servers that actually have Exchange installed (based on the registry key).

image

 

 

This works for discovering Exchange 2013 and Exchange 2016 with the current Exchange 2013 MP.

 

You can download this sample MP at the following location:

https://gallery.technet.microsoft.com/Exchange-Server-2013-and-cfdfcf2f


How to generate an alert and make it look like it came from someone else


 

This capability has been around forever, but I have never seen it documented.  This is a really cool way to generate alerts as if they came from other agents, but target a different agent.

Suppose a scenario:  You have a client/server application (such as a backup program) where a central server logs all the events about success or failed jobs from clients.

In this scenario – we could simply generate alerts targeting the central server, reading the event log, and bubbling up the broken client name from the logs into the alert.  The challenge becomes: what if some agents are test or dev, and some are prod?  What if we have already put in place “tiering” of servers by groupings, and we use this to filter which alerts from which servers get ticketed?

There is actually a way to target one instance of a class with a workflow, but to generate alerts as if they came from a different instance of a different class, EVEN if that instance is a different agent altogether!

Let me demonstrate:

The most common write action for generating alerts in rules, is System.Health.GenerateAlert, which is the one commonly used in every Alert Generating rule you typically come across.  It is documented here:  https://msdn.microsoft.com/en-us/library/ee809352.aspx

HOWEVER – there is another write action you can use:  System.Health.GenerateAlertForType. 

This is documented here:  https://msdn.microsoft.com/en-us/library/jj130310.aspx  While we document the modules and a sample XML example, we don’t really give much guidance anywhere on use cases.

This is a really cool write action, which allows us to generate alerts “on behalf” of a different object type, or even a different object type from a different computer!  Let me show the difference:

A typical System.Health.GenerateAlert looks like this:

<WriteAction ID="GenerateAlert" TypeID="Health!System.Health.GenerateAlert">
  <Priority>1</Priority>
  <Severity>1</Severity>
  <AlertMessageId>$MPElement[Name="Demo.Rule.AlertMessage"]$</AlertMessageId>
  <AlertParameters>
    <AlertParameter1>$Data/EventDescription$</AlertParameter1>
  </AlertParameters>
</WriteAction>

As you can see – very simple.  It sets the priority and severity of the alert, references the Alert Message ID (which is the alert name and description configuration) and contains any alert parameters we want to use in the display output (in this case, Event Description is very common).

 

Now, see the System.Health.GenerateAlertForType:

<WriteAction ID="GenerateAlertForTypeWA" TypeID="Health!System.Health.GenerateAlertForType">
  <Priority>1</Priority>
  <Severity>1</Severity>
  <ManagedEntityTypeId>$MPElement[Name="Example.Client.Class"]$</ManagedEntityTypeId>
  <KeyProperties>
    <KeyProperty>
      <PropertyId>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</PropertyId>
      <IsCaseSensitive>false</IsCaseSensitive>
      <Value>servername.fqdn.local</Value>
    </KeyProperty>
    <KeyProperty>
      <PropertyId>$MPElement[Name="Example.Client.Class"]/ClientName$</PropertyId>
      <IsCaseSensitive>false</IsCaseSensitive>
      <Value>servername.fqdn.local</Value>
    </KeyProperty>
  </KeyProperties>
  <AlertMessageId>$MPElement[Name="Demo.Rule.AlertMessage"]$</AlertMessageId>
  <AlertParameters>
    <AlertParameter1>$Data/EventDescription$</AlertParameter1>
  </AlertParameters>
</WriteAction>

The key section here is <ManagedEntityTypeId> and then some <KeyProperties>

In the <ManagedEntityTypeId> we need to reference the CLASS that we want the alert to appear as it is coming FROM.

Then, in the <KeyProperties> we need two sections:

The first key property is mapping the Windows Computer principal name to the fqdn of the agent we want the alert to “appear to be from”.  This part is easy.

The second key property is mapping the SAME fqdn, to a matching property on the CLASS we referenced in <ManagedEntityTypeId>, or a parent base class that has the key property defined.

The second key property is the tough one.  The criteria for this to work (from my testing) is that we MUST have a class with a key property first, and that key property MUST be the fqdn of the agent/server for each instance (or whatever value we are “matching” on).

In most of my classes I create, I don’t create key properties.  Key properties aren't required unless I have a class that will discover multiple instances on the same healthservice (agent).  For stuff I do – this is rarely the case.  However, it is EASY to create a key property for your custom classes, and many Microsoft classes already have key properties.  The big “gotchya” here is that in order to generate an alert for another instance of a class (not the targeted instance), the class we specify MUST have a key property defined for this to work.

So – I simply added a key property of “ClientName” to my custom class, and then to discover it, all I have to do is add some simple code to the discovery which maps the hosting Windows Computer principal name to the property.

Ok…. I know…. I probably lost a lot of you up to this point….. but it is easier to just do it, than it is to understand it.  That’s why I will post my XML examples at a link below.

 

Here is an example of me adding a custom key property to my custom class:

<ClassType ID="Example.AlertFromAnotherInstance.Client.Class" Accessibility="Public" Abstract="false" Base="Windows!Microsoft.Windows.LocalApplication" Hosted="true" Singleton="false" Extension="false">
  <Property ID="ClientName" Type="string" AutoIncrement="false" Key="true" CaseSensitive="false" MaxLength="256" MinLength="0" Required="false" Scale="0" />
</ClassType>

And here is part of the discovery that I will use to map “ClientName” to the hosting Windows Computer principal name:

 

<Discovery ID="Example.AlertFromAnotherInstance.Client.Class.Discovery" Enabled="true" Target="Windows!Microsoft.Windows.Server.OperatingSystem" ConfirmDelivery="false" Remotable="true" Priority="Normal">
  <Category>Discovery</Category>
  <DiscoveryTypes>
    <DiscoveryClass TypeID="Example.AlertFromAnotherInstance.Client.Class" />
  </DiscoveryTypes>
  <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.FilteredRegistryDiscoveryProvider">
    <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</ComputerName>
    <RegistryAttributeDefinitions>
      <RegistryAttributeDefinition>
        <AttributeName>ClientExists</AttributeName>
        <Path>SOFTWARE\Demo\Client</Path>
        <PathType>0</PathType>
        <AttributeType>0</AttributeType>
      </RegistryAttributeDefinition>
    </RegistryAttributeDefinitions>
    <Frequency>86400</Frequency>
    <ClassId>$MPElement[Name="Example.AlertFromAnotherInstance.Client.Class"]$</ClassId>
    <InstanceSettings>
      <Settings>
        <Setting>
          <Name>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Name>
          <Value>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Value>
        </Setting>
        <Setting>
          <Name>$MPElement[Name="System!System.Entity"]/DisplayName$</Name>
          <Value>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Value>
        </Setting>
        <Setting>
          <Name>$MPElement[Name="Example.AlertFromAnotherInstance.Client.Class"]/ClientName$</Name>
          <Value>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Value>
        </Setting>
      </Settings>
    </InstanceSettings>

 

So – now all I need to do it write a rule, and use our new write action.

You can write the event rule like my example will do using the console, or any other tool, then simply modify the write action section in XML.

Here is my simple rule:

 

<Rule ID="Example.AlertFromAnotherInstance.Server.Event.Rule" Enabled="true" Target="Example.AlertFromAnotherInstance.CentralServer.Class" ConfirmDelivery="false" Remotable="true" Priority="Normal" DiscardLevel="100">
  <Category>EventCollection</Category>
  <DataSources>
    <DataSource ID="Microsoft.Windows.EventCollector" TypeID="Windows!Microsoft.Windows.EventCollector">
      <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
      <LogName>Application</LogName>
      <AllowProxying>false</AllowProxying>
      <Expression>
        <And>
          <Expression>
            <SimpleExpression>
              <ValueExpression>
                <XPathQuery Type="UnsignedInteger">EventDisplayNumber</XPathQuery>
              </ValueExpression>
              <Operator>Equal</Operator>
              <ValueExpression>
                <Value Type="UnsignedInteger">999</Value>
              </ValueExpression>
            </SimpleExpression>
          </Expression>
          <Expression>
            <SimpleExpression>
              <ValueExpression>
                <XPathQuery Type="String">PublisherName</XPathQuery>
              </ValueExpression>
              <Operator>Equal</Operator>
              <ValueExpression>
                <Value Type="String">TEST</Value>
              </ValueExpression>
            </SimpleExpression>
          </Expression>
        </And>
      </Expression>
    </DataSource>
  </DataSources>
  <WriteActions>
    <WriteAction ID="GenerateAlertForTypeWA" TypeID="Health!System.Health.GenerateAlertForType">
      <Priority>1</Priority>
      <Severity>2</Severity>
      <ManagedEntityTypeId>$MPElement[Name="Example.AlertFromAnotherInstance.Client.Class"]$</ManagedEntityTypeId>
      <KeyProperties>
        <KeyProperty>
          <PropertyId>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</PropertyId>
          <IsCaseSensitive>false</IsCaseSensitive>
          <Value>$Data/Params/Param[1]$</Value>
        </KeyProperty>
        <KeyProperty>
          <PropertyId>$MPElement[Name="Example.AlertFromAnotherInstance.Client.Class"]/ClientName$</PropertyId>
          <IsCaseSensitive>false</IsCaseSensitive>
          <Value>$Data/Params/Param[1]$</Value>
        </KeyProperty>
      </KeyProperties>
      <AlertMessageId>$MPElement[Name="Example.AlertFromAnotherInstance.Server.Event.Rule.AlertMessage"]$</AlertMessageId>
      <AlertParameters>
        <AlertParameter1>$Data/Params/Param[1]$</AlertParameter1>
        <AlertParameter2>$Data/EventDescription$</AlertParameter2>
      </AlertParameters>
    </WriteAction>
  </WriteActions>
</Rule>

 

The rule is simple.  It looks in the Application event log for an event ID 999, with an event source of “TEST”.  If found, it runs the write action.  If you scroll down, you can see the write action part, which I will explain:

 

In my rule, I am targeting the workflow to run on the “Server” class.  However, in my write action, I want the alert generated by instances of the “Client” class.  So on my <ManagedEntityTypeId> line, I am using Example.AlertFromAnotherInstance.Client.Class which is my client class ID.

Next, I map the key property for Windows Computer (Principal Name) to the machine I want to appear to generate the alert.  In this case, the name of the affected machine is in Param 1 of my test event, so I am mapping whatever name is in Param1 of the event to generate the alert.

Next, I map the key property of my custom class to the SAME FQDN value.

That’s it!

 

In this example – I create an event on my “Server”, and param 1 of the event will have the name of the client I want the alert to come from:

image

 

Note:  in the above image – the event was logged on a Server named “STORAGE.opsmgr.net” but param1 contained a name of “RD01.opsmgr.net”.

As long as RD01.opsmgr.net hosts an instance of my “Client” class, an alert will be generated as if it came from this server:

 

image

 

 

If you want to test my example XML out in your own environment, simply create some reg keys to be the “Server” and the “Client” instances to be discovered:

HKEY_LOCAL_MACHINE\SOFTWARE\Demo\Server

HKEY_LOCAL_MACHINE\SOFTWARE\Demo\Client

 

The example management pack is available for download at:  https://gallery.technet.microsoft.com/Management-pack-sample-How-8b6741e3

How to remove OMS and Advisor management packs


 

When testing OMS (Previously called Advisor) with SCOM, there is one side effect:  Once connected, the OMS rules import management packs into your management group with no notification or change control process for you.  Furthermore – if you want to remove OMS Management packs from a SCOM management group, there is a rule that will actually re-download them while you are trying to delete them!  This makes OMS very difficult to remove by default.

Brian Wren posted a method to control this behavior here, and I will demonstrate the same.

https://blogs.technet.microsoft.com/msoms/2016/03/16/control-management-pack-updates-between-ms-oms-and-operations-manager/

 

First, create a new management pack to store our temporary overrides – called “OMS Temp Overrides”

Then in the console, go to Authoring > Rules, and set your scope only to “Operations Manager Management Group”

Disable the following two rules:

image

 

This will stop new OMS/Advisor packs from coming down automatically.

 

Now you can start removing the packs as needed from your management group.    You can use PowerShell to do this in bulk, but it will fail for any MP’s with dependencies.  Here is a simple example:

Get-SCOMManagementPack -Name "*advisor*" | Remove-SCOMManagementPack

Get-SCOMManagementPack -Name "*IntelligencePack*" | Remove-SCOMManagementPack

Get-SCOMManagementPack -Name "Microsoft.EnterpriseManagement.Mom.Modules.AggregationModuleLibrary" | Remove-SCOMManagementPack

Be VERY careful using the above statements – they are provided as examples only.  Make SURE they return only the ones you wish to remove and not any custom packs you created that happen to match the naming scheme.
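One safe way to do that is a dry run that only lists the matching packs, so you can review the names before piping anything to Remove-SCOMManagementPack (a sketch; adjust the wildcard to your environment):

```powershell
# Dry run: list what the wildcard actually matches - remove nothing yet.
# Review this output for custom packs before running the removal one-liners.
Get-SCOMManagementPack -Name "*advisor*" |
    Sort-Object Name |
    Format-Table Name, DisplayName, Version
```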

Now – that should leave you with just the following MP’s:

 

image

 

Delete the temp override MP you created, then (quickly) delete the above MP’s in the order shown.

That’s it.

 

If you want to bring OMS back into a Management Group – simply import the Advisor Packs in whatever current UR (Update Rollup) you are on, such as these from UR9:

image

How to monitor for event logs and use a script to modify the output – a composite datasource


 

A common request I hear is the customer wants to monitor for events in a Windows Event log.  That part is easy.  We have simple event rules and monitors for that activity.

However – what if the data in the event log needed to be parsed, or modified in some way, before passing to the alert?

 

For instance, I had a customer who needed to monitor for an event from a central server (a backup software platform) about job failures from backup clients.  The event is just a single parameter: one big blob of text that has all the data.  The FQDN of the client is in the event description, but surrounded by a lot of other information.

 

The challenge is – that for ticketing, the customer needs to place ONLY the FQDN of the CLIENT machine (which is in the event body) into a custom field of an alert.

 

SCOM doesn’t have any good data manipulation capability in the native modules, so in this case we will execute a script in response to the event.  We do this by creating a composite datasource, combining the event log module and the script probe action module.

When an event shows up that matches our criteria, we then execute a script to parse the event description, and create a propertybag in order to output this customized data to the Alert write action.

Obviously, one must take care not to put something like this in place for events that might flood the log, because the SCOM agent will try to run a script for each and every event, which could overwhelm the system.  I actually tested this with a pretty significant event flood, and it was not a big deal at all; the system kept up very nicely.

 

For my “test” event – I have created a block of text with the FQDN of the remote client machine in the body of the description:

Log Name:      Application
Source:        TEST
Date:          4/1/2016 5:13:47 PM
Event ID:      888
Computer:      RD01.opsmgr.net
Description:
foo db01.opsmgr.net foo

Then I wrote a simple script to parse this event and gather the second block of text, which will reliably contain my FQDN:

 

'''''''''''''''''''''''''''''''
'
' Basic SCOM vbscript to accept event data and parse/modify it for output via propertybag
'
'''''''''''''''''''''''''''''''
Option Explicit

Dim oAPI, oBag, sParam1, StartTime, EndTime, ScriptTime, CompNameArr, CompName

'Capture script start time
StartTime = Now

'Gather the argument passed to the script and set to variable
sParam1 = WScript.Arguments(0)

'Split the event data into multiple delimited strings in an array
CompNameArr = Split(sParam1, " ")

'Assume the FQDN is always the 2nd "word" in the event data
CompName = CompNameArr(1)

'Load the SCOM script API and propertybag
Set oAPI = CreateObject("MOM.ScriptAPI")
Set oBag = oAPI.CreatePropertyBag()

'Add the CompName and original description into the propertybag
oBag.AddValue "CompName", CompName
oBag.AddValue "EventDescription", sParam1

'Return the bag for output
oAPI.Return(oBag)

'Capture script runtime
EndTime = Now
ScriptTime = DateDiff("s", StartTime, EndTime)

'Log event with script outputs and runtime
Call oAPI.LogScriptEvent("EventParse.vbs from Example.EventAndScript.DS", 9877, 0, "Event Data Passed to script = " & sParam1 & " -- Output after parsing for CompName = " & CompName & " -- Script Execution Completed in " & ScriptTime & " seconds")

WScript.Quit
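The parsing is just a whitespace split taking the second token.  You can sanity-check the same logic in PowerShell with the sample event text before committing it to a workflow (a quick sketch):

```powershell
# Same parse as the VBScript: split on spaces, take the second "word"
$description = "foo db01.opsmgr.net foo"
$compName = ($description -split " ")[1]
$compName   # db01.opsmgr.net
```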

 

So first off – we need to create the data source.  The easiest tool for creating Composite data sources is the SCOM 2007 R2 Authoring Console.  I’ll just show snippets of XML that you can forklift into your own MP’s:

 

<DataSourceModuleType ID="Example.EventAndScript.DS" Accessibility="Internal" Batching="false">
  <Configuration>
    <xsd:element minOccurs="1" name="LogName" type="xsd:string" />
    <xsd:element minOccurs="1" name="EventID" type="xsd:integer" />
    <xsd:element minOccurs="1" name="EventSource" type="xsd:string" />
  </Configuration>
  <ModuleImplementation Isolation="Any">
    <Composite>
      <MemberModules>
        <DataSource ID="EventDS" TypeID="Windows!Microsoft.Windows.EventProvider">
          <ComputerName />
          <LogName>$Config/LogName$</LogName>
          <Expression>
            <And>
              <Expression>
                <SimpleExpression>
                  <ValueExpression>
                    <XPathQuery Type="UnsignedInteger">EventDisplayNumber</XPathQuery>
                  </ValueExpression>
                  <Operator>Equal</Operator>
                  <ValueExpression>
                    <Value Type="UnsignedInteger">$Config/EventID$</Value>
                  </ValueExpression>
                </SimpleExpression>
              </Expression>
              <Expression>
                <SimpleExpression>
                  <ValueExpression>
                    <XPathQuery Type="String">PublisherName</XPathQuery>
                  </ValueExpression>
                  <Operator>Equal</Operator>
                  <ValueExpression>
                    <Value Type="String">$Config/EventSource$</Value>
                  </ValueExpression>
                </SimpleExpression>
              </Expression>
            </And>
          </Expression>
        </DataSource>
        <ProbeAction ID="ScriptDS" TypeID="Windows!Microsoft.Windows.ScriptPropertyBagProbe">
          <ScriptName>EventParse.vbs</ScriptName>
          <Arguments>"$Data/Params/Param[1]$"</Arguments>
          <ScriptBody><![CDATA[
'''''''''''''''''''''''''''''''
'
' Basic SCOM vbscript to accept event data and parse/modify it for output via propertybag
'
'''''''''''''''''''''''''''''''
Option Explicit

Dim oAPI, oBag, sParam1, StartTime, EndTime, ScriptTime, CompNameArr, CompName

'Capture script start time
StartTime = Now

'Gather the argument passed to the script and set to variable
sParam1 = WScript.Arguments(0)

'Split the event data into multiple delimited strings in an array
CompNameArr = Split(sParam1, " ")

'Assume the FQDN is always the 2nd "word" in the event data
CompName = CompNameArr(1)

'Load the SCOM script API and propertybag
Set oAPI = CreateObject("MOM.ScriptAPI")
Set oBag = oAPI.CreatePropertyBag()

'Add the CompName and original description into the propertybag
oBag.AddValue "CompName", CompName
oBag.AddValue "EventDescription", sParam1

'Return the bag for output
oAPI.Return(oBag)

'Capture script runtime
EndTime = Now
ScriptTime = DateDiff("s", StartTime, EndTime)

'Log event with script outputs and runtime
Call oAPI.LogScriptEvent("EventParse.vbs from Example.EventAndScript.DS", 9877, 0, "Event Data Passed to script = " & sParam1 & " -- Output after parsing for CompName = " & CompName & " -- Script Execution Completed in " & ScriptTime & " seconds")

WScript.Quit
]]></ScriptBody>
          <TimeoutSeconds>30</TimeoutSeconds>
        </ProbeAction>
      </MemberModules>
      <Composition>
        <Node ID="ScriptDS">
          <Node ID="EventDS" />
        </Node>
      </Composition>
    </Composite>
  </ModuleImplementation>
  <OutputType>System!System.PropertyBagData</OutputType>
</DataSourceModuleType>

 

The above composite datasource allows a rule or monitor to call on it, and pass the three required data items to the Microsoft.Windows.EventProvider DS:  Event ID, Event Log Name, and Event source.  Obviously you could modify this as needed.

Next, the Microsoft.Windows.ScriptPropertyBagProbe is called.  I place my script in here, along with the event argument from the event DS, for which I simply use “$Data/Params/Param[1]$”.  When an event contains just a single parameter, everything in the event description is treated as Param 1.
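For reference, this is roughly the shape of the event data item the XPath runs against when the whole description is one insertion string (a hedged sketch of the relevant elements only, not the complete System.Event.Data schema):

```xml
<DataItem type="System.Event.Data">
  <!-- ...other event fields... -->
  <Params>
    <!-- Single-parameter event: the entire description is Param 1 -->
    <Param>foo db01.opsmgr.net foo</Param>
  </Params>
</DataItem>
```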

In the script – I am outputting two propertybag values:

1.  The original event description

2.  The parsed out “CompName” which is the fqdn I am after.

 

Next – I create my rule.  This is much easier to create from scratch using the SCOM 2007 R2 Authoring Console, or VSAE if you are used to that.  Most of the time I find myself just forklifting XML if I can find a close enough match doing what I want.  Here is the rule:

 

 

<Rule ID="Example.EventAndScript.Rule" Enabled="true" Target="Windows!Microsoft.Windows.Server.OperatingSystem" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100">
  <Category>Custom</Category>
  <DataSources>
    <DataSource ID="DS" TypeID="Example.EventAndScript.DS">
      <LogName>Application</LogName>
      <EventID>888</EventID>
      <EventSource>TEST</EventSource>
    </DataSource>
  </DataSources>
  <WriteActions>
    <WriteAction ID="Alert" TypeID="Health!System.Health.GenerateAlert">
      <Priority>1</Priority>
      <Severity>2</Severity>
      <AlertMessageId>$MPElement[Name="Example.EventAndScript.Event888.rule.AlertMessage"]$</AlertMessageId>
      <AlertParameters>
        <AlertParameter1>$Data/Property[@Name='EventDescription']$</AlertParameter1>
        <AlertParameter2>$Data/Property[@Name='CompName']$</AlertParameter2>
      </AlertParameters>
      <Custom1>$Data/Property[@Name='CompName']$</Custom1>
    </WriteAction>
  </WriteActions>
</Rule>

 

 

VERY simple.  I pass my three required configuration items:

            <LogName>Application</LogName>
            <EventID>888</EventID>
            <EventSource>TEST</EventSource>

 

And I configured the Alert write action to output both the full event description, and the parsed FQDN to the alert description.

image

 

Furthermore – to meet the requirements of putting the FQDN in a consistent location for my Ticketing system, I put the parsed FQDN into Custom Field 1 in the alert.

image

 

Now the ticketing system can look for this and make sure the ticket is assigned to the correct server owner.

An interesting option with this kind of workflow is that you could also potentially make the alert in SCOM “appear” as if it originated from the FQDN in the event, as long as that server has a SCOM agent installed and is part of the same management group:  How to generate an alert and make it look like it came from someone else

 

 

You can download the entire sample MP which contains everything above on TechNet gallery:

https://gallery.technet.microsoft.com/SCOM-Management-Pack-to-dec108c6

Writing events with parameters using PowerShell


 

When we write scripts for SCOM workflows, we often log events as the output, for general logging, debug, or for the output as events to trigger other rules for alerting.  One of the common things I need when logging, is the ability to write parameters to the event.  This helps in making VERY granular criteria for SCOM alert rules to match on.

 

One of the things I HATE about the MOM Script API LogScriptEvent method is that it places all the text into a single blob in the event description, all of it being Parameter 1.

Luckily – there is a fairly simple method to create parameterized events to output using your own PowerShell scripts.  I got this from Mark Manty, a fellow PFE.

 

Here is a basic script that demonstrates the capability:

 

#Script to create events with parameters

#Define the event log and your custom event source
$evtlog = "Application"
$source = "MyEventSource"

#These are just examples to pass as parameters to the event
$hostname = "computername.domain.net"
$timestamp = (get-date)

#Load the event source to the log if not already loaded.  This will fail if the event source is already assigned to a different log.
if ([System.Diagnostics.EventLog]::SourceExists($source) -eq $false) {
    [System.Diagnostics.EventLog]::CreateEventSource($source, $evtlog)
}

#Function to create the events with parameters
function CreateParamEvent ($evtID, $param1, $param2, $param3)
{
    $id = New-Object System.Diagnostics.EventInstance($evtID,1)     #INFORMATION EVENT
    #$id = New-Object System.Diagnostics.EventInstance($evtID,1,2)  #WARNING EVENT
    #$id = New-Object System.Diagnostics.EventInstance($evtID,1,1)  #ERROR EVENT
    $evtObject = New-Object System.Diagnostics.EventLog
    $evtObject.Log = $evtlog
    $evtObject.Source = $source
    $evtObject.WriteEvent($id, @($param1,$param2,$param3))
}

#Command line to call the function and pass whatever you like
CreateParamEvent 1234 "The server $hostname was logged at $timestamp" $hostname $timestamp

 

The script uses some variables to set which log you want to write to, and what your custom source is.

The rest is pretty self-explanatory from the comments.

You can add additional params if needed to the function and the command line calling the function.

 

Here is an event example:

 

image

 

 

But the neat stuff shows up in the XML view where you can see the parameters:

 

image
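With the parameters separated like this, a SCOM event rule can match on an individual insertion string instead of the whole description.  A hedged sketch of the expression criteria, matching on the second parameter (the Params/Param[n] XPath is the standard SCOM convention for event insertion strings; the value shown is just the sample hostname from the script above):

```xml
<Expression>
  <SimpleExpression>
    <ValueExpression>
      <XPathQuery Type="String">Params/Param[2]</XPathQuery>
    </ValueExpression>
    <Operator>Equal</Operator>
    <ValueExpression>
      <Value Type="String">computername.domain.net</Value>
    </ValueExpression>
  </SimpleExpression>
</Expression>
```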

Update: Automating Run As Account distribution dynamically


 

Just an FYI – I have updated the automatic run as account distribution script I published, to make it more reliable in large environments, limit resources used and decrease the chance of a timeout, along with adding better debug logging.

 

Get the script and read more here:

Automating Run As Account Distribution – Finally!

 

I also published this script in a simple management pack with a rule, which will run the script once a day in your management group.  It targets the All Management Servers Resource Pool, so it has high availability and only runs on the single management server that is hosting that object.

 

Get the Management Pack here:

https://gallery.technet.microsoft.com/Management-Pack-to-06730af3
