Tuning tip: Do you have monitors constantly “flip flopping” ?

December 21, 2009, 11:36 am

≫ Next: How to get your agents back to “Remotely Manageable” in OpsMgr 2007 R2

≪ Previous: The new and improved guide on HealthService Restarts. Aka – agents bouncing their own HealthService

This is something I see in almost all clients when we perform a PFE Health Check. The customer will have lots of data being inserted into the OpsDB from agents, about monitors that are constantly changing state. This can have a very negative effect on overall performance of the database – because it can be a lot of data, and the RMS is busy handling the state calculation, and synching this data about the state and any alert changes to the warehouse.

Many times the OpsMgr admin has no idea this is happening, because the alerts appear, and then auto-resolve so fast, you never see them – or don’t see them long enough to detect there is a problem. I have seen databases where the statechangeevent table was the largest in the database – caused by these issues.

Too many state changes are generally caused by one or both, of two issues:

1. Badly written monitors that flip flop constantly. Normally – this happens when you target a multi-instance perf counter incorrectly. See my POST on this topic for more information.

2. HealthService restarts. See my POST on this topic.

How can I detect if this is happening in my environment?

That is the right question! For now – you can run a handful of SQL queries, which will show you the most common state changes going on in your environments. These are listed on my SQL query blog page in the State section:

Noisiest monitors in the database: (Note – these will include old state changes – might not be current)

select distinct top 50 count(sce.StateId) as NumStateChanges, m.MonitorName, mt.typename AS TargetClass
from StateChangeEvent sce with (nolock)
join state s with (nolock) on sce.StateId = s.StateId
join monitor m with (nolock) on s.MonitorId = m.MonitorId
join managedtype mt with (nolock) on m.TargetManagedEntityType = mt.ManagedTypeId
where m.IsUnitMonitor = 1
group by m.MonitorName,mt.typename
order by NumStateChanges desc

The above query will show us which monitors are flipping the most in the entire database. This includes recent, and OLD data. You have to be careful looking at this output – as you might spent a lot of time focusing on a monitor that had a problem long ago. You see – we will only groom out old state changes for monitors that are CURRENTLY in a HEALTHY state, AT THE TIME that grooming runs. We will not groom old state change events if the monitor is Disabled (unmonitored), in Maintenance Mode, Warning State, or Critical State.

What?

This means that if you had a major issue with a monitor in the past, and you solved it by disabling the monitor, we will NEVER, EVER groom that junk out. This doesn't really pose a problem, it just leaves a little database bloat, and messy statechangeevent views in HealthExplorer. But the real issue for me is – it makes it a bit tougher to only look at the problem monitors NOW.

To see if you have really old state change data leftover in your database, you can run the following query:

SELECT DATEDIFF(d, MIN(TimeAdded), GETDATE()) AS [Current] FROM statechangeevent

You might find you have a couple YEARS worth of old state data.

So – I have taken the built in grooming stored procedure, and modified the statement to groom out ALL statechange data, and only keep the number of days you have set in the UI. (The default setting is 7 days). I like to run this “cleanup” script from time to time, to clear out the old data, and whenever I am troubleshooting current issues with monitor flip-flop. Here is the SQL query statement:

To clean up old StateChangeEvent data for state changes that are older than the defined grooming period, such as monitors currently in a disabled, warning, or critical state. By default we only groom monitor statechangeevents where the monitor is enabled and healthy at the time of grooming.

USE [OperationsManager]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
BEGIN

    SET NOCOUNT ON

    DECLARE @Err int
    DECLARE @Ret int
    DECLARE @DaysToKeep tinyint
    DECLARE @GroomingThresholdLocal datetime
    DECLARE @GroomingThresholdUTC datetime
    DECLARE @TimeGroomingRan datetime
    DECLARE @MaxTimeGroomed datetime
    DECLARE @RowCount int
    SET @TimeGroomingRan = getutcdate()

    SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())
    FROM dbo.PartitionAndGroomingSettings
    WHERE ObjectName = 'StateChangeEvent'

    EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
    SET @Err = @@ERROR

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    SET @RowCount = 1

    -- This is to update the settings table
    -- with the max groomed data
    SELECT @MaxTimeGroomed = MAX(TimeGenerated)
    FROM dbo.StateChangeEvent
    WHERE TimeGenerated < @GroomingThresholdUTC

    IF @MaxTimeGroomed IS NULL
        GOTO Success_Exit

    -- Instead of the FK DELETE CASCADE handling the deletion of the rows from
    -- the MJS table, do it explicitly. Performance is much better this way.
    DELETE MJS
    FROM dbo.MonitoringJobStatus MJS
    JOIN dbo.StateChangeEvent SCE
        ON SCE.StateChangeEventId = MJS.StateChangeEventId
    JOIN dbo.State S WITH(NOLOCK)
        ON SCE.[StateId] = S.[StateId]
    WHERE SCE.TimeGenerated < @GroomingThresholdUTC
    AND S.[HealthState] in (0,1,2,3)

    SELECT @Err = @@ERROR
    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    WHILE (@RowCount > 0)
    BEGIN
        -- Delete StateChangeEvents that are older than @GroomingThresholdUTC
        -- We are doing this in chunks in separate transactions on
        -- purpose: to avoid the transaction log to grow too large.
        DELETE TOP (10000) SCE
        FROM dbo.StateChangeEvent SCE
        JOIN dbo.State S WITH(NOLOCK)
            ON SCE.[StateId] = S.[StateId]
        WHERE TimeGenerated < @GroomingThresholdUTC
        AND S.[HealthState] in (0,1,2,3)

        SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

        IF (@Err <> 0)
        BEGIN
            GOTO Error_Exit
        END
    END

    UPDATE dbo.PartitionAndGroomingSettings
    SET GroomingRunTime = @TimeGroomingRan,
        DataGroomedMaxTime = @MaxTimeGroomed
    WHERE ObjectName = 'StateChangeEvent'

    SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END
Success_Exit:
Error_Exit:
END

Once this is cleaned up – you can re-run the DATEDIFF query – and see you should only have the same number of days as set in your UI retention setting for database grooming.

Now – you can run the “Most common state changes” query – and identify which monitors are causing the problem.

Look for monitors at the top with MUCH higher numbers than all others. This will be “monitor flip flop” and you should use Health Explorer to find that monitor on a few instances – and figure out why it is changing state so much in the past few days. Common conditions for this one are badly written monitors that target a single instance object, but monitor a multi-instance perf counter. You can read more on that HERE. Also – just poor overall tuning can cause this – or poorly written custom script based monitors.

If you see a LOT of similar monitors at the top, with very similar state change counts, this is often indicative of HealthService restarts. The Health service will submit new state change data every time it starts up. So if the agent is bouncing every 10 minutes, that is a new state change for ALL monitors on that agent, every 10 minutes. You can read more about this condition at THIS blog post.

↧

How to get your agents back to “Remotely Manageable” in OpsMgr 2007 R2

February 20, 2010, 10:00 am

≫ Next: Bulk enable of agent proxy setting

≪ Previous: Tuning tip: Do you have monitors constantly “flip flopping” ?

You may notice that there are actions you might want to take on an agent, that are grayed out and not available in the console.

There actions might include:

Change Primary Management Server
Repair
Uninstall

See the below image for an example:

This is caused by a flag in the database, which has marked that particular agent as “Not Remotely Manageable”… or “IsManuallyInstalled”.

In order to better see this setting in the UI – you need to personalize the “Agent Managed” view. Right click in the header bar at the top of the “Agent Managed” view, near where it says “Health State” and choose “Personalize View”

In the view options – add the “Remotely Manageable” column:

Now – you can sort by this column and easily find the agents that you have no control over in the console:

***Another thing to note – is if the “Remotely Manageable” flag is set to “No”… we will NOT put those agents into “Pending Management” for a hotfix (when a SCOM hotfix that should also be delivered to agents is applied to a management server). This is by design.

Now…. the question is – WHY are there systems with this flag set to NO?

These MIGHT be unavailable to you for a very good reason…. Basically – for ANY agent that was manually installed – and you ever had to “Approve” the agent – we will set Remotely Manageable to No, by design. The thought process behind this, is that if an agent is manually installed…. we assume it is that way for a reason, and we don't want to *break* anything by controlling it from the UI moving forward.

Here are some examples of manually installed agents that should NOT be controlled in the UI:

AD integrated agents. If you are using Active Directory integration to assign an agent to specific management servers – you don't want to ever break this by changing the management server manually, or running a repair, as this will break AD integration.
Agents behind a firewall, that cannot be repaired… or that only have ports opened to specific management servers. If you had multiple management servers, and only allowed a specific agent access to one of them in a firewall – if you manually changed the MS you could orphan the agent.

Now – for most customers I work with – the two issues above don't apply. If they do – then DON’T change the Remotely Manageable flag!

However – for many customers, the issues above do not apply…. and they end up with a large number of agents that get this flag inadvertently set to “No”. They do not desire this behavior. Here is what can happen to set this flag to No… and when this will be undesirable:

Sometimes you will be troubleshooting a (previously) push installed agent – but will delete the agent under “Agent Managed”… and let it re-detect, and then approve it. SCOM will now treat that agent as manually installed and flag it as such in the database/console.
Sometimes you will have a troublesome agent that will not push deploy for some reason, so you manually install/approve a handful of those.
Sometimes you are having issues getting an agent to work, and in the troubleshooting process, you manually uninstall/reinstall/approve the agent as a quick fix.

In these cases…. we really need a way to “force” this Remotely Manageable flag back to “Yes” when we understand the issue, and know why it got flagged as “No”….. but desire future ability to repair, uninstall, change MS, and put into pending actions for a hotfix down the road.

Unfortunately – the only way to do that today is via a database edit. However, it is relatively simple to do.

Below are the queries to better understand this, and modify your agents. Remember – DON’T do this IF you are using AD integration or have agents that you require to be left alone.

Here is a query just to SEE Agents which are set as manually installed:

select bme.DisplayName from MT_HealthService mths
INNER JOIN BaseManagedEntity bme on bme.BaseManagedEntityId = mths.BaseManagedEntityId
where IsManuallyInstalled = 1

Here is a query that will set ALL agents back to Remotely Manageable:

UPDATE MT_HealthService
SET IsManuallyInstalled=0
WHERE IsManuallyInstalled=1

Now – the above query will set ALL agents back to “Remotely Manageable = Yes” in the console. If you want to control it agent by agent – you need to specify it by name here:

UPDATE MT_HealthService
SET IsManuallyInstalled=0
WHERE IsManuallyInstalled=1
AND BaseManagedEntityId IN
(select BaseManagedEntityID from BaseManagedEntity
where BaseManagedTypeId = 'AB4C891F-3359-3FB6-0704-075FBFE36710'
AND DisplayName = 'agentname.domain.com')

So – I change my agents back to Remotely Manageable…. and refresh my console Agent Managed View…. and voila! I can now modify the MS, repair, uninstall, etc:

↧

Bulk enable of agent proxy setting

November 13, 2007, 3:21 pm

≫ Next: Antivirus Exclusions for MOM and OpsMgr

≪ Previous: How to get your agents back to “Remotely Manageable” in OpsMgr 2007 R2

As you know.... this setting must be enabled for many agents under OpsMgr 2007.

Primarily - for Active Directory, Exchange, Cluster node servers, SMS, and Sharepoint 2007 server agents.

We have a ton of choices out there on how to easily enable this setting for multiple computers at once:

GUI tool: http://blogs.msdn.com/boris_yanushpolsky/archive/2007/08/02/enabling-proxying-for-agents.aspx

Command line tool: http://blogs.technet.com/cliveeastwood/archive/2007/08/30/operations-manager-2007-agent-proxy-command-line-tool-proxycfg.aspx

And two different PowerShell examples: http://www.systemcenterforum.org/downloads/#OperationsManager2007

↧

Antivirus Exclusions for MOM and OpsMgr

December 12, 2007, 12:05 pm

≫ Next: Agent discovery and push troubleshooting in OpsMgr 2007

≪ Previous: Bulk enable of agent proxy setting

Antivirus Exclusions in MOM 2005 and OpsMgr 2007:

Processes:

Excluding by process executable is very dangerous, in that it limits the control of scanning potentially dangerous files handled by the process, because it excludes any and all files involved. For this reason, unless absolutely necessary, we will not exclude any process executables in AV configurations for MOM servers. If you do want to exclude the processes – they are documented below:

MOM 2005 – momhost.exe

OpsMgr 2007 – monitoringhost.exe

Exclusion by Directories:

Real-time, scheduled scanner and local scanner file extension specific exclusions for Operations Manager: The directories listed here are default application directories. You may need to modify these paths based on your client specific designs. Only the following MOM\OpsMgr related directories should be excluded.

Important Note: When a directory to be excluded is greater than 8 characters in length, add both the short and long file names of the directory into the exclusion list. To traverse the sub-directories, this is required by some AV programs.

SQL Database Servers:

These include the SQL Server database files used by Operations Manager components as well as system database files for the master database and tempdb. To exclude these by directory, exclude the directory for the LDF and MDF files:

Examples:

C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data

D:\MSSQL\DATA

E:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Log

MOM 2005 (management servers and agents):

These include the queue and log files used by Operations Manager.

Example:

C:\Documents and Settings\All Users\Application Data\Microsoft\Microsoft Operations Manager\

OpsMgr 2007 (management servers and agents):

These include the queue and log files used by Operations Manager.

Example:

C:\Program Files\System Center Operations Manager 2007\Health Service State\Health Service Store

Exclusion of File Type by Extensions:

Real-time, scheduled scanner and local scanner file extension specific exclusions for Operations Manager:

SQL Database Servers:

These include the SQL Server database files used by Operations Manager components as well as system database files for the master database and tempdb.

Examples:

MDF, LDF

MOM 2005 (management servers and agents):

These include the queue and log files used by Operations Manager.

Example:

WKF, PQF, PQF0, PQF1

OpsMgr 2007 (management servers and agents):

These include the queue and log files used by Operations Manager.

Example:

EDB, CHK, LOG.

Notes:

Page files should also be excluded from any real time scanning.

↧

Agent discovery and push troubleshooting in OpsMgr 2007

December 12, 2007, 12:36 pm

≫ Next: Installing the OpsMgr 2007 agent on an ISA 2004 or ISA 2006 server

≪ Previous: Antivirus Exclusions for MOM and OpsMgr

OpsMgr 2007 Agent troubleshooting:

There is a GREAT graphical display of the Agent discovery and push process, taken from:

http://blogs.technet.com/momteam/archive/2007/12/10/how-does-computer-discovery-work-in-opsmgr-2007.aspx

Agent Prerequisites:

Supported Operating System Version (see below)
Windows Installer 3.1
MSXML 6 Parser

Agent push requirements (including firewall ports):

The account being used to push the agent must have local admin rights on the targeted agent machine.
The following ports must be open:

RPC endpoint mapper Port number: 135 Protocol: TCP/UDP
*RPC/DCOM High ports (2000/2003 OS) Ports 1024-5000 Protocol: TCP/UDP
*RPC/DCOM High ports (2008 OS) Ports 49152-65535 Protocol: TCP/UDP
NetBIOS name service Port number: 137 Protocol: TCP/UDP
NetBIOS session service Port number: 139 Protocol: TCP/UDP
SMB over IP Port number: 445 Protocol: TCP
MOM Channel Port number: 5723 Protocol: TCP/UDP

The following services must be set:

Display Name: Netlogon Started Auto Running
**Display Name: Remote Registry Started Auto Running
Display Name: Windows Installer Started Manual Running
Display Name: Automatic Updates Started Auto Running

*The RPC/DCOM High ports are required for RPC communications. This is generally why we don't recommend/support agent push in a heavily firewalled environment, because opening these port ranges creates a potential security issue that negates the firewall boundary. For more information:

http://support.microsoft.com/kb/154596/

http://support.microsoft.com/default.aspx?scid=kb;EN-US;929851

Important: Don’t change the RPC high ports without have an deep understanding of your environment and the potential impact !!!

**Not required for agent push, but required for some management packs.

The remote management server must be able to connect to the remote agent machine via WMI and execute WMI Query "Select * from Win32_OperatingSystem". WMI must be running, and healthy, and allowing remote connections.
The management server must be able to connect to the targeted agent machine via \\servername\c$

Logging:

When pushing an agent from a management server, a log will be written in the event of a failure to: \Program Files\System Center OpsMgr\AgentManagement\AgentLogs\ on the Management Server.
The log on an agent is not enabled by default (like MOM 2005) when using agent push. If you manually install an agent using the MSI – it will place a verbose logfile at C:\documents and settings\%user%\local settings\temp\momagent.log

To troubleshoot agent push with a verbose log – you need to enable verbose MSI logging: http://support.microsoft.com/kb/314852/en-us

Common Agent Push errors:

Below are some common push failures. Also see my troubleshooting table for more details: Console based Agent Deployment Troubleshooting table

The MOM Server detected that the following services on computer "(null);NetLogon" are not running. These services are required for push agent installation. To complete this operation, either start the required services on the computer or install the MOM agent manually by using MOMAgent.msi located on the product CD. Operation: Agent Install
Remote Computer Name: dc1.opsmgr.net
Install account: OPSMGR\localadmin
Error Code: C000296E
Error Description: Unknown error 0xC000296E

Solution: Netlogon service is not running. It must be set to auto/started

The MOM Server detected that the Windows Installer service (MSIServer) is disabled on computer "dc1.opsmgr.net". This service is required for push agent installation. To complete this operation on the computer, either set the MSIServer startup type to "Manual" or "Automatic", or install the MOM agent manually by using MOMAgent.msi located on the product CD.

Operation: Agent Install
Install account: OPSMGR\localadmin
Error Code: C0002976
Error Description: Unknown error 0xC0002976

Solution: Windows Installer service is not running or set to disabled – set this to manual or auto and start it.

The Agent Management Operation Agent Install failed for remote computer dc1.opsmgr.net.
Install account: OPSMGR\localadmin
Error Code: 80070643
Error Description: Fatal error during installation.
Microsoft Installer Error Description:
For more information, see Windows Installer log file "C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\DC1AgentInstall.LOG
C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\DC1MOMAgentMgmt.log" on the Management Server.

Solution: Enable the automatic Updates service…. Install the agent – then disable the auto-updates service if desired.

Additional Info:

There are sub-components to the OpsMgr Agent installer service

1. The service is a standard NT Service. The service also handles registration/un-registration of DCOM object that has logic for handling MSI/MSP.

2. The DCOM object takes directive from the module on OpsMgr Server, this object provides asynchronously installing/uninstalling/updating OpsMgr. It also returns list of currently installed QFEs, verifies pre-requisites like channel connectivity before completing agent install. It handles multi-homing of agent, and reads agent parameters such as version, install dir, etc.

3. RPC is used to establish a connection to the target machine, SMB is used to copy the source files over.

4. WMI is used to check prerequisites.

Agents Inside a Trust Boundary

Discovery:
Discovery requires that the TCP 135 (RPC), RPC range, and TCP 445 (SMB) ports remain open and that the SMB service is enabled.

Installation:
After a target device has been discovered, an agent can be deployed to it. Agent installation requires:

Opening Remote procedure call (RPC) ports beginning with endpoint mapper TCP 135 and the Server Message Block (SMB) port TCP/UDP 445.
Enabling the File and Printer Sharing for Microsoft Networks and the Client for Microsoft Networks services (this ensures that the SMB port is active).
If enabled, Windows Firewall Group Policy settings for Allow remote administration exception and Allow file and printer sharing exception must be set to Allow unsolicited incoming messages from: to the IP address and subnets for the primary and secondary Management Servers for the agent. For more information, see How to Configure the
Windows Firewall to Enable Management of Windows-Based Computers from the Operations Manager 2007 Operations Console.
An account that has local administrator rights on the target computer.
Windows Installer 3.1. To install, see article 893803 in the Microsoft Knowledge Base (http://go.microsoft.com/fwlink/?LinkId=86322).
Microsoft Core XML services (MSXML) 6 on the Operations Manager product installation media in the \msxml sub directory.

Ongoing Management:
Ongoing management of an agent requires that the TCP 135 (RPC), RPC range, and TCP 445 (SMB) ports remain open and that the SMB service remains enabled.

Supported Operating systems for an Agent:

See: Operations Manager 2007 R2 Supported Configurations

↧

Installing the OpsMgr 2007 agent on an ISA 2004 or ISA 2006 server

February 11, 2008, 1:16 pm

≫ Next: Event ID 2115 A Bind Data Source in Management Group

≪ Previous: Agent discovery and push troubleshooting in OpsMgr 2007

When you want to manage and monitor an ISA server, you need to install the OpsMgr agent.

However, there is no guide published for the OpsMgr 2007 ISA MP..... It comes with the MOM 2005 guide. In ISA, there was a system policy which you could enable for MOM. This would open the necessary ports for the MOM agent to communicate with a management server. However, these ports have changed, yet there seems to be no guidance on how to manage an ISA box with SCOM.

I will document the steps necessary:

When you install an OpsMgr agent on a ISA server, you will see in the event log the following event, when the agent starts:

------------------

Event Type:    Error
Event Source:    OpsMgr Connector
Event Category:    None
Event ID:    21006
Date:        2/11/2008
Time:        11:05:36 AM
User:        N/A
Computer:    ISA
Description:
The OpsMgr Connector could not connect to OMRMS:5723. The error code is 10065L(A socket operation was attempted to an unreachable host.
). Please verify there is network connectivity, the server is running and has registered it's listening port, and there are no firewalls blocking traffic to the destination.

-------------------

There is the problem! We DO have a firewall blocking the traffic.

The OpsMgr agent needs to be able to communicate, outbound, to a management server over TCP_5723. We will use this for all communications, including heartbeats. Therefore, we need an access rule to allow this traffic:

1. Create a new access rule. Give it a name according to your corporate ISA rule naming standards. Click Next:

2. Choose "Allow" and click Next.

3. On Protocols, click "Add", then "New" then "Protocol". Give this new protocol a new, such as "OpsMgr Agent tcp_5723"

4. On the "New Protocol Definition Wizard" screen, click "New" and fill out the boxes. We want TCP, Outbound, and port 5723. Then click OK.

5. Click Next, No secondary connections, and then click Finish.

6. Find your new protocol under User Defined, and click "Add", then Close, then Next.

7. On the access rule sources - we want FROM "Localhost", which is located under the "Networks" object:

8. On the "Access Rule Destinations" - we want the IP addresses of all possible OpsMgr management servers/gateways that this ISA will report, or fail over to. For this example, I am using the "Internal" network object, which includes all internally defined IP subnets:

Accepts the default settings for "All Users", click Next, Finish, then apply this new rule to the firewall configuration.

You should no longer see an Event ID 21006, after bouncing the Healthservice on the ISA server. However, in order to support mutual authentication, you might still need to configure Certificates, or rules allowing AD communications if the ISA server is a member of the same forest as the OpsMgr servers.

↧

Event ID 2115 A Bind Data Source in Management Group

April 21, 2008, 6:12 pm

≫ Next: How do I know which hotfixes have been applied to which agents?

≪ Previous: Installing the OpsMgr 2007 agent on an ISA 2004 or ISA 2006 server

I see this event a lot in customer environments. I am not an expert on troubleshooting this here... but saw this post in the MS newsgroups and felt it was worth capturing....

My experience has been that it is MUCH more common to see these when there is a management pack that collects way too much discovery data.... than any real performance problem with the data warehouse. In most cases.... if the issue just started after bringing in a new MP.... deleting that MP solves the problem. I have seen this repeatedly after importing the Cluster MP, Or Exchange 2007 MP.... but haven't been able to fully investigate the root cause yet:

In a nutshell.... if they are happening just a couple times an hour.... and the time in seconds is fairly low (under a few minutes) then this is normal.

If they are happening very frequently - like every minute, and the times are increasing - then there is an issue that needs to be resolved.

Taken from the newsgroups:

-------------------------------------------

In OpsMgr 2007 one of the performance concerns is DB/DW data insertion performance. Here is a description of how to identify and troubleshoot problems with DB/DW data insertion.

Symptoms:

DB/DW write action workflows run on a Management Server, they first keep data received from Agent / Gateway in an internal buffer, then they create a batch of data from the buffer and insert the data batch to DB / DW, when the insertion of the first batch finished, they will create another batch and insert it to DB / DW. The size of the batch depends on how much data is available in the buffer when the batch is created, but there is a maximum limit on the size of the batch, a batch can contain up to 5000 data items. If data item incoming (from Agent / Gateway) throughput becomes larger, or the data item insertion (to DB/DW) throughput becomes smaller, then the buffer will tend to accumulate more data and the batch size will tend to become larger. There are different write action workflows running on a MS, they handle data insertion to DB / DW for different type of data:

Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
Microsoft.SystemCenter.DataWarehouse.CollectEventData
Microsoft.SystemCenter.CollectAlerts
Microsoft.SystemCenter.CollectEntityState
Microsoft.SystemCenter.CollectPublishedEntityState
Microsoft.SystemCenter.CollectDiscoveryData
Microsoft.SystemCenter.CollectSignatureData
Microsoft.SystemCenter.CollectEventData

When a DB/DW write action workflow on Management Server notices that the insertion of a single data batch is slow (ie. slower than 1 minute), it will start to log a 2115 NT event to OpsMgr NT event log once every minute until the batch is inserted to DB / DW or is dropped by DB / DW write action module. So you will see 2115 events in management server's "Operations Manager" NT event log when it is slow to insert data to DB /DW. You might also see 2115 events when there is a burst of data items coming to
Management server and the number of data items in a batch is large. (This can happen during a large amount of discovery data being inserted - from a freshly imported or noisy management pack.)

2115 events have 2 import pieces of information: the name of the workflow that has insertion problem, and the pending time since the workflow started inserting last data batch. Here is an example of a 2115 event:

------------------------------------

A Bind Data Source in Management Group OpsMgr07PREMT01 has posted items to the workflow, but has not received a response in 3600 seconds. This indicates a performance or functional problem with the workflow.

Workflow Id : Microsoft.SystemCenter.CollectSignatureData

Instance : MOMPREMSMT02.redmond.corp.microsoft.com

Instance Id : {6D52A6BB-9535-9136-0EF2-128511F264C4}

------------------------------------------

This 2115 event is saying DB write action workflow "Microsoft.SystemCenter.CollectSignatureData" (which writes performance
signature data to DB) is trying to insert a batch of signature data to DB and it started inserting 3600 seconds ago but the insertion has not finished yet. Normally inserting of a batch should finish within 1 minutes.

Normally, there should not be much 2115 events happening on Management server, if it happens less than 1 or 2 times every hour (per write action workflow), then it is not a big concern, but if it happens more than that, there is a DB /DW insertion problem.

The following performance counters on Management Server gives information of DB / DW write action insertion batch size and insertion time, if batch size is becoming larger (by default maximum batch size is 5000), it means management server is either slow in inserting data to DB/DW or is getting a burst of data items from Agent/Gateway. From the DB / DW write action's Avg. Processing Time, you will see how much time it takes to write a batch of data to DB / DW.

OpsMgr DB Write Action Modules(*)\Avg. Batch Size
OpsMgr DB Write Action Modules(*)\Avg. Processing Time
OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms
OpsMgr DW Writer Module(*)\Avg. Batch Size

Possible root causes:

In OpsMgr, discovery data insertion is relatively expensive, so a discovery burst (a discovery burst is referring to a short period of time when a lot of discovery data is received by management server) could cause 2115 event (complaining about slow insertion of discovery data), since discovery insertion should not happen frequently. So if you see consistently 2115 events for discovery data collection. That means you either have DB /DW insertion problem or some discovery rules in a MP is collecting too much
discovery data.
OpsMgr Config update caused by instance space change or MP import will impact the CPU utilization on DB and will have impact on DB data insertion. After importing a new MP or after a big instance space change in a large environment, you will probably see more than normal 2115 events.
Expensive UI queries can impact the resource utilization on DB and could have impact on DB data insertion. When user is doing expensive UI operation, you will probably see more than normal 2115 events.
When DB / DW is out of space / offline you will find Management server keeps logging 2115 events to NT event log and the pending time is becoming higher and higher.
Sometimes invalid data item sent from agent /Gateway will cause DB / DW insertion error which will end up with 2115 event complaining about DB /DW slow insertion. In this case please check the OpsMgr event log for relevant error events. It's more common in DW write action workflows.
If DB / DW hardware is not configured properly, there could be performance issue, and it could cause slow data insertion to DB / DW. The problem could be:
- The network link between DB / DW to MS is slow (either bandwidth is small / latency is large, as a best practice we recommend MS to be in the same LAN as DB/DW).
- The data / log / tempdb disk used by DB / DW is slow, (we recommend separating data, log and tempdb to different disks, we recommend using RAID 10 instead of using RAID 5, we also recommend turning on write cache of the array controllers).
- The OpsDB tables are too fragmented (this is a common cause of DB performance issues). Reindex affected tables will solve this issue.
- The DB / DW does not have enough memory.

Now - that is the GENERAL synopsis and how to attack them. Next - we will cover a specific issue we are seeing with a specific type of 2115 Event:

-----------------------------------------------

It appears we may be hitting cache resolution error we were trying to catch for a while. This is about CollectEventData workflow. Error is very hard to catch and we're including a fix in SP2 to avoid it. There are two ways to resolve the problem in the meantime. Since the error happens very rarely, you can just restart Health Service on the Management Server that is affected. Or you can prevent it from blocking the workflow by creating overrides in the following way:

-----------------------------------------------

1) Launch Console, switch to Authoring space and click "Rules"
2) In the right top hand side of the screen click "Change Scope"
3) Select "Data Warehouse Connection Server" in the list of types,. click "Ok"
4) Find "Event data collector" rule in the list of rules;
5) Right click "Event data collector" rule, select Overrides/Override the Rule/For all objects of type...
6) Set Max Execution Attempt Count to 10
7) Set Execution Attempt Timeout Interval Seconds to 6

That way if DW event writer fails to process event batch for ~ a minute, it will discard the batch. 2115 events related to
Datawarehouse.CollectEventData should go away after you apply these overrides. BTW, while you're at it you may want to override "Max Batches To Process Before Maintenance Count" to 50 if you have a relatively large environment. We think 50 is better default setting then SP1's 20 in this case and we'll switch default to 50 in SP2.

-------------------------------------------------

Essentially - to know if you are affected by the specific 2115 issue describe above - here is the criteria:

1. You are seeing 2115 bind events in the OpsMgr event log of the RMS or MS, and they are recurring every minute.

2. The events have a Workflow ID of: Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData

3. The "has not received a response" time is increasing, and growing to be a very large number over time.

Here is an example of a MS with the problem: Note consecutive events, from the CollectEventData workflow, occurring every minute, with the time being a large number and increasing:

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:37:06 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706594 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:36:05 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706533 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:35:03 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706471 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

↧

How do I know which hotfixes have been applied to which agents?

June 24, 2008, 6:42 pm

≫ Next: Agent Proxy alerts - finding the right machine to enable agent proxy on using a custom report

≪ Previous: Event ID 2115 A Bind Data Source in Management Group

***UPDATE*** A new hotfix has been released, which is a simple updated management pack.... which fixes the Patchlist table to include all hotfixes, and cleans up the formatting. I recommend you get it and install it on your SP1 environments.

http://support.microsoft.com/kb/958253

-------------------------------------------------------------------------------------

As more hot-fixes are applied to our OpsMgr 2007 SP1 environments.... how can we know which hot-fixes have been applied to our agents? How can we detect an agent that needs patching but got missed?

In MOM 2005... this was rather simple... in the Admin console, under Agent-managed Computers, there was a column called "version" which incremented the agent version number in most cases.

In OpsMgr... we do not update this field in the Administration tab. See graphic: The version here shows the major version number... like RTM 6.0.6500, SP1 6.0.6278.... etc....

So.... how do we examine this now for minor updates?

Create a new State view. Call it "Custom - Agent Patch List" or something you like. Target "Health Service" for "Show Data Related To". You can filter it further to the "Agent Managed Computer Group".

Then - personalize this view, and show the columns for "Name" and "Patch List" See graphic:

Now.... the "Patch List" column isn't super user friendly - because of the amount of text in the single column.... but it will let you see what has been installed. For instance - here is an example of KB950853 installed:

To make this a bit easier.... I wrote the following SQL query which does essentially the same thing.... you can create a web based SQL report from this and the data will be much easier to manage in Excel:

select bme.path AS 'Agent Name', hs.patchlist AS 'Patch List' from MT_HealthService hs
inner join BaseManagedEntity bme on hs.BaseManagedEntityId = bme.BaseManagedEntityId
order by path

If you want to query for all agents missing a specific hot-fix... you could run a query like this.... just change the KB number below (thanks to Brad Turner for providing the idea):

select bme.path AS 'Agent Name', hs.patchlist AS 'Patch List' from MT_HealthService hs
inner join BaseManagedEntity bme on hs.BaseManagedEntityId = bme.BaseManagedEntityId
where hs.patchlist not like '%951380%'
order by path

I have noticed, however, that this field, "Patch List" is limited to 255 characters in the database.... which I imagine will run out of space fairly soon. I will also be interested to see how we handle this table column, once SP2 comes out.... as any pre-SP2 applied hotfixes will no longer apply.

The Patch List information is discovered and updated once per day across all agents in the management group.

For a report which shows you the same information, but lets you query for all agent missing a specific hotfix - check out my more recent post with the report download:

http://blogs.technet.com/kevinholman/archive/2008/06/27/a-report-to-show-all-agents-missing-a-specific-hotfix.aspx

↧

Agent Proxy alerts - finding the right machine to enable agent proxy on using a custom report

June 27, 2008, 5:04 pm

≫ Next: Which servers are DOWN in my company, and which just have a heartbeat failure, RIGHT NOW?

≪ Previous: How do I know which hotfixes have been applied to which agents?

Certain types of agents need the agent proxy setting enabled. These are documented in various guides... such as Exchange Active Directory, Cluster nodes, etc...

However, sometimes, we still get alerts that Agent Proxy needs to be enabled for a HealthService. The problem is... the Alert often doesn't tell us which agent needs this enabled!!!

Here is an example alert:

The alert context tab is telling us about something called "SQLCLUSTER".... but I know that is a virtual cluster instance name... not the name of a real agent.

Marius blogged about a SQL query... that will help us find the agent that needs this turned on:

http://blogs.msdn.com/mariussutara/archive/2007/11/09/agent-proxying-alert.aspx

The sql query I like to use is:

select DisplayName, Path, basemanagedentityid from basemanagedentity where basemanagedentityid = 'guid'

Where "guid" = the GUID of the healthservice in the alert.

So in my example above - the part in bold is the GUID we need..... the HealthService that is causing the problem....

Health service ( 4F6BCCD4-2A41-1C39-DC50-5CE6CA10E0D3 ) should not generate data about this managed object ( A6D9CC33-3EF7-00BF-3E78-B368B32F1486 ).

If we drop this into the query.... it will look like so:

select DisplayName, Path, basemanagedentityid from basemanagedentity where basemanagedentityid = '4F6BCCD4-2A41-1C39-DC50-5CE6CA10E0D3'

Which when run in a SQL query returns:

DisplayName	Path	basemanagedentityid
sqlnode2.opsmgr.net	sqlnode2.opsmgr.net	4F6BCCD4-2A41-1C39-DC50-5CE6CA10E0D3

Aha! So - SQLNODE2 needs agent proxy enabled.

Well.... how instead of all this dropping to SQL.... we create a report with the query above, and have an input parameter where you can just paste in the GUID from the alert? I have created just that, and you can download it below!

Please see my previous post on creating a data source for the OpsDB here: Creating a new data source for reporting against the Operational Database You will need to do that first in order to see this report.... since it runs against the Operational Database, not the Data Warehouse.

Simply download this RDL file, then browse to your reporting website (http://reportingservername/reports) browse to your new custom folder for reports, and choose "Upload File". Your new report is uploaded, and you should be able to see it in the Ops Console under Reporting now:

Open the report. The report just needs you to paste in the healthservice GUID you saw in the Agent Proxy alert:

This should make it a bit easier and faster to tackle these types of alerts in the future.... for the few agents you are missing this setting on.

A simpler way to do this without running a report.... is to use "Discovered Inventory" in the monitoring console.

Select "Change Target Type" from the actions pane, and then choose "View All Targets" and then type "Health Service Watcher" in the "Look For" box. Select the Health Service Watcher Class and click OK.

Now paste your GUID into the "FIND" box and click OK. Make sure you dont have any trainling spaces in the GUID:

The Report download is here:

↧

Which servers are DOWN in my company, and which just have a heartbeat failure, RIGHT NOW?

June 27, 2008, 5:39 pm

≫ Next: A report to show all agents missing a specific hotfix

≪ Previous: Agent Proxy alerts - finding the right machine to enable agent proxy on using a custom report

In OpsMgr 2007, when a agent experiences a heartbeat failure, several things happen. There are diagnostics, and possibly recoveries that are run. Alerts, and possibly notifications go out.

But what happens if my Operations team misses on of these alerts? What can I do to "spot check" agents with issues?

Well, any time an agent has a heartbeat failure, we gray out the state icon of the agents last known state for in each state view.

However - you CAN create a State view that will turn Red or Yellow just like any other state views. Simply create a new State View, and scope the class to Health Service Watcher (Agent).

I called mine Heartbeat State View:

This view will show us when any of the agent health service watcher monitors are unhealthy: In my case - OWA and EXCH1 have issues. OWA is DOWN, while EXCH1 agent healthservice is stopped.

However - here is the issue. This view shows us when ANY monitor rolls up unhealthy state.... this includes heartbeat failures AND computer unreachable (server IP stack is down):

What if I want a State View - to ONLY show me computers that are DOWN.... as in... not heartbeating AND not responding to any PING? Most customers consider this their "most critical situation". Well, I haven't found an easy way to do that.... so I wrote a report which handles it. This report will query the OpsDB for the state of the "Computer Not Reachable" monitor, and only display those servers. It is based on the following query:

SELECT bme.DisplayName, s.LastModified as LastModifiedUTC, dateadd(hh,-5,s.LastModified) as 'LastModifiedCST (GMT-5)'
FROM state AS s, BaseManagedEntity as bme
WHERE s.basemanagedentityid = bme.basemanagedentityid AND s.monitorid
IN (SELECT MonitorId FROM Monitor WHERE MonitorName = 'Microsoft.SystemCenter.HealthService.ComputerDown')
AND s.Healthstate = '3' AND bme.IsDeleted = '0'
ORDER BY s.Lastmodified DESC

You can import this report if you have created a data source as shown in my previous post:

http://blogs.technet.com/kevinholman/archive/2008/06/27/creating-a-new-data-source-for-reporting-against-the-operational-database.aspx

Import this report into your custom folder... and run it. You can schedule it to receive it first thing every day... if you like the output:

***** Update 6-30-08 I removed a section of the original query relating to maintenance mode. We found that if a down server had never been in maintenance mode, the server would not show up in the report. The query and report download have been updated to address this.

Report is attached below:

↧

A report to show all agents missing a specific hotfix

June 27, 2008, 6:11 pm

≫ Next: Agent Pending Actions can get out of synch between the Console, and the database

≪ Previous: Which servers are DOWN in my company, and which just have a heartbeat failure, RIGHT NOW?

This is a continuation of my previous post on determining which agents are missing a hot-fix:

How do I know which hotfixes have been applied to which agents-

I wrote up a report that allows you to paste in a KB article number into the report as a parameter, and then it will show all agents that are potentially missing that hotfix. This will help you easily find agent which need to be patched and got missed for some reason.

You can run this report if you create the SQL reporting data source as specified in my previous post:

Creating a new data source for reporting against the Operational Database

Once imported - it will show up in the console. Open the report, and paste in any KB article number for a OpsMgr hotfix you have applied. The number MUST begin and end with "%".... such as %951380% as shown:

The report is attached below:

↧

Agent Pending Actions can get out of synch between the Console, and the database

September 29, 2008, 2:57 pm

≫ Next: Console based Agent Deployment Troubleshooting table

≪ Previous: A report to show all agents missing a specific hotfix

When you look at your agent pending actions in the Administration pane of the console.... you will see pending actions for things like approving a manual agent install, agent installation in progress, approving agent updates, like from a hotfix, etc.

This pending action information is also contained in the SQL table in the OpsDB - agentpendingaction

It is possible for the agentpendingaction table to get out of synch with the console, for instance, if the server was in the middle of updating/installing an agent - and the management server Healthservice process crashed or was killed.

In this case, you might have a lingering pending action, that blocks you from doing something in the future. For instance - if you had a pending action to install an agent, that did not show up in the pending actions view of the console. What might happen, is that when you attempt to discover and push the agent to this same server, you get an error message:

"One or more computers you are trying to manage are already in the process of being managed. Please resolve these issues via the Pending Management view in Administration, prior to attempting to manage them again"

The problem is - they don't show up in this view!

To view the database information on pending actions:

select * from agentpendingaction

You should be able to find your pending action there - that does not show up in the Pending Action view in the console, if you are affected by this.

To resolve - we should first try and reject these "ghost" pending actions via the SDK... using powershell. Open a command shell, and run the following:

get-agentpendingaction

To see a prettier view:

get-agentpendingaction | ft agentname,agentpendingactiontype

To see a specific pending action for a specific agent:

get-agentPendingAction | where {$_.AgentName -eq "servername.domain.com"}

To reject the specific pending action:

get-agentPendingAction | where {$_.AgentName -eq "servername.domain.com"}|Reject-agentPendingAction

We can use the last line - to reject the specific pending action we are interested in.

You might get an exception running this:

Reject-AgentPendingAction : Microsoft.EnterpriseManagement.Common.UnknownServiceE
xception: The service threw an unknown exception. See inner exception for details
. ---> System.ServiceModel.FaultException`1[System.ServiceModel.ExceptionDetail]:
Exception of type 'Microsoft.EnterpriseManagement.Common.DataItemDoesNotExistExc
eption' was thrown.

If this fails, such as gives an exception, or if our problem pending action doesn't even show up in Powershell.... we have to drop down to the SQL database level. This is a LAST resort and NOT SUPPORTED.... run at your own risk.

There is a stored procedure to delete pending actions.... here is an example, to run in a SQL query window:

exec p_AgentPendingActionDeleteByAgentName 'agentname.domain.com'

Change 'agentname.domain.com' to the agent name that is showing up in the SQL table, but not in the console view.

↧

Console based Agent Deployment Troubleshooting table

January 27, 2009, 9:35 am

≫ Next: Getting and keeping the SCOM agent on a Domain Controller – how do YOU do it?

≪ Previous: Agent Pending Actions can get out of synch between the Console, and the database

This post is a list of common agent push deployment errors… and some possible remediation options.

Most common errors while pushing an agent:

Error	Error Code(s)	Remediation Steps
The MOM Server could not execute WMI Query "Select * from Win32_Environment where NAME='PROCESSOR_ARCHITECTURE'" on computer server.domain.com Operation: Agent Install Install account: domain\account Error Code: 80004005 Error Description: Unspecified error	80004005	1. Check the PATH environment variable. If the PATH statement is very long, due to lots of installed third party software - this can fail. Reduce the path by converting any long filename destinations to 8.3, and remove any path statements that are not necessary. Or apply hotfix: http://support.microsoft.com/?id=969572 2. The cause could be corrupted Performance Counters on the target Agent. To rebuild all Performance counters including extensible and third party counters in Windows Server 2003, type the following commands at a command prompt. Press ENTER after each command. cd \windows\system32 lodctr /R Note /R is uppercase. Windows Server 2003 rebuilds all the counters because it reads all the .ini files in the C:\Windows\inf\009 folder for the English operating system. How to manually rebuild Performance Counter Library values http://support.microsoft.com/kb/300956 3. Manual agent install.
The MOM Server could not execute WMI Query "Select * from Win32_OperatingSystem" on computer “servername.domain.com” Operation: Agent Install Install account: DOMAIN\account Error Code: 800706BA Error Description: The RPC server is unavailable. The MOM Server could not execute WMI Query "(null)” on computer “servername.domain.com” Operation: Agent Install Install account: DOMAIN\account Error Code: 800706BA Error Description: The RPC server is unavailable.	8004100A 800706BA	1. Ensure agent push account has local admin rights 2. Firewall is blocking NetBIOS access. If Windows 2008 firewall is enabled, ensure “Remote Administration (RPC)” rule is enabled/allowed. We need port 135 (RPC) and the DCOM port range opened for console push through a firewall. 3. Inspect WMI service, health, and rebuild repository if necessary 4. Firewall is blocking ICMP (Live OneCare) 5. DNS incorrect
The MOM Server failed to open service control manager on computer "servername.domain.com". Access is Denied Operation: Agent Install Install account: DomainName\User Account Error Code: 80070005 Error Description: Access is denied.	80070005 80041002	1. Verify SCOM agent push account is in Local Administrators group on target computer. 2. On Domain controllers will have to work with AD team to install agent manually if agent push account is not a domain admin. 3. Disable McAfee antivirus during push
The MOM Server failed to open service control manager on computer "servername.domain.com". Therefore, the MOM Server cannot complete configuration of agent on the computer. Operation: Agent Install Install account: DOMAIN\account Error Code: 800706BA Error Description: The RPC server is unavailable.	800706BA	1. Firewall blocking NetBIOS ports 2. DNS resolution issue. Make sure the agent can ping the MS by NetBIOS and FQDN. Make sure the MS can ping the agent by NetBIOS and FQDN 3. Firewall blocking ICMP 4. RPC services stopped.
The MOM Server failed to acquire lock to remote computer servername.domain.com. This means there is already an agent management operation proceeding on this computer, please retry the Push Agent operation after some time. Operation: Agent Install Install account: DOMAIN\account Error Code: 80072971 Error description: Unknown error 0x80072971	80072971	This problem occurs if the LockFileTime.txt file is located in the following folder on the remote computer: %windir%\422C3AB1-32E0-4411-BF66-A84FEEFCC8E2 When you install or remove a management agent, the Operations Manager 2007 management server copies temporary files to the remote computer. One of these files is named LockFileTime.txt. This lock file is intended to prevent another management server from performing a management agent installation at the same time as the current installation. If the management agent installation is unsuccessful and if the management server loses connectivity with the remote computer, the temporary files may not be removed. Therefore, the LockFileTime.txt may remain in the folder on the remote computer. When the management server next tries to perform an agent installation, the management server detects the lock file. Therefore, the management agent installation is unsuccessful. http://support.microsoft.com/kb/934760/en-us
The MOM Server detected that the following services on computer "(null);NetLogon" are not running. These services are required for push agent installation. To complete this operation, either start the required services on the computer or install the MOM agent manually by using MOMAgent.msi located on the product CD. Operation: Agent Install Remote Computer Name: servername.domain.com Install account: DOMAIN\account Error Code: C000296E Error Description: Unknown error 0xC000296E	C000296E	1. Netlogon service is not running. It must be set to auto/started
The MOM Server detected that the following services on computer "winmgmt;(null)" are not running	C000296E	1. WMI services not running or WMI corrupt
The MOM Server detected that the Windows Installer service (MSIServer) is disabled on computer "servername.domain.com". This service is required for push agent installation. To complete this operation on the computer, either set the MSIServer startup type to "Manual" or "Automatic", or install the MOM agent manually by using MOMAgent.msi located on the product CD. Operation: Agent Install Install account: DOMAIN\account Error Code: C0002976 Error Description: Unknown error 0xC0002976	C0002976	1. Windows Installer service is not running or set to disabled – set this to manual or auto and start it.
The Agent Management Operation Agent Install failed for remote computer servername.domain.com. Install account: DOMAIN\account Error Code: 80070643 Error Description: Fatal error during installation. Microsoft Installer Error Description: For more information, see Windows Installer log file "C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\servernameAgentInstall.LOG C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\servernameMOMAgentMgmt.log" on the Management Server.	80070643	1. Enable the automatic Updates service…. Install the agent – then disable the auto-updates service if desired.
Call was canceled by the message filter	80010002	Install latest SP and retry. One server that failed did not have Service pack installed
The MOM Server could not find directory \\I.P.\C$\WINDOWS\. Agent will not be installed on computer "name". Please verify the required share exists.	80070006	1. Manual agent install Possible locking on registry? http://www.sysadmintales.com/category/operations-manager/ Try manual install. Verified share does not exist.
The network path was not found.	80070035	1. Manual agent install
The Agent Management Operation Agent Install failed for remote computer "name". There is not enough space on the disk.	80070070	1. Free space on install disk
The MOM Server failed to perform specified operation on computer "name". The semaphore timeout period has expired.	80070079	NSlookup failed on server. Possible DNS resolution issue. Try adding dnsname to dnssuffix search list.
The MOM Server could not start the MOMAgentInstaller service on computer "name" in the time.	8007041D 80070102	NSlookup failed on server. Possible DNS resolution issue. Verify domain is in suffix search list on management servers. Sometimes – the Windows Firewall service – even if disabled – will have a stuck rule. Run: (netsh advfirewall firewall delete rule name=”MOM Agent Installer Service”)
The Agent Management Operation Agent Install failed for remote computer "name"	80070643	1. Ensure automatic updates service is started 2. Rebuild WMI repository 3. DNS resolution issue
The Agent Management Operation Agent Install failed for remote computer "name". Another installation is already in progress.	80070652	Verify not in pending management. If yes, remove and then attempt installation again.
The MOM Server detected that computer "name" has an unsupported operating system or service pack version	80072977	Install latest SP and verify you are installing to Windows system.
Not discovered		Agent machine is not a member of domain
Ping fails		1. Server is down 2. Server is blocked by firewall 3. DNS resolving to wrong IP.
Fail to resolve machine		1. DNS issue
The MOM Server failed to perform specified operation on computer "name". Not enough server storage…	8007046A	1. This is typically a memory error caused by the remote OS that the agent is being installed on.
There are currently no logon servers available to service the logon request.	8007051F	1. Possible DNS issue
This installation package cannot be installed by the Windows Installer service. You must install a Windows service pack that contains a newer version of the Windows Installer service.	8007064D	1. Install Windows Installer 3.1
The network address is invalid	800706AB	Possible DNS name resolution issue. Tried nslookup on server name and did not get response. Verify domain is in suffix search list on management servers.
The MOM Server failed to perform specified operation on computer servername.domain.com	80070040	1. Ensure agent push account has local admin rights
The MOM Server detected that the actual NetBIOS name SERVERNAME is not same as the given NetBIOS name provide for remote computer SERVERNAME.domain.com.	80072979	1. Correct DNS/WINS issue. 2. Try pushing to NetBIOS name
The Operations Manager Server cannot process the install/uninstall request for computer xxxxxxx due to failure of operating system version verification	80070035	When Error Code: 80070035 appears with a Console based installation of the OpsMgr Agent and the targeted systems are Windows Server 2008 based systems which have their security hardened by using the Security Configuration Wizard, check to see whether the Server service is running

↧

Getting and keeping the SCOM agent on a Domain Controller – how do YOU do it?

February 20, 2009, 1:27 pm

≫ Next: The Cluster Service will automatically restart itself

≪ Previous: Console based Agent Deployment Troubleshooting table

I’d like to hear some community feedback on this….

In OpsMgr – deploying a SCOM agent to a DC often presents companies with a bit of a challenge. The reason is – in order to install software to a DC and manage it – we need rights on the DC to accomplish this. These rights are needed, anytime we are going to deploy an agent, hotfix an agent, or run a repair on a broken agent to keep the agent healthy.

When we push agents from the console, the default account used to perform the push is the Management Server Action Account. If this account does not have Domain Admin rights – the push will fail to a DC, with an Access Denied. We do allow the option to type in temporary (encrypted) credentials, which are used to deploy the agent, one time, and then are discarded. See the image below:

Here is a list of the most common options I have observed, in place at customer sites… and potential custom options that can be developed. I’d be interested in any community feedback on any options you are using, that I dont cover or haven't seen before.

1. Grant the Management Server Action account Domain Admin or Builtin\Administrators.

a. Not recommended as a best practice, this gives rights to the MSAA that are not required for day to day activities.
b. Con - SCOM Admins now control a domain admin account.

2. Grant a SCOM Administrator a special domain account, for this purpose, that is a domain admin.

a. This allows us to track the actions of that SCOM admin, when he/she uses that special privileged account.
b. That SCOM admin will be able to do repairs, hotfixes, and deployments for DC’s.
c. Con – Domain Admin teams often wont delegate these rights as they are tightly controlled.

3. The SCOM admin team delegates console based agent management to a Domain Administrator for DC agent health.

a. The domain admin must become a SCOM Admin, and therefore could potentially hurt the SCOM environment.
b. Pro – the admins in charge of the DC’s now have full responsibility to keep the agents healthy.
c. Con – the Domain Admins might not understand components of SCOM, and create something that impacts the monitoring environment.

4. The SCOM admin team must partner with the Domain Admin team, and have the Domain Administrator type in his credentials any time the SCOM administrator needs to deploy/hotfix/repair an agent on a domain controller.

a. This is a bit more labor intensive… because the SCOM admin must wait for a domain admin to be available to work on DC agents, but tight security boundaries are maintained.

5. All DC based agents will be manually installed/updated/repaired.

a. This is very common, when the two teams do not trust each other. The Domain Admin team is now required to manually deploy agents to domain controllers, and keep them up to date, and healthy.

6. Use a software deployment tool already in place to deploy/update/repair agents.

a. If a software deployment tool is already in place on DC’s, like SMS/SCCM, you can create packages to deploy, hotfix, and repair agents, similar to your patching of the OS today.

7. Customized solution: Create a Run-As account that is a domain admin, one time, for use in agent deployment/repair.

a. This involves the domain admin typing in credentials ONCE, into a RUN-AS account, which is stored securely and encrypted in the SCOM database.
b. This run-as account can be associated with a run-as profile, which is used by a custom task, which will remotely deploy the agent to the domain controller. This task will execute under the security context of the privileged run-as account.
c. The benefit is that the domain admin gets to control the password for this account, the SCOM admin does not need to know the account credentials.
d. The downside, is that this run-as account could potentially be leveraged by some other workflow, if a SCOM admin intentionally misused it…. Similar to solution #2 above.
e. This is just an idea I had – curious if anyone has already developed a solution like this?

↧

The Cluster Service will automatically restart itself

March 20, 2009, 10:02 am

≫ Next: Are your agents restarting every 10 minutes? Are you sure?

≪ Previous: Getting and keeping the SCOM agent on a Domain Controller – how do YOU do it?

Something I ran across with a customer.

There aren’t many situations where service recoveries run automatically in Microsoft MP’s, but this is one case where they do. The cluster service running is critical to a healthy cluster. In the current cluster MP, the service monitor for the cluster service will automatically start the cluster service on a node, if it detects it stops. There is a recovery action on the monitor to do just that.

As always – if you don't like this intended behavior – you can override just the recovery, and disable it.

Why do you need to know this?

Because – some service packs for clustered applications, require you to stop the cluster service, in order to apply. If you stop this service on a node while doing application maintenance, SCOM will restart it, almost immediately. The correct solution – is to use Maintenance Mode in SCOM, which will unload the monitors, and hence, any automatic recoveries will no long run. So…. make SURE you are effectively using maintenance mode if you ever need to stop your cluster service, or, disable this automatic recovery action.

↧

Are your agents restarting every 10 minutes? Are you sure?

March 26, 2009, 11:50 am

≫ Next: R2 – Improved Agent Proxy Alerts

≪ Previous: The Cluster Service will automatically restart itself

**Updated 12-21-2009

This post is OLD, and the way this process works has changed.

Please see the updated post at:

http://blogs.technet.com/kevinholman/archive/2009/12/21/the-new-and-improved-guide-on-healthservice-restarts-aka-agents-bouncing-their-own-healthservice.aspx

↧

R2 – Improved Agent Proxy Alerts

April 7, 2009, 5:46 pm

≫ Next: Health Service and MonitoringHost thresholds in R2 – how this has changed and what you should know

≪ Previous: Are your agents restarting every 10 minutes? Are you sure?

Here is a nice add in R2: When we give you the old “agent proxy alert”, we now tell you the name of the Agent that needs agent proxy enabled, and resolve the name of the object type that it was bringing in:

Nice improvement. I enable agent proxy for SQL1CLN2 and get on with my day.

↧

Health Service and MonitoringHost thresholds in R2 – how this has changed and what you should know

June 22, 2009, 3:52 pm

≫ Next: Fixing troubled agents

≪ Previous: R2 – Improved Agent Proxy Alerts

**Updated 12-21-2009

This post is OLD, and the way this process works has changed.

Please see the updated post at:

http://blogs.technet.com/kevinholman/archive/2009/12/21/the-new-and-improved-guide-on-healthservice-restarts-aka-agents-bouncing-their-own-healthservice.aspx

↧

Fixing troubled agents

October 1, 2009, 5:23 pm

≫ Next: Keep your management pack names SHORT in SP1!

≪ Previous: Health Service and MonitoringHost thresholds in R2 – how this has changed and what you should know

Sometimes agents either will not “talk” to the management server upon initial installation, and sometimes an agent can get unhealthy long after working fine. Agent health is an ongoing task of any OpsMgr Admin’s life.

This post in NOT an “end to end” manual of all the factors that influence agent health…. but that is something I am working on for a later time. There are so many factors in an agent’s ability to communicate and work as expected. A few key areas that commonly affect this are:

DNS name resolution (Agent to MS, and MS to Agent)
DNS domain membership (disjointed)
DNS suffix search order
Kerberos connectivity
Kerberos SPN’s accessible
Firewalls blocking 5723
Firewalls blocking access to AD for authentication
Packet loss
Invalid or old registry entries
Missing registry entries
Corrupt registry
Default agent action accounts locked down/out (HSLockdown)
HealthService Certificate configuration issues.
Hotfixes required for OS Compatibility
Management Server rejecting the agent

How do you detect agent issues from the console? The problem might be that they are not showing up in the console at all! Perhaps they might be a manual install that never shows up in Pending Actions? Or a push deployment, that stays stuck in Pending actions and never shows up under “Agent Managed”. Or even one that does show up under “Agent Managed” but never shows as being monitored… returning agent version data, etc.

One of the BEST things you can do when faced with an agent health issue… if to look on the agent, in the OperationsManager event log. This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent. That is ALWAYS one of my first steps in troubleshooting.

Another way of examining Agent health – is by the built in views in OpsMgr. In the console – there is a view – Located at the following:

This view is important – because it gives us a perspective of the agent from two different points:

1. The perspective of the agent monitors running on the agent, measuring its own “health”.

2. The perspective of the “Health Service Watcher” which is the agent being monitored from a Management Server".

If any of these are red or yellow – that is an excellent place to start. This should be an area that your level 1 support for Operations manager checks DAILY. We should never have a high number of agents that are not green here. If they aren't – this is indicative of an unhealthy environment, or the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc…

Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.

Now…. the following are some general steps to take to “fix” broken agents. These are not in definitive order. The order of steps really comes down to what you find when looking at the logs after taking these steps.

Start the HealthService on the agent. You might find the HealthService is just not running. This should not be common or systemic. Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure. However – if this is systemic – it is indicative of something causing your HealthService to restart too frequently, or administrators stopping SCOM. Look in the OpsMgr event log for verification.

Bounce the HealthService on the agent. Sometimes this is all that is needed to resolve an agent issue. Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.

Clear the HealthService queue and config (manually). This is done by stopping the HealthService. Then deleting the “\Program Files\System Center Operations Manager 2007\Health Service State” folder. Then start the HealthService. This removes the agent config file, and the agent queue files. The agent starts up with no configuration, so it will resort to the registry to determine what management server to talk to. From the registry – it will find out if it is AD integrated, or a fixed management server to talk to if not. This is located at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\ location, in the \<#>\NetworkName string value. The agent will contact the management server – request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while. This is very much what happens on a new agent during initial deployment.

Clear the HealthService queue and config (from the console). When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class) there is a task in the actions pane - “Flush Health Service State and Cache”. This will perform a very similar action to that above…. as a console task. This will only work on an agent that is somewhat responsive…. if it does not work you need to perform this manually as the agent is really broken from communication with the management server. This task will never complete, and will not return success – because the task breaks off from itself as the queue is flushed.

“Repair” the agent from the console. This is done from the Administration pane – Agent Managed. You should not run a repair on any AD-integrated agent – as this will break the AD integration and assign it to the management server that ran the repair action. A “repair” technically just reinstalls the agent in a push fashion, just like an initial agent deployment. It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.

Reinstall the agent (manually). This would be for manual installs or when push/repair is not possible. This section is where the combination of options gets a little tricky. When you are at this point… where you have given up, I find just going all the way with a brute force reinstall is the best way. This means performing the following steps:

Uninstall the agent via add/remove programs.
Run the Operations Manager Cleanup Tool CleanMom.exe or CleanMOM64.exe. This is designed to make sure that the service, files, and all registry entires are removed.
Ensure that the agent’s folder is removed at: \Program Files\System Center Operations Manager 2007\
Ensure that the following registry keys are deleted:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService

Reboot the agent machine (if possible)
Delete the agent from Agent Managed in the OpsMgr console. This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always required.
Now that the agent is gone cleanly from both OpsMgr console and the agent Operating System…. manually reinstall the agent. Keep it simple – install it using a named management server/management group, and use Local System for the agent action account (these will remove any common issues with a low priv domain account, and AD integration if used) If it works correctly – you can always reinstall again using low priv or AD integration.
Remember to import certificats at this point if you are using those on the individual agent.
As always – look in the OperationsManager event log…. this will tell you if it connected, and is working, or if there is a connectivity issue.

To summarize…. there are many things that can cause an agent issue, and many methods to troubleshoot. However – to summarize at a very general level, my typical steps are:

Review OpsMgr event log on agent
Bounce HealthService
Bounce HealthService clearing \Health Service State folder.
Complete brute force reinstall of the agent.

If it an external issue is causing the issue (DNS, Kerberos, Firewall) then these steps likely will not help you…. but those should be available from the OpsMgr event log.

Also – make sure you see my other posts on agent health and troubleshooting during deployment:

Console based Agent Deployment Troubleshooting table

Agent discovery and push troubleshooting in OpsMgr 2007

Getting lots of Script Failed To Run alerts- WMI Probe Failed Execution- Backward Compatibility

Agent Pending Actions can get out of synch between the Console, and the database

Which hotfixes should I apply-

↧

Keep your management pack names SHORT in SP1!

October 2, 2009, 4:31 pm

≫ Next: 29106 event on RMS – Index was out of range. Wait. What?

≪ Previous: Fixing troubled agents

I have seen this twice now… so I will blog about it. It seems to be rare in the wild, but it will completely cripple a management group when this occurs. So beware SP1 users!

This article does not apply to R2. This is only an issue in OpsMgr 2007 SP1.

When you create your custom management packs – and especially your override management packs – keep the names as simple and short as possible. There is an issue in OpsMgr SP1 – when an agent tries to download management packs – where it will fail if the MP ID (derived from the MP Name) is too long. The worst part about this problem is there there WONT be an error logged. What will happen – is that the agent will keep trying to re-download the MP in question, and it will block ALL other MP’s from being downloaded from that point forward.

There is no simple way to know this condition is impacting you. What will happen – is that an agent will continue to work just fine… but will not get any NEW management packs. So… you might think all is well, but not so. The symptoms thatmight lead you to notice that something is wrong:

Performance data not collected for newer MP’s.
Objects not being discovered for newer MP’s
Alerts not generating as expected from newer MP’s.

The root problem is an age old Windows issue…. file paths over 255 characters not supported well. This has been resolved in R2 in how the agent copies the files over.

In both cases I have seen – someone was creating an override MP for the “IBM Hardware Management Pack for IBM System x and BladeCenter x86 Blade Systems” management pack. So – when they created their override MP – they named it something like: “Overrides - IBM Hardware Management Pack for IBM System x and BladeCenter x86 Blade Systems”

This equated to a Management Pack ID of: Override.IBM.Hardware.Management.Pack.for.IBM.System.x.and.BladeCenter.x.Blade.Systems

That MP ID is 86 characters!

What happens…. is when this management pack is created, and an override is placed in it…. (or custom rule)… the agents that require it will:

Get contacted by the RMS to update their config, and then issue a config change request (21024 in the event log)
Receive new config from the RMS (event 21025)
Process new config – and realize they need a new MP, and request that MP (event 1200)

Where this process breaks…. is the next step in the chain should be that the agent RECEIVES the MP (event 1201) and then issues a statement that the new config has become active (event 1210). The never happen.

Behind the scenes…. from looking at a ETL tracelog, we can see this is failing, when we try to move the file from the “downloaded files” folder to the “management packs” folder:

Error CMPFileManager::MoveManagementPackFile(MPFileManager_cpp383)MoveFile from '\\?\C:\Program Files\System Center Operations Manager 2007\Health Service State\Downloaded Files\MGNAME\1\Override.IBM.Hardware.Management.Pack.for.IBM.System.x.and.BladeCenter.x.Blade.Systems.{26504FED-2FF4-4AC4-A63D-59BF8C09F51F}.{7136257C-1791-7BAB-7072-2FA24284C102}.xml' to 'C:\Program Files\System Center Operations Manager 2007\Health Service State\Management Packs\Override.IBM.Hardware.Management.Pack.for.IBM.System.x.and.BladeCenter.x.Blade.Systems.{26504FED-2FF4-4AC4-A63D-59BF8C09F51F}.{7136257C-1791-7BAB-7072-2FA24284C102}.xml'failed with code 3(ERROR_PATH_NOT_FOUND).

In the example above – the bad path is:

C:\Program Files\System Center Operations Manager 2007\Health Service State\Management Packs\Override.IBM.Hardware.Management.Pack.for.IBM.System.x.and.BladeCenter.x.Blade.Systems.{26504FED-2FF4-4AC4-A63D-59BF8C09F51F}.{7136257C-1791-7BAB-7072-2FA24284C102}.xml

Which is 261 characters. The limit is 255.

Therefore – I recommend you keep your Management Pack *ID* to less than 60 characters. You can examine your long management packs by looking in the console – generally your longest display names will be the longest ID’s:

Even some Microsoft MP’s are dangerously close to the limit…. such as: Microsoft.SystemCenter.VirtualMachineManager.Pro.2008.VMWare.HostPerformance with 76 characters. In most environments you can squeak by at 79 characters…. more or less depending on where you installed your agent path.

Here is a SQL query you can run against the OpsDB to also detect this condition…. and quick check all your potentially long MP’s:

select MPName from managementpack
WHERE len(MPName) > 60

Just change 60 to whatever character count you want.

DONT freak out if you have some more than 60. Just be aware.

DO freak out if you have some more than 80!

If someone had the time and was really handy – you could write a monitor – that runs against the RMS – that would in turn query the OpsDB, and run this query, and change the RMS to unhealthy when over a threshold. THAT would be cool…. and alert you when some author makes a really long MP that has the potential to break all your agents.

Another idea I had… was to create a correlated missing event monitor…. and when you get a 1200, but do NOT get a 1201 within, say, 15 minutes….. that might be a problem. Of course if you wrote this and were already impacted…. the bad agents would never get your new MP to tell you. :-)

↧