Kevin Holman's System Center Blog

Bulk enable of agent proxy setting


As you know.... this setting must be enabled for many agents under OpsMgr 2007.

Primarily - for Active Directory, Exchange, cluster node, SMS, and SharePoint 2007 server agents.

We have a ton of choices out there on how to easily enable this setting for multiple computers at once:

GUI tool:  http://blogs.msdn.com/boris_yanushpolsky/archive/2007/08/02/enabling-proxying-for-agents.aspx

Command line tool:  http://blogs.technet.com/cliveeastwood/archive/2007/08/30/operations-manager-2007-agent-proxy-command-line-tool-proxycfg.aspx

And two different PowerShell examples: http://www.systemcenterforum.org/downloads/#OperationsManager2007
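If you just want something quick to run from the OpsMgr Command Shell, here is a minimal sketch along the same lines as the PowerShell examples linked above (it assumes you are connected to your management group, and relies on the ProxyingEnabled property and ApplyChanges() method those examples use):

# Find every agent that does not have proxying enabled, turn it on, and save the change
$noProxy = Get-Agent | Where-Object {$_.ProxyingEnabled -match "false"}
$noProxy | ForEach-Object {$_.ProxyingEnabled = $true; $_.ApplyChanges()}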


Antivirus Exclusions for MOM and OpsMgr


Antivirus Exclusions in MOM 2005 and OpsMgr 2007: 

 

Processes:

 

Excluding by process executable is risky, because it prevents scanning of any and all files that the process touches - including potentially dangerous ones.  For this reason, unless absolutely necessary, we will not exclude any process executables in AV configurations for MOM servers.  If you do want to exclude the processes – they are documented below:

 

MOM 2005 – momhost.exe

OpsMgr 2007 – monitoringhost.exe

 

Exclusion by Directories:

 

Real-time, scheduled scanner and local scanner directory exclusions for Operations Manager:  The directories listed here are default application directories.  You may need to modify these paths based on your specific design.  Only the following MOM\OpsMgr related directories should be excluded. 

Important Note: When a directory to be excluded is greater than 8 characters in length, add both the short and long file names of the directory to the exclusion list.  Some AV programs require this in order to traverse the sub-directories.

 

SQL Database Servers:

These include the SQL Server database files used by Operations Manager components as well as system database files for the master database and tempdb.  To exclude these by directory, exclude the directory for the LDF and MDF files:

 

Examples:

C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data

D:\MSSQL\DATA

E:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Log

 

 

 

 

 

MOM 2005 (management servers and agents):

These include the queue and log files used by Operations Manager.

 

Example:

C:\Documents and Settings\All Users\Application Data\Microsoft\Microsoft Operations Manager\

 

 

OpsMgr 2007 (management servers and agents):

These include the queue and log files used by Operations Manager.

 

Example:

C:\Program Files\System Center Operations Manager 2007\Health Service State\Health Service Store

 

 

Exclusion of File Type by Extensions:

Real-time, scheduled scanner and local scanner file extension specific exclusions for Operations Manager: 

SQL Database Servers:

These include the SQL Server database files used by Operations Manager components as well as system database files for the master database and tempdb. 

 

Examples:

MDF, LDF

 

MOM 2005 (management servers and agents):

These include the queue and log files used by Operations Manager.

 

Example:

WKF, PQF, PQF0, PQF1

 

OpsMgr 2007 (management servers and agents):

These include the queue and log files used by Operations Manager.

 

Example:

EDB, CHK, LOG.

 

Notes:

Page files should also be excluded from any real time scanning.
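How you enter these exclusions depends entirely on your antivirus product and its management console.  Purely as an illustration (these are the Windows Defender PowerShell cmdlets from a much later era - not something that shipped alongside MOM or OpsMgr 2007), the directory and extension exclusions above might be applied like this:

# Hypothetical illustration only - use your AV vendor's own exclusion mechanism
Add-MpPreference -ExclusionPath "C:\Program Files\System Center Operations Manager 2007\Health Service State\Health Service Store"
Add-MpPreference -ExclusionExtension "EDB","CHK","LOG"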

Agent discovery and push troubleshooting in OpsMgr 2007


OpsMgr 2007 Agent troubleshooting:

There is a GREAT graphical display of the Agent discovery and push process, taken from:

http://blogs.technet.com/momteam/archive/2007/12/10/how-does-computer-discovery-work-in-opsmgr-2007.aspx

 

Agent Prerequisites:

  1. Supported Operating System Version (see below)
  2. Windows Installer 3.1
  3. MSXML 6 Parser

Agent push requirements (including firewall ports):

  • The account being used to push the agent must have local admin rights on the targeted agent machine.
  • The following ports must be open:
    • RPC endpoint mapper (Port 135, TCP/UDP)
    • *RPC/DCOM high ports, Windows 2000/2003 (Ports 1024-5000, TCP/UDP)
    • *RPC/DCOM high ports, Windows 2008 (Ports 49152-65535, TCP/UDP)
    • NetBIOS name service (Port 137, TCP/UDP)
    • NetBIOS session service (Port 139, TCP/UDP)
    • SMB over IP (Port 445, TCP)
    • MOM Channel (Port 5723, TCP/UDP)
  • The following services must be set:
    • Netlogon (startup type: Automatic, running)
    • **Remote Registry (startup type: Automatic, running)
    • Windows Installer (startup type: Manual, running)
    • Automatic Updates (startup type: Automatic, running)

 

*The RPC/DCOM High ports are required for RPC communications.  This is generally why we don't recommend/support agent push in a heavily firewalled environment, because opening these port ranges creates a potential security issue that negates the firewall boundary.  For more information:

http://support.microsoft.com/kb/154596/

http://support.microsoft.com/default.aspx?scid=kb;EN-US;929851

Important: Don’t change the RPC high ports without a deep understanding of your environment and the potential impact!

 

**Not required for agent push, but required for some management packs.

  • The remote management server must be able to connect to the remote agent machine via WMI and execute WMI Query "Select * from Win32_OperatingSystem".  WMI must be running, and healthy, and allowing remote connections.
  • The management server must be able to connect to the targeted agent machine via \\servername\c$
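A quick way to pre-check those last two requirements is to run the same WMI query and test the admin share from the management server yourself.  A minimal PowerShell sketch (the server name is an example - substitute your own):

# Target agent machine (example name)
$server = "servername.domain.com"

# The same WMI query the MOM Server runs during discovery/push
Get-WmiObject -Class Win32_OperatingSystem -ComputerName $server | Select-Object Caption, Version

# Verify the admin share is reachable
Test-Path "\\$server\c$"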

Logging:

  • When pushing an agent from a management server, a log will be written in the event of a failure to:  \Program Files\System Center OpsMgr\AgentManagement\AgentLogs\ on the Management Server.
  • The log on an agent is not enabled by default (like MOM 2005) when using agent push.  If you manually install an agent using the MSI – it will place a verbose logfile at C:\documents and settings\%user%\local settings\temp\momagent.log

To troubleshoot agent push with a verbose log – you need to enable verbose MSI logging:    http://support.microsoft.com/kb/314852/en-us
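In short, that KB article has you set a machine-wide Windows Installer logging policy in the registry; for a manual agent install you can also just pass the logging switch to msiexec.  A sketch (the registry value follows KB 314852, and the MOMAgent.msi path and log location are examples):

# Enable machine-wide verbose Windows Installer logging (per KB 314852)
reg add "HKLM\Software\Policies\Microsoft\Windows\Installer" /v Logging /t REG_SZ /d voicewarmup /f

# Or, for a manual agent install, log just that one MSI run
msiexec /i MOMAgent.msi /l*v "$env:temp\momagent.log"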

Common Agent Push errors:

Below are some common push failures.   Also see my troubleshooting table for more details:  Console based Agent Deployment Troubleshooting table

The MOM Server detected that the following services on computer "(null);NetLogon" are not running. These services are required for push agent installation. To complete this operation, either start the required services on the computer or install the MOM agent manually by using MOMAgent.msi located on the product CD. Operation: Agent Install

Remote Computer Name: dc1.opsmgr.net

Install account: OPSMGR\localadmin

Error Code: C000296E

Error Description: Unknown error 0xC000296E

Solution: Netlogon service is not running.  It must be set to auto/started

The MOM Server detected that the Windows Installer service (MSIServer) is disabled on computer "dc1.opsmgr.net". This service is required for push agent installation. To complete this operation on the computer, either set the MSIServer startup type to "Manual" or "Automatic", or install the MOM agent manually by using MOMAgent.msi located on the product CD.

Operation: Agent Install

Install account: OPSMGR\localadmin

Error Code: C0002976

Error Description: Unknown error 0xC0002976

Solution:  Windows Installer service is not running or set to disabled – set this to manual or auto and start it.

The Agent Management Operation Agent Install failed for remote computer dc1.opsmgr.net.

Install account: OPSMGR\localadmin

Error Code: 80070643

Error Description: Fatal error during installation.

Microsoft Installer Error Description:

For more information, see Windows Installer log file "C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\DC1AgentInstall.LOG

C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\DC1MOMAgentMgmt.log" on the Management Server.

Solution:  Enable the Automatic Updates service…. install the agent – then disable the Automatic Updates service again if desired.

 

 

Additional Info:

There are sub-components to the OpsMgr Agent installer service:

1.  The service is a standard NT service.  It also handles registration/un-registration of the DCOM object that contains the logic for handling MSI/MSP packages.

2.  The DCOM object takes direction from the module on the OpsMgr server.  It installs/uninstalls/updates the OpsMgr agent asynchronously, returns the list of currently installed QFEs, verifies prerequisites (such as channel connectivity) before completing the agent install, handles multi-homing of the agent, and reads agent parameters such as version, install directory, etc.

3.  RPC is used to establish a connection to the target machine; SMB is used to copy the source files over.

4.  WMI is used to check prerequisites.

Agents Inside a Trust Boundary

Discovery:
Discovery requires that the TCP 135 (RPC), RPC range, and TCP 445 (SMB) ports remain open and that the SMB service is enabled.

Installation:
After a target device has been discovered, an agent can be deployed to it. Agent installation requires:

  • Opening Remote procedure call (RPC) ports beginning with endpoint mapper TCP 135 and the Server Message Block (SMB) port TCP/UDP 445.
  • Enabling the File and Printer Sharing for Microsoft Networks and the Client for Microsoft Networks services (this ensures that the SMB port is active).
  • If enabled, Windows Firewall Group Policy settings for Allow remote administration exception and Allow file and printer sharing exception must be set to Allow unsolicited incoming messages from: the IP address and subnets for the primary and secondary Management Servers for the agent. For more information, see How to Configure the Windows Firewall to Enable Management of Windows-Based Computers from the Operations Manager 2007 Operations Console.
  • An account that has local administrator rights on the target computer.
  • Windows Installer 3.1. To install, see article 893803 in the Microsoft Knowledge Base (http://go.microsoft.com/fwlink/?LinkId=86322).
  • Microsoft Core XML services (MSXML) 6 on the Operations Manager product installation media in the \msxml sub directory.

Ongoing Management:
Ongoing management of an agent requires that the TCP 135 (RPC), RPC range, and TCP 445 (SMB) ports remain open and that the SMB service remains enabled.

Supported Operating systems for an Agent:

See:  Operations Manager 2007 R2 Supported Configurations

 

Installing the OpsMgr 2007 agent on an ISA 2004 or ISA 2006 server


When you want to manage and monitor an ISA server, you need to install the OpsMgr agent.

However, there is no guide published for the OpsMgr 2007 ISA MP.....   It comes with the MOM 2005 guide.  In ISA, there was a system policy which you could enable for MOM.  This would open the necessary ports for the MOM agent to communicate with a management server.  However, these ports have changed, yet there seems to be no guidance on how to manage an ISA box with SCOM.

I will document the steps necessary:

When you install an OpsMgr agent on an ISA server, you will see the following event in the event log when the agent starts:

 

------------------

Event Type:    Error
Event Source:    OpsMgr Connector
Event Category:    None
Event ID:    21006
Date:        2/11/2008
Time:        11:05:36 AM
User:        N/A
Computer:    ISA
Description:
The OpsMgr Connector could not connect to OMRMS:5723.  The error code is 10065L(A socket operation was attempted to an unreachable host.
).  Please verify there is network connectivity, the server is running and has registered it's listening port, and there are no firewalls blocking traffic to the destination.

-------------------

 

There is the problem!  We DO have a firewall blocking the traffic.

The OpsMgr agent needs to be able to communicate, outbound, to a management server over TCP_5723.  We will use this for all communications, including heartbeats.  Therefore, we need an access rule to allow this traffic:

1.  Create a new access rule.  Give it a name according to your corporate ISA rule naming standards.  Click Next:

image

 

2.  Choose "Allow" and click Next.

3.  On Protocols, click "Add", then "New", then "Protocol".  Give this new protocol a name, such as "OpsMgr Agent tcp_5723"

4.  On the "New Protocol Definition Wizard" screen, click "New" and fill out the boxes.  We want TCP, Outbound, and port 5723.  Then click OK.

image

 

5.  Click Next, No secondary connections, and then click Finish.

6.  Find your new protocol under User Defined, and click "Add", then Close, then Next.

7.  On the access rule sources - we want FROM "Localhost", which is located under the "Networks" object:

image

 

8.  On the "Access Rule Destinations" - we want the IP addresses of all possible OpsMgr management servers/gateways that this ISA will report, or fail over to.  For this example, I am using the "Internal" network object, which includes all internally defined IP subnets:

image

 

Accept the default settings for "All Users", click Next, then Finish, then apply this new rule to the firewall configuration.

 

You should no longer see an Event ID 21006, after bouncing the Healthservice on the ISA server.  However, in order to support mutual authentication, you might still need to configure Certificates, or rules allowing AD communications if the ISA server is a member of the same forest as the OpsMgr servers.

Event ID 2115 A Bind Data Source in Management Group


I see this event a lot in customer environments.  I am not an expert on troubleshooting this here... but saw this post in the MS newsgroups and felt it was worth capturing....

My experience has been that it is MUCH more common to see these when there is a management pack that collects way too much discovery data.... than any real performance problem with the data warehouse.  In most cases.... if the issue just started after bringing in a new MP.... deleting that MP solves the problem.  I have seen this repeatedly after importing the Cluster MP or Exchange 2007 MP.... but haven't been able to fully investigate the root cause yet:

 

In a nutshell.... if they are happening just a couple times an hour.... and the time in seconds is fairly low (under a few minutes) then this is normal. 

If they are happening very frequently - like every minute, and the times are increasing - then there is an issue that needs to be resolved.

 

Taken from the newsgroups:

-------------------------------------------

In OpsMgr 2007 one of the performance concerns is DB/DW data insertion performance. Here is a description of how to identify and troubleshoot problems with DB/DW data insertion.

Symptoms:

DB/DW write action workflows run on a Management Server, they first keep data received from Agent / Gateway in an internal buffer, then they create a batch of data from the buffer and insert the data batch to DB / DW, when the insertion of the first batch finished, they will create another batch and insert it to DB / DW. The size of the batch depends on how much data is available in the buffer when the batch is created, but there is a maximum limit on the size of the batch, a batch can contain up to 5000 data items.  If data item incoming (from Agent / Gateway) throughput becomes larger, or the data item insertion (to DB/DW) throughput becomes smaller, then the buffer will tend to accumulate more data and the batch size will tend to become larger.  There are different write action workflows running on a MS, they handle data insertion to DB / DW for different type of data:

  • Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
  • Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
  • Microsoft.SystemCenter.DataWarehouse.CollectEventData
  • Microsoft.SystemCenter.CollectAlerts
  • Microsoft.SystemCenter.CollectEntityState
  • Microsoft.SystemCenter.CollectPublishedEntityState
  • Microsoft.SystemCenter.CollectDiscoveryData
  • Microsoft.SystemCenter.CollectSignatureData
  • Microsoft.SystemCenter.CollectEventData

When a DB/DW write action workflow on a Management Server notices that the insertion of a single data batch is slow (i.e. slower than 1 minute), it will start to log a 2115 NT event to the OpsMgr NT event log once every minute until the batch is inserted to the DB / DW or is dropped by the DB / DW write action module.  So you will see 2115 events in the management server's "Operations Manager" NT event log when it is slow to insert data to the DB / DW.  You might also see 2115 events when there is a burst of data items coming to the Management server and the number of data items in a batch is large.  (This can happen during a large amount of discovery data being inserted - from a freshly imported or noisy management pack.)

2115 events have two important pieces of information: the name of the workflow that has the insertion problem, and the pending time since the workflow started inserting the last data batch.  Here is an example of a 2115 event:

------------------------------------

A Bind Data Source in Management Group OpsMgr07PREMT01 has posted items to the workflow, but has not received a response in 3600 seconds.  This indicates a performance or functional problem with the workflow.

Workflow Id : Microsoft.SystemCenter.CollectSignatureData

Instance    : MOMPREMSMT02.redmond.corp.microsoft.com

Instance Id : {6D52A6BB-9535-9136-0EF2-128511F264C4}

------------------------------------------

This 2115 event is saying the DB write action workflow "Microsoft.SystemCenter.CollectSignatureData" (which writes performance signature data to the DB) is trying to insert a batch of signature data to the DB; it started inserting 3600 seconds ago but the insertion has not finished yet.  Normally inserting a batch should finish within 1 minute.

Normally, there should not be many 2115 events happening on a Management server.  If they happen less than 1 or 2 times every hour (per write action workflow), then it is not a big concern, but if they happen more than that, there is a DB / DW insertion problem.

The following performance counters on the Management Server give information about DB / DW write action insertion batch size and insertion time.  If the batch size is becoming larger (by default the maximum batch size is 5000), it means the management server is either slow in inserting data to the DB/DW or is getting a burst of data items from Agents/Gateways.  From the DB / DW write action's Avg. Processing Time, you will see how much time it takes to write a batch of data to the DB / DW.

  • OpsMgr DB Write Action Modules(*)\Avg. Batch Size
  • OpsMgr DB Write Action Modules(*)\Avg. Processing Time
  • OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms
  • OpsMgr DW Writer Module(*)\Avg. Batch Size
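A quick way to sample these counters on a management server (assuming PowerShell 2.0 or later for Get-Counter; on older builds typeperf.exe does the same job with the same counter paths):

# Sample the write action batch size / processing time counters once a minute, ten times
$counters = "\OpsMgr DB Write Action Modules(*)\Avg. Batch Size",
            "\OpsMgr DB Write Action Modules(*)\Avg. Processing Time",
            "\OpsMgr DW Writer Module(*)\Avg. Batch Size",
            "\OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms"
Get-Counter -Counter $counters -SampleInterval 60 -MaxSamples 10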

Possible root causes:

  • In OpsMgr, discovery data insertion is relatively expensive, so a discovery burst (a short period of time when a lot of discovery data is received by the management server) can cause 2115 events complaining about slow insertion of discovery data, since discovery insertion should not happen frequently.  So if you consistently see 2115 events for discovery data collection, that means you either have a DB / DW insertion problem or some discovery rules in an MP are collecting too much discovery data.
  • OpsMgr config updates caused by instance space changes or MP imports will impact CPU utilization on the DB and will impact DB data insertion.  After importing a new MP, or after a big instance space change in a large environment, you will probably see more 2115 events than normal.
  • Expensive UI queries can impact resource utilization on the DB and could impact DB data insertion.  When a user is running an expensive UI operation, you will probably see more 2115 events than normal.
  • When the DB / DW is out of space or offline, you will find the Management server keeps logging 2115 events to the NT event log, and the pending time keeps growing higher and higher.
  • Sometimes an invalid data item sent from an agent / gateway will cause a DB / DW insertion error, which will end up with a 2115 event complaining about slow DB / DW insertion.  In this case please check the OpsMgr event log for relevant error events.  This is more common in DW write action workflows.
  • If the DB / DW hardware is not configured properly, there could be performance issues, and that can cause slow data insertion to the DB / DW.  The problem could be: 
    • The network link between the DB / DW and the MS is slow (either bandwidth is small or latency is large; as a best practice we recommend the MS be on the same LAN as the DB/DW).
    • The data / log / tempdb disk used by the DB / DW is slow (we recommend separating data, log and tempdb onto different disks, we recommend RAID 10 instead of RAID 5, and we recommend turning on the write cache of the array controllers). 
    • The OpsDB tables are too fragmented (this is a common cause of DB performance issues).  Reindexing the affected tables will resolve this.
    • The DB / DW does not have enough memory.

 

Now - that is the GENERAL synopsis and how to attack them.  Next - we will cover a specific issue we are seeing with a specific type of 2115 Event:

-----------------------------------------------

It appears we may be hitting a cache resolution error we have been trying to catch for a while.  This is about the CollectEventData workflow.  The error is very hard to catch and we're including a fix in SP2 to avoid it.  There are two ways to resolve the problem in the meantime.  Since the error happens very rarely, you can just restart the Health Service on the Management Server that is affected.  Or you can prevent it from blocking the workflow by creating overrides in the following way:

-----------------------------------------------


1) Launch Console, switch to Authoring space and click "Rules"
2) In the right top hand side of the screen click "Change Scope"
3) Select "Data Warehouse Connection Server" in the list of types,. click "Ok"
4) Find "Event data collector" rule in the list of rules;
5) Right click "Event data collector" rule, select Overrides/Override the Rule/For all objects of type...
6) Set Max Execution Attempt Count to 10
7) Set Execution Attempt Timeout Interval Seconds to 6

That way if the DW event writer fails to process an event batch for ~a minute, it will discard the batch.  2115 events related to Datawarehouse.CollectEventData should go away after you apply these overrides.  BTW, while you're at it you may want to override "Max Batches To Process Before Maintenance Count" to 50 if you have a relatively large environment.  We think 50 is a better default setting than SP1's 20 in this case and we'll switch the default to 50 in SP2.

-------------------------------------------------

 

Essentially - to know if you are affected by the specific 2115 issue described above - here are the criteria:

 

1.  You are seeing 2115 bind events in the OpsMgr event log of the RMS or MS, and they are recurring every minute.

2.  The events have a Workflow Id of:  Microsoft.SystemCenter.DataWarehouse.CollectEventData

3.  The "has not received a response" time is increasing, and growing to be a very large number over time.

 

Here is an example of a MS with the problem:  Note consecutive events, from the CollectEventData workflow, occurring every minute, with the time being a large number and increasing:

 

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:37:06 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706594 seconds.  This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

 

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:36:05 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706533 seconds.  This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

 

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:35:03 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706471 seconds.  This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

How do I know which hotfixes have been applied to which agents?


***UPDATE***  A new hotfix has been released, which is a simple updated management pack.... which fixes the Patchlist table to include all hotfixes, and cleans up the formatting.  I recommend you get it and install it on your SP1 environments.

http://support.microsoft.com/kb/958253 

------------------------------------------------------------------------------------- 

As more hot-fixes are applied to our OpsMgr 2007 SP1 environments.... how can we know which hot-fixes have been applied to our agents?  How can we detect an agent that needs patching but got missed?

In MOM 2005... this was rather simple... in the Admin console, under Agent-managed Computers, there was a column called "version" which incremented the agent version number in most cases.

In OpsMgr... we do not update this field in the Administration tab.  See graphic:  The version here shows the major version number... like RTM 6.0.5000, SP1 6.0.6278.... etc....

image

So.... how do we examine this now for minor updates?

Create a new State view.  Call it "Custom - Agent Patch List" or something you like.  Target "Health Service" for "Show Data Related To".  You can filter it further to the "Agent Managed Computer Group".

Then - personalize this view, and show the columns for "Name" and "Patch List".  See graphic:

image

Now.... the "Patch List" column isn't super user friendly - because of the amount of text in the single column.... but it will let you see what has been installed.  For instance - here is an example of KB950853 installed:

image

To make this a bit easier.... I wrote the following SQL query which does essentially the same thing.... you can create a web based SQL report from this and the data will be much easier to manage in Excel:

select bme.path AS 'Agent Name', hs.patchlist AS 'Patch List' from MT_HealthService hs
inner join BaseManagedEntity bme on hs.BaseManagedEntityId = bme.BaseManagedEntityId
order by path

If you want to query for all agents missing a specific hot-fix... you could run a query like this.... just change the KB number below (thanks to Brad Turner for providing the idea):

select bme.path AS 'Agent Name', hs.patchlist AS 'Patch List' from MT_HealthService hs
inner join BaseManagedEntity bme on hs.BaseManagedEntityId = bme.BaseManagedEntityId
where hs.patchlist not like '%951380%'
order by path
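If you would rather stay out of SQL entirely, the agent objects in the Command Shell appear to carry the same information.  A sketch - note that the PatchList property name here is an assumption based on community scripts, so verify it in your own shell first:

# List every agent and the hotfixes its discovery has reported (PatchList property name assumed)
Get-Agent | Sort-Object Name | Format-Table Name, PatchList -AutoSize

# Agents whose patch list does not mention a given KB number (example KB)
Get-Agent | Where-Object {$_.PatchList -notmatch "951380"} | Select-Object Name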


I have noticed, however, that this field, "Patch List" is limited to 255 characters in the database.... which I imagine will run out of space fairly soon.  I will also be interested to see how we handle this table column, once SP2 comes out.... as any pre-SP2 applied hotfixes will no longer apply.

The Patch List information is discovered and updated once per day across all agents in the management group.

 

For a report which shows you the same information, but lets you query for all agents missing a specific hotfix - check out my more recent post with the report download:

http://blogs.technet.com/kevinholman/archive/2008/06/27/a-report-to-show-all-agents-missing-a-specific-hotfix.aspx

 

Agent Proxy alerts - finding the right machine to enable agent proxy on using a custom report


Certain types of agents need the agent proxy setting enabled.  These are documented in various guides... such as Exchange, Active Directory, Cluster nodes, etc...

However, sometimes, we still get alerts that Agent Proxy needs to be enabled for a HealthService.  The problem is... the Alert often doesn't tell us which agent needs this enabled!!!

Here is an example alert:

 

image

 

The alert context tab is telling us about something called "SQLCLUSTER".... but I know that is a virtual cluster instance name... not the name of a real agent.

 

image

 

Marius blogged about a SQL query... that will help us find the agent that needs this turned on:

http://blogs.msdn.com/mariussutara/archive/2007/11/09/agent-proxying-alert.aspx

 

The sql query I like to use is:

select DisplayName, Path, basemanagedentityid from basemanagedentity where basemanagedentityid = 'guid'

Where "guid" = the GUID of the healthservice in the alert.

 

So in my example above - the first GUID (the Health service GUID) is the one we need.....  the HealthService that is causing the problem....

Health service ( 4F6BCCD4-2A41-1C39-DC50-5CE6CA10E0D3 ) should not generate data about this managed object ( A6D9CC33-3EF7-00BF-3E78-B368B32F1486 ).

If we drop this into the query.... it will look like so:

select DisplayName, Path, basemanagedentityid from basemanagedentity where basemanagedentityid = '4F6BCCD4-2A41-1C39-DC50-5CE6CA10E0D3'

Which when run in a SQL query returns:

 

DisplayName    Path    basemanagedentityid
sqlnode2.opsmgr.net sqlnode2.opsmgr.net 4F6BCCD4-2A41-1C39-DC50-5CE6CA10E0D3

 

Aha!  So - SQLNODE2 needs agent proxy enabled.

Well.... how about instead of all this dropping to SQL.... we create a report with the query above, and have an input parameter where you can just paste in the GUID from the alert?  I have created just that, and you can download it below!

Please see my previous post on creating a data source for the OpsDB here:  Creating a new data source for reporting against the Operational Database  You will need to do that first in order to see this report.... since it runs against the Operational Database, not the Data Warehouse.

Simply download this RDL file, then browse to your reporting website (http://reportingservername/reports) browse to your new custom folder for reports, and choose "Upload File".  Your new report is uploaded, and you should be able to see it in the Ops Console under Reporting now:

 

image

 

Open the report.  The report just needs you to paste in the healthservice GUID you saw in the Agent Proxy alert:

 

image

 

This should make it a bit easier and faster to tackle these types of alerts in the future.... for the few agents you are missing this setting on.

 

 

A simpler way to do this without running a report.... is to use "Discovered Inventory" in the monitoring console.

 

image

 

Select "Change Target Type" from the actions pane, and then choose "View All Targets" and then type "Health Service Watcher" in the "Look For" box.  Select the Health Service Watcher Class and click OK.

 

image

 

Now paste your GUID into the "FIND" box and click OK.  Make sure you don't have any trailing spaces in the GUID:

 

image

 

 

 

The Report download is here:

Which servers are DOWN in my company, and which just have a heartbeat failure, RIGHT NOW?


 

 

 

 

 

In OpsMgr 2007, when an agent experiences a heartbeat failure, several things happen.  There are diagnostics, and possibly recoveries, that are run.  Alerts, and possibly notifications, go out.

But what happens if my Operations team misses one of these alerts?  What can I do to "spot check" agents with issues?

Well, any time an agent has a heartbeat failure, we gray out the state icon of the agent's last known state in each state view. 

However - you CAN create a State view that will turn Red or Yellow just like any other state view.  Simply create a new State View, and scope the class to Health Service Watcher (Agent).

I called mine Heartbeat State View:

image

This view will show us when any of the agent health service watcher monitors are unhealthy:  In my case - OWA and EXCH1 have issues.  OWA is DOWN, while EXCH1 agent healthservice is stopped.

image

However - here is the issue.  This view shows us when ANY monitor rolls up unhealthy state.... this includes heartbeat failures AND computer unreachable (server IP stack is down):

image

What if I want a State View - to ONLY show me computers that are DOWN.... as in... not heartbeating AND not responding to any PING?  Most customers consider this their "most critical situation".  Well, I haven't found an easy way to do that.... so I wrote a report which handles it.  This report will query the OpsDB for the state of the "Computer Not Reachable" monitor, and only display those servers.  It is based on the following query:

SELECT bme.DisplayName, s.LastModified as LastModifiedUTC, dateadd(hh,-5,s.LastModified) as 'LastModifiedCST (GMT-5)'
FROM state AS s, BaseManagedEntity as bme
WHERE s.basemanagedentityid = bme.basemanagedentityid AND s.monitorid
IN (SELECT MonitorId FROM Monitor WHERE MonitorName = 'Microsoft.SystemCenter.HealthService.ComputerDown')
AND s.Healthstate = '3' AND bme.IsDeleted = '0'
ORDER BY s.Lastmodified DESC

You can import this report if you have created a data source as shown in my previous post: 

http://blogs.technet.com/kevinholman/archive/2008/06/27/creating-a-new-data-source-for-reporting-against-the-operational-database.aspx

Import this report into your custom folder... and run it.  You can schedule it to receive it first thing every day... if you like the output:

image

*****  Update 6-30-08  I removed a section of the original query relating to maintenance mode.  We found that if a down server had never been in maintenance mode, the server would not show up in the report.  The query and report download have been updated to address this.

Report is attached below:


A report to show all agents missing a specific hotfix


This is a continuation of my previous post on determining which agents are missing a hot-fix:

How do I know which hotfixes have been applied to which agents?

I wrote up a report that allows you to paste a KB article number into the report as a parameter, and it will then show all agents that are potentially missing that hotfix.  This will help you easily find agents which need to be patched and got missed for some reason.

You can run this report if you create the SQL reporting data source as specified in my previous post:

Creating a new data source for reporting against the Operational Database

Once imported - it will show up in the console.  Open the report, and paste in any KB article number for an OpsMgr hotfix you have applied.  The number MUST begin and end with "%".... such as %951380% as shown:

 

image

The report is attached below:

Agent Pending Actions can get out of synch between the Console, and the database


When you look at your agent pending actions in the Administration pane of the console.... you will see pending actions for things like approving a manual agent install, agent installation in progress, approving agent updates, like from a hotfix, etc.

 

This pending action information is also contained in the SQL table in the OpsDB - agentpendingaction

 

It is possible for the agentpendingaction table to get out of synch with the console, for instance, if the server was in the middle of updating/installing an agent - and the management server Healthservice process crashed or was killed.

 

In this case, you might have a lingering pending action that blocks you from doing something in the future.  For instance - you might have a pending action to install an agent that does not show up in the pending actions view of the console.  What might happen then, is that when you attempt to discover and push the agent to this same server, you get an error message:

 

"One or more computers you are trying to manage are already in the process of being managed.  Please resolve these issues via the Pending Management view in Administration, prior to attempting to manage them again"

ss2

 

The problem is - they don't show up in this view!

 

To view the database information on pending actions:

select * from agentpendingaction

You should be able to find your pending action there - that does not show up in the Pending Action view in the console, if you are affected by this.

 

To resolve - we should first try and reject these "ghost" pending actions via the SDK... using powershell.  Open a command shell, and run the following:

get-agentpendingaction

To see a prettier view:

get-agentpendingaction | ft agentname,agentpendingactiontype

To see a specific pending action for a specific agent:

get-agentPendingAction | where {$_.AgentName -eq "servername.domain.com"}

To reject the specific pending action:

get-agentPendingAction | where {$_.AgentName -eq "servername.domain.com"}|Reject-agentPendingAction

We can use the last line - to reject the specific pending action we are interested in.

 

You might get an exception running this:

Reject-AgentPendingAction : Microsoft.EnterpriseManagement.Common.UnknownServiceException: The service threw an unknown exception. See inner exception for details. ---> System.ServiceModel.FaultException`1[System.ServiceModel.ExceptionDetail]: Exception of type 'Microsoft.EnterpriseManagement.Common.DataItemDoesNotExistException' was thrown.

If this fails with an exception, or if our problem pending action doesn't even show up in PowerShell.... we have to drop down to the SQL database level.  This is a LAST resort and NOT SUPPORTED.... run at your own risk.

There is a stored procedure to delete pending actions.... here is an example, to run in a SQL query window:

exec p_AgentPendingActionDeleteByAgentName 'agentname.domain.com'

Change 'agentname.domain.com' to the agent name that is showing up in the SQL table, but not in the console view.

Console based Agent Deployment Troubleshooting table


This post is a list of common agent push deployment errors… and some possible remediation options.

 

 

Most common errors while pushing an agent:

 

Each entry below shows the error text, the error code(s), and the remediation steps:

The MOM Server could not execute WMI Query "Select * from Win32_Environment where
NAME='PROCESSOR_ARCHITECTURE'" on computer server.domain.com

Operation: Agent Install
Install account: domain\account
Error Code: 80004005
Error Description: Unspecified error

80004005

1.  Check the PATH environment variable.  If the PATH statement is very long, due to lots of installed third party software - this can fail.  Reduce the path by converting any long filename destinations to 8.3, and remove any path statements that are not necessary.  Or apply hotfix:  http://support.microsoft.com/?id=969572

2.  The cause could be corrupted Performance Counters on the target Agent.

To rebuild all Performance counters including extensible and third party counters in Windows Server 2003, type the following commands at a command prompt. Press ENTER after each command.
cd \windows\system32
lodctr /R
Note /R is uppercase.
Windows Server 2003 rebuilds all the counters because it reads all the .ini files in the C:\Windows\inf\009 folder for the English operating system.

How to manually rebuild Performance Counter Library values
http://support.microsoft.com/kb/300956

3.  Manual agent install. 

The MOM Server could not execute WMI Query "Select * from Win32_OperatingSystem" on
computer “servername.domain.com”
Operation: Agent Install
Install account: DOMAIN\account
Error Code: 800706BA
Error Description: The RPC server is unavailable.

The MOM Server could not execute WMI Query "(null)” on
computer “servername.domain.com”
Operation: Agent Install
Install account: DOMAIN\account
Error Code: 800706BA
Error Description: The RPC server is unavailable.

8004100A

800706BA

1.  Ensure agent push account has local admin rights

2.  Firewall is blocking NetBIOS access.  If Windows 2008 firewall is enabled, ensure “Remote Administration (RPC)” rule is enabled/allowed.  We need port 135 (RPC) and the DCOM port range opened for console push through a firewall. 

3.  Inspect WMI service, health, and rebuild repository if necessary

4.  Firewall is blocking ICMP  (Live OneCare)

5.  DNS incorrect

The MOM Server failed to open service control manager on computer "servername.domain.com". Access is Denied
Operation: Agent Install
Install account: DomainName\User Account
Error Code: 80070005
Error Description: Access is denied.

80070005

80041002

1.  Verify SCOM agent push account is in Local Administrators group on target computer.

2.  On Domain controllers will have to work with AD team to install agent manually if agent push account is not a domain admin.

3.  Disable McAfee antivirus during push

The MOM Server failed to open service control manager on computer "servername.domain.com".
Therefore, the MOM Server cannot complete configuration of agent on the computer.
Operation: Agent Install
Install account: DOMAIN\account
Error Code: 800706BA
Error Description: The RPC server is unavailable.

800706BA 1.  Firewall blocking NetBIOS ports

2.  DNS resolution issue.  Make sure the agent can ping the MS by NetBIOS and FQDN.  Make sure the MS can ping the agent by NetBIOS and FQDN

3.  Firewall blocking ICMP

4.  RPC services stopped.

The MOM Server failed to acquire lock to remote computer servername.domain.com. This means there is already an agent management operation proceeding on this computer, please retry the Push Agent operation after some time.
Operation: Agent Install
Install account: DOMAIN\account
Error Code: 80072971
Error description: Unknown error 0x80072971

80072971

This problem occurs if the LockFileTime.txt file is located in the following folder on the remote computer:
%windir%\422C3AB1-32E0-4411-BF66-A84FEEFCC8E2
When you install or remove a management agent, the Operations Manager 2007 management server copies temporary files to the remote computer. One of these files is named LockFileTime.txt. This lock file is intended to prevent another management server from performing a management agent installation at the same time as the current installation. If the management agent installation is unsuccessful and if the management server loses connectivity with the remote computer, the temporary files may not be removed. Therefore, the LockFileTime.txt may remain in the folder on the remote computer. When the management server next tries to perform an agent installation, the management server detects the lock file. Therefore, the management agent installation is unsuccessful.

http://support.microsoft.com/kb/934760/en-us

The MOM Server detected that the following services on computer "(null);NetLogon" are not running. These services are required for push agent installation. To complete this operation, either start the required services on the computer or install the MOM agent manually by using MOMAgent.msi located on the product CD.
Operation: Agent Install
Remote Computer Name: servername.domain.com Install account: DOMAIN\account
Error Code: C000296E
Error Description: Unknown error 0xC000296E

C000296E

1.  Netlogon service is not running.  It must be set to auto/started

The MOM Server detected that the following services on computer
"winmgmt;(null)" are not running

C000296E 1.  WMI services not running or WMI corrupt

The MOM Server detected that the Windows Installer service (MSIServer) is disabled on computer "servername.domain.com". This service is required for push agent installation. To complete this operation on the computer, either set the MSIServer startup type to "Manual" or "Automatic", or install the MOM agent manually by using MOMAgent.msi located on the product CD.
Operation: Agent Install
Install account: DOMAIN\account
Error Code: C0002976
Error Description: Unknown error 0xC0002976

C0002976

1.  Windows Installer service is not running or set to disabled – set this to manual or auto and start it.

The Agent Management Operation Agent Install failed for remote computer servername.domain.com.
Install account: DOMAIN\account
Error Code: 80070643
Error Description: Fatal error during installation.
Microsoft Installer Error Description:
For more information, see Windows Installer log file "C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\servernameAgentInstall.LOG
C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\servernameMOMAgentMgmt.log" on the Management Server.

80070643

1.  Enable the automatic Updates service…. Install the agent – then disable the auto-updates service if desired.
Call was canceled by the message filter

80010002

1.  Install latest SP and retry. One server that failed did not have Service pack installed

The MOM Server could not find directory \\I.P.\C$\WINDOWS\. Agent will not be installed on computer "name". Please verify the required share exists.

80070006

1.  Manual agent install

Possible locking on registry?

http://www.sysadmintales.com/category/operations-manager/

Try manual install.

Verified share does not exist.

The network path was not found.

80070035

1.  Manual agent install

The Agent Management Operation Agent Install failed for remote computer "name". There is not enough space on the disk.

80070070

1.  Free space on install disk

The MOM Server failed to perform specified operation on computer "name". The semaphore timeout period has expired.

80070079

NSlookup failed on server. Possible DNS resolution issue.

Try adding dnsname to dnssuffix search list.

The MOM Server could not start the MOMAgentInstaller service on computer "name" in the time.

8007041D

80070102

NSlookup failed on server. Possible DNS resolution issue.

Verify domain is in suffix search list on management servers.

Sometimes – the Windows Firewall service – even if disabled – will have a stuck rule.  Run:  netsh advfirewall firewall delete rule name="MOM Agent Installer Service"

The Agent Management Operation Agent Install failed for remote computer "name"

80070643

1.  Ensure automatic updates service is started
2.  Rebuild WMI repository
3.  DNS resolution issue

The Agent Management Operation Agent Install failed for remote computer "name". Another installation is already in progress.

80070652

Verify not in pending management. If yes, remove and then attempt installation again.

The MOM Server detected that computer "name" has an unsupported operating system or service pack version

80072977

1.  Install latest SP and verify you are installing to a Windows system.

Not discovered

1.  Agent machine is not a member of the domain

Ping fails

1.  Server is down
2.  Server is blocked by firewall
3.  DNS resolving to wrong IP.

Fail to resolve machine

1.  DNS issue

The MOM Server failed to perform specified operation on computer "name". Not enough server storage…

8007046A

1.  This is typically a memory error caused by the remote OS that the agent is being installed on.

There are currently no logon servers available to service the logon request.

8007051F

1.  Possible DNS issue

This installation package cannot be installed by the Windows Installer service. You must install a Windows service pack that contains a newer version of the Windows Installer service.

8007064D

1.  Install Windows Installer 3.1
The network address is invalid

800706AB

Possible DNS name resolution issue.

Tried nslookup on server name and did not get response.

Verify domain is in suffix search list on management servers.

The MOM Server failed to perform specified operation on computer servername.domain.com

80070040

1.  Ensure agent push account has local admin rights

The MOM Server detected that the actual NetBIOS name SERVERNAME is not the same as the given NetBIOS name provided for remote computer SERVERNAME.domain.com.

80072979

1.  Correct DNS/WINS issue.
2.  Try pushing to NetBIOS name

The Operations Manager Server cannot process the install/uninstall request for computer xxxxxxx due to failure of operating system version verification

80070035

When Error Code: 80070035 appears with a Console based installation of the OpsMgr Agent, and the targeted systems are Windows Server 2008 based systems which have had their security hardened by using the Security Configuration Wizard, check to see whether the Server service is running.
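One remediation that shows up over and over in the table above is the Windows Firewall blocking the RPC/DCOM and NetBIOS ports needed for push.  If the built-in firewall is the culprit, enabling the remote administration exception is usually enough.  A sketch (the group/service names below are the standard built-in ones, but verify them on your OS version):

# Windows Server 2008 and later (Windows Firewall with Advanced Security)
netsh advfirewall firewall set rule group="remote administration" new enable=yes

# Windows Server 2003 / XP SP2 firewall
netsh firewall set service RemoteAdmin enable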
     

Getting and keeping the SCOM agent on a Domain Controller – how do YOU do it?


I’d like to hear some community feedback on this….

 

In OpsMgr – deploying a SCOM agent to a DC often presents companies with a bit of a challenge.  The reason is – in order to install software to a DC and manage it – we need rights on the DC to accomplish this.  These rights are needed, anytime we are going to deploy an agent, hotfix an agent, or run a repair on a broken agent to keep the agent healthy.

When we push agents from the console, the default account used to perform the push is the Management Server Action Account.  If this account does not have Domain Admin rights – the push will fail to a DC, with an Access Denied.  We do allow the option to type in temporary (encrypted) credentials, which are used to deploy the agent, one time, and then are discarded.  See the image below:

clip_image002

 

Here is a list of the most common options I have observed in place at customer sites… and potential custom options that can be developed.  I’d be interested in any community feedback on any options you are using that I don't cover or haven't seen before.

 

 

1. Grant the Management Server Action account Domain Admin or Builtin\Administrators.

a. Not recommended as a best practice, this gives rights to the MSAA that are not required for day to day activities.

b. Con - SCOM Admins now control a domain admin account.

 

2. Grant a SCOM Administrator a special domain account, for this purpose, that is a domain admin.

a. This allows us to track the actions of that SCOM admin, when he/she uses that special privileged account.

b. That SCOM admin will be able to do repairs, hotfixes, and deployments for DC’s.

c.  Con – Domain Admin teams often won't delegate these rights as they are tightly controlled.

 

3. The SCOM admin team delegates console based agent management to a Domain Administrator for DC agent health.

a.  The domain admin must become a SCOM Admin, and therefore could potentially hurt the SCOM environment.

b.  Pro – the admins in charge of the DC’s now have full responsibility to keep the agents healthy.

c.  Con – the Domain Admins might not understand components of SCOM, and create something that impacts the monitoring environment.

 

4. The SCOM admin team must partner with the Domain Admin team, and have the Domain Administrator type in his credentials any time the SCOM administrator needs to deploy/hotfix/repair an agent on a domain controller.

a. This is a bit more labor intensive… because the SCOM admin must wait for a domain admin to be available to work on DC agents, but tight security boundaries are maintained.

 

5. All DC based agents will be manually installed/updated/repaired.

a. This is very common, when the two teams do not trust each other.  The Domain Admin team is now required to manually deploy agents to domain controllers, and keep them up to date, and healthy.

 

6. Use a software deployment tool already in place to deploy/update/repair agents.

a. If a software deployment tool is already in place on DC’s, like SMS/SCCM, you can create packages to deploy, hotfix, and repair agents, similar to your patching of the OS today.

 

7. Customized solution:  Create a Run-As account that is a domain admin, one time, for use in agent deployment/repair.

a. This involves the domain admin typing in credentials ONCE, into a RUN-AS account, which is stored securely and encrypted in the SCOM database. 

b. This run-as account can be associated with a run-as profile, which is used by a custom task, which will remotely deploy the agent to the domain controller.  This task will execute under the security context of the privileged run-as account.

c. The benefit is that the domain admin gets to control the password for this account, the SCOM admin does not need to know the account credentials.

d. The downside, is that this run-as account could potentially be leveraged by some other workflow, if a SCOM admin intentionally misused it…. Similar to solution #2 above.

e.  This is just an idea I had – curious if anyone has already developed a solution like this?

The Cluster Service will automatically restart itself


Something I ran across with a customer.

There aren’t many situations where service recoveries run automatically in Microsoft MP’s, but this is one case where they do.  A running cluster service is critical to a healthy cluster.  In the current cluster MP, the service monitor for the cluster service will automatically start the cluster service on a node if it detects that it has stopped.  There is a recovery action on the monitor to do just that.

 

image

 

 

As always – if you don't like this intended behavior – you can override just the recovery, and disable it.

 

Why do you need to know this?

Because – some service packs for clustered applications require you to stop the cluster service in order to apply.  If you stop this service on a node while doing application maintenance, SCOM will restart it almost immediately.  The correct solution – is to use Maintenance Mode in SCOM, which will unload the monitors, and hence any automatic recoveries will no longer run.  So…. make SURE you are effectively using maintenance mode if you ever need to stop your cluster service, or disable this automatic recovery action.

Are your agents restarting every 10 minutes? Are you sure?

R2 – Improved Agent Proxy Alerts


Here is a nice addition in R2:  When we give you the old “agent proxy alert”, we now tell you the name of the Agent that needs agent proxy enabled, and resolve the name of the object type that it was bringing in:

 

Nice improvement.  I enable agent proxy for SQL1CLN2 and get on with my day.

 

image


Health Service and MonitoringHost thresholds in R2 – how this has changed and what you should know

Fixing troubled agents


Sometimes agents will not “talk” to the management server upon initial installation, and sometimes an agent can become unhealthy long after working fine.  Agent health is an ongoing task in any OpsMgr Admin’s life.

This post is NOT an “end to end” manual of all the factors that influence agent health…. but that is something I am working on for a later time.  There are so many factors in an agent’s ability to communicate and work as expected.  A few key areas that commonly affect this are:

  • DNS name resolution (Agent to MS, and MS to Agent)
  • DNS domain membership (disjointed)
  • DNS suffix search order
  • Kerberos connectivity
  • Kerberos SPN’s accessible
  • Firewalls blocking 5723
  • Firewalls blocking access to AD for authentication
  • Packet loss
  • Invalid or old registry entries
  • Missing registry entries
  • Corrupt registry
  • Default agent action accounts locked down/out (HSLockdown)
  • HealthService Certificate configuration issues.
  • Hotfixes required for OS Compatibility
  • Management Server rejecting the agent

 

How do you detect agent issues from the console?  The problem might be that they are not showing up in the console at all!  Perhaps it is a manual install that never shows up in Pending Actions?  Or a push deployment that stays stuck in Pending Actions and never shows up under “Agent Managed”.  Or even one that does show up under “Agent Managed” but never shows as being monitored… never returning agent version data, etc.

 

One of the BEST things you can do when faced with an agent health issue… is to look on the agent, in the OperationsManager event log.  This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent.  That is ALWAYS one of my first steps in troubleshooting.

 

Another way of examining Agent health – is by the built in views in OpsMgr.  In the console – there is a view – Located at the following:

 

image

 

 

This view is important – because it gives us a perspective of the agent from two different points:

1.  The perspective of the agent monitors running on the agent, measuring its own “health”.

2.  The perspective of the “Health Service Watcher”, which is the agent being monitored from a Management Server.

 

If any of these are red or yellow – that is an excellent place to start.  This should be an area that your level 1 support for Operations Manager checks DAILY.  We should never have a high number of agents that are not green here.  If they aren't – this is indicative of an unhealthy environment, or of the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc.).

Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.

 

Now…. the following are some general steps to take to “fix” broken agents.  These are not in definitive order.  The order of steps really comes down to what you find when looking at the logs after taking these steps.

 

  • Start the HealthService on the agent.  You might find the HealthService is just not running.  This should not be common or systemic.  Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure.  However – if this is systemic – it is indicative of something causing your HealthService to restart too frequently, or administrators stopping SCOM.  Look in the OpsMgr event log for verification.

 

  • Bounce the HealthService on the agent.  Sometimes this is all that is needed to resolve an agent issue.  Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.

 

  • Clear the HealthService queue and config (manually).  This is done by stopping the HealthService.  Then deleting the “\Program Files\System Center Operations Manager 2007\Health Service State” folder.  Then start the HealthService.  This removes the agent config file, and the agent queue files.  The agent starts up with no configuration, so it will resort to the registry to determine what management server to talk to.  From the registry – it will find out if it is AD integrated, or which fixed management server to talk to if not.  This is located under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\, in the \<#>\NetworkName string value.  The agent will contact the management server – request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while.  This is very much what happens on a new agent during initial deployment.  (A rough scripted sketch of these flush steps appears after this list.)

 

  • Clear the HealthService queue and config (from the console).  When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class) there is a task in the actions pane – “Flush Health Service State and Cache”.  This will perform a very similar action to the one above…. as a console task.  This will only work on an agent that is somewhat responsive…. if it does not work, you need to perform this manually, as the agent is truly broken from a communication standpoint with the management server.  This task will never complete, and will not return success – because the task cuts itself off as the queue is flushed.

 

  • “Repair” the agent from the console.  This is done from the Administration pane – Agent Managed.  You should not run a repair on any AD-integrated agent – as this will break the AD integration and assign it to the management server that ran the repair action.  A “repair” technically just reinstalls the agent in a push fashion, just like an initial agent deployment.  It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.

 

  • Reinstall the agent (manually).  This would be for manual installs or when push/repair is not possible.  This is where the combination of options gets a little tricky.  When you are at this point – where you have given up on the simpler fixes – I find that going all the way with a brute force reinstall is the best way.  This means performing the following steps:
    • Uninstall the agent via add/remove programs.
    • Run the Operations Manager Cleanup Tool CleanMom.exe or CleanMOM64.exe.  This is designed to make sure that the service, files, and all registry entries are removed.
    • Ensure that the agent’s folder is removed at:  \Program Files\System Center Operations Manager 2007\
    • Ensure that the following registry keys are deleted:
      • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
      • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService
    • Reboot the agent machine (if possible)
    • Delete the agent from Agent Managed in the OpsMgr console.  This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always required.
    • Now that the agent is gone cleanly from both the OpsMgr console and the agent Operating System…. manually reinstall the agent.  Keep it simple – install it using a named management server/management group, and use Local System for the agent action account (this removes any common issues with a low-priv domain account, and with AD integration if used).  If it works correctly – you can always reinstall again using low priv or AD integration.
    • Remember to import certificates at this point if you are using those on the individual agent.
    • As always – look in the OperationsManager event log…. this will tell you if it connected, and is working, or if there is a connectivity issue.
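Going back to the manual queue/config flush described earlier in this list, here is a rough scripted version of those steps.  It assumes the default agent install path and uses the PROD1 management group name from the registry path above; both are placeholders:

# Rough sketch of the manual flush: stop the agent, delete the cached config
# and queues, check which management server it will talk to, then restart.
Stop-Service HealthService

Remove-Item "C:\Program Files\System Center Operations Manager 2007\Health Service State" -Recurse -Force

# The agent rebuilds config from the registry at startup - this lists the
# parent management server(s) it will contact (the NetworkName value under each numbered key).
Get-ChildItem "HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services" |
    ForEach-Object { (Get-ItemProperty $_.PSPath).NetworkName }

Start-Service HealthService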

 

There are many things that can cause an agent issue, and many methods to troubleshoot them.  To summarize at a very general level, my typical steps are:

  1. Review OpsMgr event log on agent
  2. Bounce HealthService
  3. Bounce HealthService clearing \Health Service State folder.
  4. Complete brute force reinstall of the agent.

If an external issue is causing the problem (DNS, Kerberos, firewall) then these steps likely will not help you…. but those issues should be visible in the OpsMgr event log.

 

Also – make sure you see my other posts on agent health and troubleshooting during deployment:

Console based Agent Deployment Troubleshooting table

Agent discovery and push troubleshooting in OpsMgr 2007

Getting lots of Script Failed To Run alerts- WMI Probe Failed Execution- Backward Compatibility

Agent Pending Actions can get out of synch between the Console, and the database

Which hotfixes should I apply-

Keep your management pack names SHORT in SP1!


I have seen this twice now… so I will blog about it.  It seems to be rare in the wild, but it will completely cripple a management group when this occurs.  So beware SP1 users!

 

This article does not apply to R2.  This is only an issue in OpsMgr 2007 SP1.

 

When you create your custom management packs – and especially your override management packs – keep the names as simple and short as possible.  There is an issue in OpsMgr SP1 – when an agent tries to download management packs – where it will fail if the MP ID (derived from the MP Name) is too long.  The worst part about this problem is that there WON’T be an error logged.  What will happen – is that the agent will keep trying to re-download the MP in question, and it will block ALL other MP’s from being downloaded from that point forward.

There is no simple way to know this condition is impacting you.  What will happen – is that an agent will continue to work just fine… but will not get any NEW management packs.  So… you might think all is well, but not so.  The symptoms that might lead you to notice that something is wrong:

  • Performance data not collected for newer MP’s.
  • Objects not being discovered for newer MP’s.
  • Alerts not generating as expected from newer MP’s.

 

The root problem is an age-old Windows issue…. file paths over 255 characters are not well supported.  This has been resolved in R2 by changing how the agent copies the files over.

In both cases I have seen – someone was creating an override MP for the “IBM Hardware Management Pack for IBM System x and BladeCenter x86 Blade Systems” management pack.  So – when they created their override MP – they named it something like:  “Overrides - IBM Hardware Management Pack for IBM System x and BladeCenter x86 Blade Systems”

This equated to a Management Pack ID of:  Override.IBM.Hardware.Management.Pack.for.IBM.System.x.and.BladeCenter.x.Blade.Systems

That MP ID is 86 characters!

 

What happens is that when this management pack is created, and an override (or custom rule) is placed in it, the agents that require it will:

  • Get contacted by the RMS to update their config, and then issue a config change request (event 21024 in the event log)
  • Receive new config from the RMS (event 21025)
  • Process the new config, realize they need a new MP, and request that MP (event 1200)

 

Where this process breaks…. is that the next step in the chain should be that the agent RECEIVES the MP (event 1201) and then issues a statement that the new config has become active (event 1210).  These never happen.

 

Behind the scenes…. from looking at an ETL trace log, we can see this failing when we try to move the file from the “Downloaded Files” folder to the “Management Packs” folder:

Error CMPFileManager::MoveManagementPackFile(MPFileManager_cpp383)MoveFile from '\\?\C:\Program Files\System Center Operations Manager 2007\Health Service State\Downloaded Files\MGNAME\1\Override.IBM.Hardware.Management.Pack.for.IBM.System.x.and.BladeCenter.x.Blade.Systems.{26504FED-2FF4-4AC4-A63D-59BF8C09F51F}.{7136257C-1791-7BAB-7072-2FA24284C102}.xml' to 'C:\Program Files\System Center Operations Manager 2007\Health Service State\Management Packs\Override.IBM.Hardware.Management.Pack.for.IBM.System.x.and.BladeCenter.x.Blade.Systems.{26504FED-2FF4-4AC4-A63D-59BF8C09F51F}.{7136257C-1791-7BAB-7072-2FA24284C102}.xml' failed with code 3(ERROR_PATH_NOT_FOUND).

 

In the example above – the bad path is:

C:\Program Files\System Center Operations Manager 2007\Health Service State\Management Packs\Override.IBM.Hardware.Management.Pack.for.IBM.System.x.and.BladeCenter.x.Blade.Systems.{26504FED-2FF4-4AC4-A63D-59BF8C09F51F}.{7136257C-1791-7BAB-7072-2FA24284C102}.xml

Which is 261 characters.  The limit is 255. 
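Breaking that down: the default “…\Health Service State\Management Packs\” folder accounts for 93 characters, the MP ID adds 86, each of the two GUIDs (with its leading dot and braces) adds 39, and “.xml” adds 4.  93 + 86 + 39 + 39 + 4 = 261.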

 

Therefore – I recommend you keep your Management Pack *ID* to less than 60 characters.   You can examine your long management packs by looking in the console – generally your longest display names will be the longest ID’s:

 

image

 

Even some Microsoft MP’s are dangerously close to the limit…. such as Microsoft.SystemCenter.VirtualMachineManager.Pro.2008.VMWare.HostPerformance at 76 characters.  In most environments you can squeak by at 79 characters…. more or less depending on where the agent is installed.

 

Here is a SQL query you can run against the OpsDB to detect this condition…. and quickly check all your potentially long MP’s:

 

select MPName from managementpack
WHERE len(MPName) > 60

Just change 60 to whatever character count you want. 

 

DON’T freak out if you have some longer than 60.  Just be aware.

 image 

DO freak out if you have some longer than 80!

 

If someone had the time and was really handy – you could write a monitor, targeted at the RMS, that would query the OpsDB with this query and turn the RMS unhealthy when the count is over a threshold.  THAT would be cool…. and it would alert you when some author makes a really long MP that has the potential to break all your agents.
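As a very rough sketch of what the data source for such a monitor could look like, the script below runs the query against the OpsDB and returns the count in a property bag; the SQL server/instance, database name, and threshold are all placeholders, and how you wrap it into a monitor (an R2 PowerShell property bag probe, or a VBScript equivalent) is left to you:

# Rough sketch only: count MPs with long IDs and return the result in a
# property bag a two-state monitor could evaluate against zero.
# OPSDBSERVER\INSTANCE and the database name are placeholders.
$threshold = 80
$conn = New-Object System.Data.SqlClient.SqlConnection "Server=OPSDBSERVER\INSTANCE;Database=OperationsManager;Integrated Security=SSPI"
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = "select count(*) from ManagementPack where len(MPName) > $threshold"
$longMPCount = [int]$cmd.ExecuteScalar()
$conn.Close()

$api = New-Object -ComObject "MOM.ScriptAPI"
$bag = $api.CreatePropertyBag()
$bag.AddValue("LongMPCount", $longMPCount)
$bag   # the monitor would turn unhealthy (and alert) when LongMPCount is greater than 0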

 

Another idea I had… was to create a correlated missing event monitor…. and when you get a 1200, but do NOT get a 1201 within, say, 15 minutes….. that might be a problem.  Of course if you wrote this and were already impacted…. the bad agents would never get your new MP to tell you.  :-) 

29106 event on RMS – Index was out of range. Wait. What?


Was working with a customer on this one – figured it might help others.

Saw a lot of these VERY SPECIFIC 29106 events on the RMS, specifically with the text: 

System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.

 

Here is the full event:

Event Type:      Warning
Event Source:    OpsMgr Config Service
Event Category:  None
Event ID:        29106
Date:            11/10/2009
Time:            12:43:24 PM
User:            N/A
Computer:        AGENTNAME
Description:
The request to synchronize state for OpsMgr Health Service identified by "3688d65d-a16c-2be6-7e84-5faf8a9cffe0" failed due to the following exception "System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index

What we found was that we could look up these health service IDs by pasting them into the following SQL query:

select * from MTV_HealthService
where BaseManagedEntityId = '3688d65d-a16c-2be6-7e84-5faf8a9cffe0'

This would give us the name of the agent.

In the console, under Agent Managed – we found all of these agents were in “Unmonitored” state – on the agents themselves, they were stuck.  They looked like they got installed, but could not get config.  We deleted them from agent managed, waited a few minutes, and let them show back up in Pending Management.  Approved them – then they were able to come back in and work properly.  These looked for the most part like orphaned machines, and several were computers that were renamed, or old DC’s that were demoted.

The new and improved guide on HealthService Restarts. Aka – agents bouncing their own HealthService


I have written many articles in the past on HealthService restarts.  A HealthService restart is when the agent breaches a pre-set threshold of memory use or handle count, and OpsMgr bounces the agent HealthService to try to correct the condition.

The Past:

Here are a few of the previous articles:

http://blogs.technet.com/kevinholman/archive/2009/03/26/are-your-agents-restarting-every-10-minutes-are-you-sure.aspx

http://blogs.technet.com/kevinholman/archive/2009/06/22/health-service-and-monitoringhost-thresholds-in-r2-how-this-has-changed-and-what-you-should-know.aspx

 

Generally – this is a good thing.  We expect the agent to consume a limited amount of system resources, and if this is ever breached, we assume something is wrong, so we bounce the agent.  The problem is that if an agent NEEDS more resources to do its job – it can get stuck in a bouncing loop every 10-12 minutes, which means there is very little monitoring of that agent going on.  It also can harm the OpsMgr environment, because if this is happening on a large scale, we flood the OpsMgr database with state change events.  You will also see the agent consume a LOT of CPU resources during the startup cycle – because each monitor has to initialize its state at startup, and all discoveries without a specific synch time will run at startup.

 

However, sometimes it is NORMAL for the agent to consume additional resources.  (within reason)

The limits at OpsMgr 2007 RTM were set to 100MB of private bytes and 2000 handles.  This was enough for the majority of agents out there.  Not all though, especially since the release of the Server 2008 OS and the use of 64-bit operating systems.  Many server roles require some additional memory, because they run very large discovery scripts, or discover a very large instance space.  Like DNS servers, because they discover and monitor so many DNS zones.  DHCP servers, because they discover and monitor so many scopes.  Domain controllers, because they can potentially run a lot of monitoring scripts and discover many AD objects.  SQL servers, because they discover and monitor multiple DB engines and databases.  Exchange 2007 servers, etc…

 

What’s new:

At the time of this writing, two new management pack updates have been released.  One for SP1, and one for R2.  EVERY customer should be running these MP updates.  I consider them critical to a healthy environment:

R2 MP Update version 6.1.7533.0

SP1 MP Update version 6.0.6709.0

What these MP updates do – is synchronize both versions of OpsMgr to work exactly the same, and bump up the resource threshold levels to a more typical amount.  So FIRST – get these imported if you don't have them.  Yes, now.  This alone will solve the majority of HealthService restarts in the wild.  These set the Private Bytes threshold to 300MB (up from 100MB), and the Handle Count threshold to 6000 (up from 2000) for all agents.  These are MUCH better default settings than we had previously.
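If you want to spot-check what a particular agent is actually consuming against those thresholds, something like this run on the agent gives a quick read (a sketch; PrivateMemorySize64 roughly corresponds to the private bytes counter the monitors use):

# Quick spot-check of current HealthService / MonitoringHost consumption,
# to compare against the 300MB private bytes and 6000 handle thresholds.
Get-Process HealthService, MonitoringHost -ErrorAction SilentlyContinue |
    Select-Object Name, HandleCount,
        @{Name="PrivateMB"; Expression={ [math]::Round($_.PrivateMemorySize64 / 1MB, 1) }}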

 

How can I make it better?

I’m glad you asked!  Well, there are two things you can do, to enhance your monitoring of this very serious condition. 

  1. Add alerting to a HealthService restart so you can detect this condition when it occurs.
  2. Override these monitors to higher thresholds for specific agents/groups.

Go to the Monitoring pane, Discovered Inventory, and change target type to “Agent”. 

Select any agent present – and open Health Explorer.

Expand Performance > Health Service Performance > Health Service State.

image

 

This is an aggregate rollup monitor.  If you look at the properties of this top level monitor – you will see the recovery script to bounce the HealthService is on THIS monitor…. it will run in response to ANY of the 4 monitors below it which might turn Unhealthy.

 

image

 

So – we DON’T want to set this monitor to also create the alerts.  Because this monitor can only tell us that “something” was beyond the threshold.  We actually need to set up alerting on EACH of the 4 monitors below it – so we will know whether the problem is with the HealthService or MonitoringHost, and whether it is memory (private bytes) or handle count.

The first thing to do – is to inspect the overrides on each monitor, to make sure you haven't already adjusted this in the past.  ANY specific overrides LESS than the new defaults of 300MB and 6000 handles should be deleted.  (The Exchange MP has a sealed override of 5000 handles and this is fine.)

What I like to do – is to add an override, “For all objects of Class”.  Enable “Generates Alert”.  I also ensure that the default value for “Auto-Resolve Alert” is set to false.  It is critical that auto-resolve is not set to True for this monitor, because we would just close the alert on every agent restart and the alert would be worthless.  What this will do – is generate an alert and never close it, anytime this monitor is unhealthy.  I need to know this information so I can be aware of very specific agents that might require a higher value:

image

 

Repeat this for all 4 monitors.

 

One thing to keep in mind – if you ever need to adjust this threshold for specific agents that are still restarting – 600MB of private bytes (double the default) is generally a good setting.  It is rare to need more than this – unless you have a very specific MP or application that guides you to set this higher for a specific group of agents.

Also – be careful overriding this value across the board… because Management Servers also have a “HealthService” and you could inadvertently set this to be too low for them.  Generally – the default settings are very good now – and you should only be changing this for very specific agents, or a very specific group of agents.

Now – you can use these alerts to find any problem agents out there.  I strongly recommend setting this up for any management group.  You NEED to know when agents are restarting on their own.
