Identity & Access Management Solutions: Sun Directory Server Issues

Problem statement: what the problem or perceived problem is, and what its effects are.

Environmental changes: list all changes made before the issue was first seen. This includes network changes, system/OS changes, DS configuration changes, application changes, application load changes, etc.

DS Data (required*):

Full DS Version

32 bit versions

cd <install root>/lib

./ns-slapd -D <path to slapd instance> -V

64 bit versions

cd <install root>/lib/64

./ns-slapd -D <path to slapd instance> -V
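As a rough sketch, the two cases above can be folded into one helper. The function name `slapd_lib_dir` and the paths in the example comment are illustrative assumptions, not part of the product.

```shell
#!/bin/sh
# Hypothetical helper: print the directory that holds ns-slapd for a
# 32-bit or 64-bit install, following the two cases listed above.
slapd_lib_dir() {
    install_root=$1
    bits=$2
    case "$bits" in
        32) printf '%s\n' "$install_root/lib" ;;
        64) printf '%s\n' "$install_root/lib/64" ;;
        *)  echo "usage: slapd_lib_dir <install root> 32|64" >&2
            return 1 ;;
    esac
}

# Example (both paths are placeholders):
#   cd "$(slapd_lib_dir /path/to/install/root 64)" && ./ns-slapd -D /path/to/slapd-instance -V
```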

DS Install Type; zip/pkg

cd <install root>/dsee6/lib/bin

./carpet | grep carpetIsNativePkg

dse.ldif

found in <path to slapd instance>/config

DS Access & Error Logs

found in <path to slapd instance>/logs

DS Schema

found in <path to slapd instance>/config/schema
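The three items above (dse.ldif, logs, schema) all live under the instance directory, so they can be archived in one step. This is a minimal sketch; `collect_ds_data` is a made-up name, and it assumes the instance has the `config` and `logs` subdirectories described above.

```shell
#!/bin/sh
# Hypothetical helper: archive dse.ldif, the schema directory and the logs
# from one slapd instance into a single tar file.
collect_ds_data() {
    instance=$1      # <path to slapd instance>
    out=$2           # absolute path of the tar file to create
    [ -f "$instance/config/dse.ldif" ] || {
        echo "no dse.ldif under $instance/config" >&2
        return 1
    }
    # config/ already contains dse.ldif and schema/; logs/ holds the access and error logs
    ( cd "$instance" && tar cf "$out" config logs )
}
```

Log directories can be large, so check free space on the target filesystem before running it.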

OS Data:

Get one of the following based on your OS version

/etc/release

/etc/redhat-release

/etc/SuSE-release

uname -a

prtconf -v

pkginfo -l
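Only one of the release files listed above exists on a given system. A small sketch for picking whichever is present follows; `find_release_file` is an illustrative name, and the optional root prefix exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Print the first release file that exists. The optional argument is a root
# prefix, used only to try the function against a test tree.
find_release_file() {
    root=${1:-}
    for f in /etc/release /etc/redhat-release /etc/SuSE-release; do
        if [ -f "$root$f" ]; then
            printf '%s\n' "$root$f"
            return 0
        fi
    done
    return 1
}

# Example:
#   cat "$(find_release_file)"; uname -a
```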

Hung or Unresponsive Process

NOTE: See DirTracer's Configurator Option 1

Process Hung [1]

A hung or unresponsive directory server process is one that has stopped responding to client requests.

In some cases the DS process is not actually hung, frozen or wedged; instead, one thread is holding the rest in a locked state until it frees up. If this happens, the directory server will continue to accept new connections but will not process their requests. You will also see the following error once the directory server has exhausted all file descriptors (for connections). See: nsslapd-maxdescriptors

[24/Jul/2009:3:45:13 +0000] - ERROR<12293> - Connection - conn=-1 op=-1 msgId=-1 - fd limit exceeded Too many open file descriptors - not listening on new connection
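To see how often the server has hit this condition, the message can be counted in the error log. The sketch below assumes the log is the file named `errors` under <path to slapd instance>/logs; `count_fd_limit_errors` is an illustrative name.

```shell
#!/bin/sh
# Count "fd limit exceeded" lines in a DS error log.
count_fd_limit_errors() {
    grep -c 'fd limit exceeded' "$1"
}

# Example (path is a placeholder):
#   count_fd_limit_errors /path/to/slapd-instance/logs/errors
```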

References:

DS5 http://docs.sun.com/app/docs/doc/820-2768/gehxm?l=en&a=view

http://docs.sun.com/app/docs/doc/820-0437/6nc66m9qh?a=view

Data Capture (required):

It is good to get the following every 1-5 seconds. This helps show process movement and/or trends, i.e. whether the process is really hung or is changing over time.

pstack <pid>

prstat -L -n <# of ds threads>,<# of ds threads> -p <pid> 0 1

netstat -an

iostat -xnMCz -T d <INTERVAL> <NUMBEROFCHECKS>
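The repeated sampling described above can be scripted. The following is only a sketch under several assumptions: a POSIX shell (for example ksh or /usr/xpg4/bin/sh on Solaris), a made-up function name `capture_loop`, and an arbitrary default of 50 for the thread count. prstat is given the process with -p, which is how it selects a pid.

```shell
#!/bin/sh
# Sketch only: sample a process at a fixed interval and keep each tool's
# output in its own timestamped file. Tools that are not installed are skipped.
capture_loop() {
    pid=$1            # slapd process id
    interval=$2       # seconds between samples (1-5 suggested above)
    checks=$3         # number of samples
    outdir=$4         # directory for the output files
    threads=${5:-50}  # assumed default for <# of ds threads>

    mkdir -p "$outdir" || return 1
    i=0
    while [ "$i" -lt "$checks" ]; do
        ts=$(date +%Y%m%d-%H%M%S).$i
        if command -v pstack >/dev/null 2>&1; then
            pstack "$pid" > "$outdir/pstack.$ts" 2>&1
        fi
        if command -v prstat >/dev/null 2>&1; then
            prstat -L -n "$threads,$threads" -p "$pid" 0 1 > "$outdir/prstat.$ts" 2>&1
        fi
        if command -v netstat >/dev/null 2>&1; then
            netstat -an > "$outdir/netstat.$ts" 2>&1
        fi
        i=$((i + 1))
        if [ "$i" -lt "$checks" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Example: 60 samples, 5 seconds apart
#   capture_loop <pid> 5 60 /var/tmp/ds-capture
```

iostat takes its own interval and count, so it can simply be started once alongside the loop.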

If space is available, the best way to get to the root cause of a DS hang is to grab one or more gcore files (where gcore is available on your OS version). This allows Sun Support to debug the actual process contents to see which thread(s) are holding the process in a lock.

gcore -o <gcore file name and date> <pid>
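Because a gcore file is roughly as large as the process itself, it helps to check free space first and to put the date in the file name. This is a hedged sketch; the function names and the `slapd.gcore` prefix are assumptions, and the `df -k` parsing will not cope with mount points that contain spaces.

```shell
#!/bin/sh
# Free space, in KB, on the filesystem holding directory $1.
free_kb() {
    df -k "$1" | awk 'END { print $(NF-2) }'
}

# Succeed only if directory $1 has at least $2 KB free.
has_space() {
    avail=$(free_kb "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Build a dated core file name under directory $1.
gcore_name() {
    printf '%s/slapd.gcore.%s\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# Example (needed KB is a value you choose, such as the process size from ps -o vsz):
#   has_space /var/tmp <needed KB> && gcore -o "$(gcore_name /var/tmp)" <pid>
```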

When you manually snap a gcore, Sun will require and request the libraries used by the slapd server process in order to debug the gcore contents. This is important because library versions vary widely across OS releases. Using Pkgapp allows the customer to quickly snapshot the server's OS, DS libraries, pstack output, etc., which speeds up the debugging process.

Pkgapp: for full usage see the pdf document that comes with Pkgapp.

Examples:

32 bit process

./pkgapp -c <gcore> -p <install root>/lib

64 bit process

./pkgapp -c <gcore> -p <install root>/lib/64

df -k

/var/adm/messages

3. High CPU

NOTE: See DirTracer's Configurator Option 2

High CPU [2]

High CPU issues can occur when the directory server is handling many LDAP operations that involve very restrictive or excessive ACIs, unindexed searches, group-based searches, etc.

Data Capture (required *):

It is good to get the following every 1-5 seconds. This helps show process movement and/or trends, i.e. whether the CPU usage is steady, growing or shrinking.

pms.sh <slapd pid> <interval> <numberofchecks>

pstack <pid>

prstat -L -n <# of ds threads>,<# of ds threads> -p <pid> 0 1

netstat -an

iostat -xnMCz -T d <INTERVAL> <NUMBEROFCHECKS>

cn=monitor searches
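A cn=monitor search is an ordinary base search against the monitor entry. The sketch below reuses the connection options shown elsewhere in this post. The wrapper name `monitor_search` and the `LDAPSEARCH` override variable are assumptions added so the command can be previewed without a server.

```shell
#!/bin/sh
# Run a base search on cn=monitor. Set LDAPSEARCH=echo to preview the
# command line instead of running it.
monitor_search() {
    host=$1; port=$2; password=$3
    "${LDAPSEARCH:-ldapsearch}" -h "$host" -p "$port" \
        -D "cn=Directory Manager" -w "$password" \
        -b "cn=monitor" -s base "(objectClass=*)"
}

# Example:
#   monitor_search host port password >> /tmp/cn-monitor.out
```

Running it at the same interval as the other captures makes the outputs easier to line up.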

If space is available, the best way to get the root cause of a High CPU issue is to grab one or more gcore files (when available on an os version). This allows Sun Support to debug the actual process contents to see which thread(s) are using the most cpu as well as what the thread is actually processing.

gcore -o <gcore file name and date> <pid>

Pkgapp: for full usage see the pdf document that comes with Pkgapp.

Examples:

32 bit process

./pkgapp -c <gcore> -p <install root>/lib

64 bit process

./pkgapp -c <gcore> -p <install root>/lib/64

df -k

/var/adm/messages

ACI search. Listing all ACI's will show us

4. Replication

NOTE: See DirTracer's Configurator Option 3

Replication [3]

Replication issues can be seen in many ways.

Updates not going through; i.e. replication broken between one or more servers.

Updates slow to get through

References:

DS5 http://docs.sun.com/app/docs/doc/820-0437/6nc66m9qj?a=view

DS6 http://docs.sun.com/app/docs/doc/820-2768/replication?a=view

Replication debug logging for 20 minutes on each of the affected servers. For example, if replication is slow or broken from one master to a single consumer, get debug logging from both of these servers at the same time. Remember to note the current infolog-area etc. values before you change them to replication debug logging (8192). Once you have gathered the logs for 20 minutes, change them back to the old values.

$ ldapmodify -h host -p port -D "cn=Directory Manager" -w password

dn: cn=config

changetype: modify

replace: nsslapd-infolog-area # nsslapd-errorlog-level in 5.1

nsslapd-infolog-area: 8192
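The note, change and restore sequence can be sketched with one small LDIF generator, so the same function both enables debug logging and puts the old value back. `infolog_ldif` is an illustrative name; host, port and password are placeholders as in the commands above.

```shell
#!/bin/sh
# Print the LDIF that sets nsslapd-infolog-area to the given value.
infolog_ldif() {
    cat <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-infolog-area
nsslapd-infolog-area: $1
EOF
}

# Example sequence:
#   ldapsearch -h host -p port -D "cn=Directory Manager" -w password \
#       -b "cn=config" -s base "(objectClass=*)" nsslapd-infolog-area
#   infolog_ldif 8192 | ldapmodify -h host -p port -D "cn=Directory Manager" -w password
#   sleep 1200    # 20 minutes
#   infolog_ldif <old value> | ldapmodify -h host -p port -D "cn=Directory Manager" -w password
```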

ruv searches from the broken backend on each of the affected servers.

$ ldapsearch -h host -p port -D "cn=Directory Manager" -w password -b "<replicated suffix>" -s base "(&(objectclass=nstombstone)(nsUniqueId=ffffffff-ffffffff-ffffffff-ffffffff))"

cn=config search

$ ldapsearch -h host -p port -D "cn=Directory Manager" -w password -b "cn=config" -s base "objectClass=*"

insync. Note the output from insync and whether the delay(s) are getting worse, getting better or staying the same.

The insync command indicates the state of synchronization between a master replica and one or more consumer replicas. The following command shows the state over a period of 30 seconds.

server-root/shared/bin/insync -s "cn=Directory Manager:password@hostname1:ldap-port" -c "cn=Directory Manager:password@hostname2:ldap-port" 30

repldisc. Repldisc or "Replication Discovery" will display the replication topology in a text based matrix

server-root/shared/bin/repldisc -D "cn=Directory Manager" -w password -b <replicated suffix> -s host:ldap-port

5. Crashing

NOTE: See DirTracer's Configurator Option 4

Crashing [4]

When the directory server process has unexpectedly died, gather the following data. For instructions on preparing your system to produce core files or crash dumps in the event of a crash, see 1.6 Configuring the Operating System to Generate Core Files.
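One quick check before relying on a core file is whether the shell that starts the server permits them at all. This is a sketch that assumes a shell whose `ulimit` accepts `-c` (ksh, bash and dash do); `core_files_enabled` is a made-up name, and the referenced guide remains the authority on full OS setup.

```shell
#!/bin/sh
# Succeed if the current shell's soft core-size limit allows core files.
core_files_enabled() {
    limit=$(ulimit -c)
    [ "$limit" != "0" ]
}

# Example, run in the shell that starts the directory server:
#   ulimit -c unlimited
#   core_files_enabled || echo "core files are disabled for this shell" >&2
# On Solaris, running coreadm with no arguments prints the current core file settings.
```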

References:

DS5 http://docs.sun.com/app/docs/doc/820-0437/data-for-crash-problems?a=view

OS http://docs.sun.com/app/docs/doc/820-0437/6nc66m9ql?a=view

corefile (unix)/crash dump (windows): pkgapp with the -i switch on Unix to "Include" the corefile.

pkgapp (unix based systems only).

./pkgapp -i -c <corefile> -p <full path (path only) to process binary> -s <path to write tar file>

Pkgapp will gather the following automatically

OS info

file corefile

pstack corefile. Execute pkgapp to see its full usage.

pmap corefile

pldd corefile

pflags corefile

df -k (unix based systems only)

6. Memory Leaks

NOTE: See DirTracer's Configurator Option 5

Memory Leak [5]

Memory leaks are a very troublesome problem to gather data for.

References:

DS6 http://docs.sun.com/app/docs/doc/820-2768/gegyp?a=view

Unix OSes:

Before proceeding with the next set of steps use pms.sh to determine the memory growth profile of your directory server process. Sun Support can plot the data from pms.sh to show if over time there is a real issue with memory not freeing.

1) start the directory server instance

2) prime the directory server's caches by using the following search.

$ ldapsearch -h host -p port -D "cn=Directory Manager" -w password -b "<suffix>" -s sub "(&(objectClass=*))" "*" >> /dev/null

3) Launch pms.sh (or perfmon) with a large enough parameter (number of checks) that we can see the process size increase significantly.

Ex: ./pms.sh 60 10000000000 >> /tmp/pms.mem.out

4) Test/use the ldap applications.

pms.sh can be found in the <DirTracer Install Location>dirtracertools for various Unix OSes

pms.sh - http://www.sun.com/bigadmin/scripts/indexSjs.html
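If pms.sh is not at hand, a similar growth profile can be approximated with plain `ps`. This is a stand-in and not pms.sh itself: it records only the virtual and resident size in KB, the name `sample_mem` is made up, and the output format is not the one Sun Support plots.

```shell
#!/bin/sh
# Print "timestamp vsz_kb rss_kb" for one process, or fail if it is gone.
sample_mem() {
    sizes=$(ps -o vsz= -o rss= -p "$1") || return 1
    set -- $sizes            # split "vsz rss" into $1 and $2
    [ $# -eq 2 ] || return 1
    printf '%s %s %s\n' "$(date +%Y%m%d-%H%M%S)" "$1" "$2"
}

# Example: one sample a minute until the process exits
#   while sample_mem <slapd pid>; do sleep 60; done >> /tmp/mem.out
```

A process size that keeps climbing after the caches are primed and the load is steady is the pattern to look for.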

Once it has been determined there is a leak, you can use one of the following methods for determining which function(s) are not freeing memory.

7) Server Down

NOTE: See DirTracer's Configurator for the Server Down option

Same as 0 - Basic

8) Startup issues

NOTE: See DirTracer's Configurator Option 7
Basic Capture [7]

Most startup issues can be dealt with using truss/strace/tusc/DebugView etc and trace debugging from the directory server.

truss etc

debug logging

Solaris
truss -fea -rall -wall -o /tmp/truss.log ./start-slapd

HP-UX
tusc -v -fealT -rall -wall -o /tmp/truss.out ./start-slapd

Redhat/SuSE
strace -fv -o /tmp/strace.out ./start-slapd

Windows
DebugView is available at http://www.sysinternals.com/Utilities/DebugView.html.
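For scripts that run on more than one Unix platform, the choice of trace tool above can be keyed off `uname -s`. A minimal sketch follows; `trace_tool` is an illustrative name, and mapping Redhat/SuSE to the `Linux` value of `uname -s` is an assumption.

```shell
#!/bin/sh
# Print the system-call tracer listed above for an OS name as reported by
# "uname -s". With no argument, the current system is used.
trace_tool() {
    case "${1:-$(uname -s)}" in
        SunOS) echo truss ;;
        HP-UX) echo tusc ;;
        Linux) echo strace ;;
        *)     return 1 ;;
    esac
}

# Example:
#   tool=$(trace_tool) || echo "no known tracer for this OS" >&2
```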

It may also be beneficial to gather directory server debug logging during the startup process, using the setting below. Note the current value before you change it.

Once you have gathered the debug logs (after the directory server starts), change it back to the old value.

$ ldapmodify -h host -p port -D "cn=Directory Manager" -w password
dn: cn=config
changetype: modify
replace: nsslapd-infolog-area # nsslapd-errorlog-level in 5.1
nsslapd-infolog-area: 1

9) High IO

NOTE: See DirTracer's Configurator Option 7 for Basic Capture w/ gcores.

High IO issues can occur when the directory server is dealing with many ldap operations such as massive writes, purging, unindexed searches, group based searches etc.

Data Capture (required *):

It is good to get the following every 1-5 seconds. This helps show process movement and/or trends, i.e. whether the I/O load is steady, growing or shrinking.

pms.sh <slapd pid> <interval> <numberofchecks>

pstack <pid>

prstat -L -n <# of ds threads>,<# of ds threads> -p <pid> 0 1

netstat -an

iostat -xnMCz -T d <INTERVAL> <NUMBEROFCHECKS>

cn=monitor searches

If space is available, the best way to get the root cause of a High IO issue is to grab one or more gcore files (when available on an os version). This allows Sun Support to debug the actual process contents to see which thread(s) are using the most cpu as well as what the thread is actually processing.

gcore -o <gcore file name and date> <pid>

Pkgapp: for full usage see the pdf document that comes with Pkgapp.

Examples:

32 bit process

./pkgapp -c <gcore> -p <install root>/lib

64 bit process

./pkgapp -c <gcore> -p <install root>/lib/64

df -k

/var/adm/messages

11) SSL Cert issues

certutil -L -N -W trusted_db_passwd

12) Install Issues

References:

DS5 http://docs.sun.com/app/docs/doc/820-0437/6nc66m9qg?a=view/

DS6 http://docs.sun.com/app/docs/doc/820-2768/install?a=view

install logs

truss output

typescript or screen captures help.

Truss:

Solaris

truss -fea -rall -wall -o /tmp/truss.log <install command>

HP-UX

tusc -v -fealT -rall -wall -o /tmp/truss.out <install command>

Redhat/SuSE

strace -fv -o /tmp/strace.out <install command>

Windows

DebugView is available at http://www.sysinternals.com/Utilities/DebugView.html.

Install Logs:

For Java Enterprise System installations, collect installation error logs.

The log file is named after the date and time that the installation failed. For example, a log file for an installation that failed on December 16 at 3:32 p.m. would have a name like Java_Enterprise_System*_install.B12161532.

On Solaris systems, installation logs are located under /var/sadm/install/logs.

On Red Hat and HP-UX systems, installation logs are located under /var/opt/sun/install/logs.

On Windows systems, installation logs are located
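The Unix log locations above can be looked up by OS name when collecting data from several hosts. This is a sketch only; `install_log_dir` is a made-up name, and treating the `Linux` value of `uname -s` as the Red Hat case is an assumption.

```shell
#!/bin/sh
# Print the Java Enterprise System install log directory for an OS name as
# reported by "uname -s". With no argument, the current system is used.
install_log_dir() {
    case "${1:-$(uname -s)}" in
        SunOS)       echo /var/sadm/install/logs ;;
        Linux|HP-UX) echo /var/opt/sun/install/logs ;;
        *)           return 1 ;;
    esac
}

# Example: list the install logs, newest first
#   ls -t "$(install_log_dir)"/Java_Enterprise_System*_install.B*
```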

13) DSCC/Console issues

See the following for gathering data on the DSCC

http://docs.sun.com/app/docs/doc/820-2768/gexfm?a=view

14) Schema issues

See 0 - Basic

changes to the directory server schema

examples of the problem schema errors/user info etc.

Sunday, October 9, 2011
