Sunday, October 9, 2011

Sun Directory Server Issues



  1. Problem statement; what the problem or perceived problem is. What are the effects of the problem?
Environmental changes: list all changes made before the issue took place or was first seen. This includes network changes, system/OS changes, DS config changes, application changes, application load changes etc.

DS Data (required*):

Full DS Version
32 bit versions
cd <install root>/lib
./ns-slapd -D <path to slapd instance> -V
64 bit versions
cd <install root>/lib/64
./ns-slapd -D <path to slapd instance> -V

DS Install Type; zip/pkg
cd <install root>/dsee6/lib/bin
./carpet | grep carpetIsNativePkg

dse.ldif
found in <path to slapd instance>/config

DS Access & Error Logs
found in <path to slapd instance>/logs

DS Schema
found in <path to slapd instance>/config/schema
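
The DS data listed above can be gathered in one pass. The following is a minimal sketch, not part of DirTracer or Pkgapp; the function name collect_ds_basics and the output directory are illustrative assumptions, and it expects an output directory that does not already hold an earlier capture.

```shell
#!/bin/sh
# Sketch: copy the basic Directory Server data for one instance into a
# single directory that can then be archived and sent to support.
# collect_ds_basics is a hypothetical helper name.

collect_ds_basics() {
    instance=$1   # <path to slapd instance>
    outdir=$2     # directory that receives the copies

    mkdir -p "$outdir" || return 1

    # dse.ldif is found in <instance>/config
    if [ -f "$instance/config/dse.ldif" ]; then
        cp "$instance/config/dse.ldif" "$outdir/"
    fi

    # access and error logs are found in <instance>/logs
    if [ -d "$instance/logs" ]; then
        cp -r "$instance/logs" "$outdir/logs"
    fi

    # schema files are found in <instance>/config/schema
    if [ -d "$instance/config/schema" ]; then
        cp -r "$instance/config/schema" "$outdir/schema"
    fi

    return 0
}

# Example (paths are placeholders):
#   collect_ds_basics "<path to slapd instance>" /tmp/ds-basics
```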

OS Data:
Get one of the following based on your OS version
/etc/release
/etc/redhat-release
/etc/SuSE-release
uname -a
prtconf -v
pkginfo -l
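
Only one of the release files exists on a given system, so a small script can pick up whichever is present. This is a sketch only; collect_os_info and its optional root-prefix argument are hypothetical, and prtconf and pkginfo are run only where they exist (they are Solaris tools).

```shell
#!/bin/sh
# Sketch: print the OS data listed above, skipping what is not present.
# collect_os_info is a hypothetical helper; the optional argument is a
# directory prefix, which is empty for the live system.

collect_os_info() {
    root=${1:-}

    for f in /etc/release /etc/redhat-release /etc/SuSE-release; do
        if [ -f "$root$f" ]; then
            echo "== $f"
            cat "$root$f"
        fi
    done

    echo "== uname -a"
    uname -a

    # Solaris-only tools: run them only if they are on the PATH
    if command -v prtconf >/dev/null 2>&1; then
        echo "== prtconf -v"
        prtconf -v || true
    fi
    if command -v pkginfo >/dev/null 2>&1; then
        echo "== pkginfo -l"
        pkginfo -l || true
    fi
}

# Example:
#   collect_os_info > /tmp/os-info.out
```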

  2. Hung or Unresponsive Process
NOTE: See DirTracer's Configurator Option 1
Process Hung [1]

A hung or unresponsive directory server process is one that has stopped responding to client requests.

In some cases the DS process is not actually hung, frozen or wedged; instead, one thread is holding the rest in a locked state until it frees up. If this happens, the directory server will continue to accept new connections but will not process their requests. You will also see the following error once the directory server has exhausted all file descriptors available for connections. See: nsslapd-maxdescriptors

[24/Jul/2009:3:45:13 +0000] - ERROR<12293> - Connection - conn=-1 op=-1 msgId=-1 - fd limit exceeded Too many open file descriptors - not listening on new connection
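
A quick way to check whether a server has hit this condition is to count the matching lines in its errors log. This is a sketch; count_fd_limit_errors is a hypothetical helper, and the search string is taken from the message shown above.

```shell
#!/bin/sh
# Sketch: count "fd limit exceeded" messages in a DS errors log.
# count_fd_limit_errors is a hypothetical helper name.

count_fd_limit_errors() {
    # $1: path to the errors log
    # grep -c exits non-zero when the count is 0, so mask that status
    grep -c 'fd limit exceeded' "$1" || true
}

# Example (path is a placeholder):
#   count_fd_limit_errors "<path to slapd instance>/logs/errors"
```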

References:
DS6 http://docs.sun.com/app/docs/doc/820-2768/gehxm?l=en&a=view
DS5 http://docs.sun.com/app/docs/doc/820-0437/6nc66m9qh?a=view

Data Capture (required):

It is good to get the following every 1-5 seconds. This helps show process movement and/or trends, i.e. whether the process is really hung or is changing over time.

pstack <pid>
prstat -L -n <# of ds threads>,<# of ds threads> -p <pid> 0 1
netstat -an
iostat -xnMCz -T d <INTERVAL> <NUMBEROFCHECKS>
If space is available, the best way to get the root cause of a DS Hang is to grab one or more gcore files (when available on an os version). This allows Sun Support to debug the actual process contents to see which thread(s) are holding the process in a lock.
gcore -o <gcore file name and date> <pid>

When you manually snap a gcore, Sun will require and request the libraries used by the slapd server process in order to debug the gcore contents. This is important due to the numerous OS variations with regard to library versions. Using Pkgapp allows the customer to quickly snapshot the server's OS information, DS libraries, pstack output etc., which speeds up the debugging process.

Pkgapp: for full usage see the pdf document that comes with Pkgapp.

Examples:
32 bit process
./pkgapp -c <gcore> -p <install root>/lib
64 bit process
./pkgapp -c <gcore> -p <install root>/lib/64

df -k
/var/adm/messages
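
The repeated captures described above (pstack, prstat, netstat every few seconds) can be scripted so that each sample lands in its own file. The following is a rough sketch and not what DirTracer does internally; capture_loop, its argument order and the output file naming are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: run one command repeatedly and keep every sample.
# Usage: capture_loop INTERVAL COUNT OUTDIR NAME COMMAND [ARGS...]
# Each run is written to OUTDIR/NAME.<n>.out, starting with a date line.
# capture_loop is a hypothetical helper name.

capture_loop() {
    interval=$1
    count=$2
    outdir=$3
    name=$4
    shift 4

    mkdir -p "$outdir" || return 1

    i=1
    while [ "$i" -le "$count" ]; do
        {
            date
            "$@" || echo "command failed: $*"
        } > "$outdir/$name.$i.out" 2>&1

        # no need to wait after the final sample
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}

# Examples (12 samples, 5 seconds apart; <pid> is the slapd pid):
#   capture_loop 5 12 /tmp/ds-hang pstack pstack <pid>
#   capture_loop 5 12 /tmp/ds-hang netstat netstat -an
```

Running one loop per command in the background (with a trailing &) keeps the samples of the different tools roughly aligned in time.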

3. High CPU

NOTE: See DirTracer's Configurator Option 2
High CPU [2]

High CPU issues can occur when the directory server is dealing with many LDAP operations or with expensive ones: very restrictive or excessive ACIs, unindexed searches, group-based searches etc.

Data Capture (required *):

It is good to get the following every 1-5 seconds. This helps show process movement and/or trends, i.e. whether the CPU usage is steady, growing or shrinking.

pms.sh <slapd pid> <interval> <numberofchecks>
pstack <pid>
prstat -L -n <# of ds threads>,<# of ds threads> -p <pid> 0 1
netstat -an
iostat -xnMCz -T d <INTERVAL> <NUMBEROFCHECKS>
cn=monitor searches
If space is available, the best way to get the root cause of a High CPU issue is to grab one or more gcore files (when available on an os version). This allows Sun Support to debug the actual process contents to see which thread(s) are using the most cpu as well as what the thread is actually processing.

gcore -o <gcore file name and date> <pid>

When you manually snap a gcore, Sun will require and request the libraries used by the slapd server process in order to debug the gcore contents. This is important due to the numerous OS variations with regard to library versions. Using Pkgapp allows the customer to quickly snapshot the server's OS information, DS libraries, pstack output etc., which speeds up the debugging process.

Pkgapp: for full usage see the pdf document that comes with Pkgapp.

Examples:
32 bit process
./pkgapp -c <gcore> -p <install root>/lib
64 bit process
./pkgapp -c <gcore> -p <install root>/lib/64

df -k
/var/adm/messages
ACI search. Listing all ACI's will show us

4. Replication

NOTE: See DirTracer's Configurator Option 3
Replication [3]

Replication issues can be seen in many ways.

Updates not going through; i.e. replication broken between one or more servers.
Updates slow to get through
References:
DS5 http://docs.sun.com/app/docs/doc/820-0437/6nc66m9qj?a=view
DS6 http://docs.sun.com/app/docs/doc/820-2768/replication?a=view

Replication debug logging for 20 minutes on each of the affected servers. For example, if replication is slow or broken from one master to a single consumer, then get debug logging from each of these servers at the same time. Remember to note the current nsslapd-infolog-area value (the attribute is nsslapd-errorlog-level in 5.1) before you change it to replication debug logging (8192). Once you have gathered the logs for 20 minutes, change it back to the old value.
$ ldapmodify -h host -p port -D "cn=Directory Manager" -w password
dn: cn=config
changetype: modify
replace: nsslapd-infolog-area
nsslapd-infolog-area: 8192
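
Because the level has to be set and later restored, it can help to generate the LDIF from a small function instead of typing it twice. This is a sketch; loglevel_ldif is a hypothetical helper, and the attribute name is passed in so that nsslapd-errorlog-level can be used on 5.1.

```shell
#!/bin/sh
# Sketch: print an LDIF modify record that replaces one logging
# attribute on cn=config. loglevel_ldif is a hypothetical helper name.
# Usage: loglevel_ldif ATTRIBUTE VALUE

loglevel_ldif() {
    printf 'dn: cn=config\nchangetype: modify\nreplace: %s\n%s: %s\n' \
        "$1" "$1" "$2"
}

# Example (host, port, password and <old value> are placeholders):
#   loglevel_ldif nsslapd-infolog-area 8192 |
#       ldapmodify -h host -p port -D "cn=Directory Manager" -w password
#   sleep 1200    # gather logs for 20 minutes
#   loglevel_ldif nsslapd-infolog-area <old value> |
#       ldapmodify -h host -p port -D "cn=Directory Manager" -w password
```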

ruv searches from the broken backend on each of the affected servers.
$ ldapsearch -h host -p port -D "cn=Directory Manager" -w password -b "<replicated suffix>" -s base "(&(objectclass=nstombstone)(nsUniqueId=ffffffff-ffffffff-ffffffff-ffffffff))"
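
When the RUV output from two servers is compared, the value of interest is usually the last (maximum) CSN recorded for each replica ID. The sketch below pulls those out of saved ldapsearch output. It assumes the usual nsds50ruv layout of "{replica <id> <url>} <min CSN> <max CSN>" and that long lines may be folded LDIF-style; ruv_maxcsn is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read RUV search output on stdin and print "<replica id> <max CSN>"
# for every nsds50ruv value that carries CSNs.
# ruv_maxcsn is a hypothetical helper; the line layout is an assumption.

ruv_maxcsn() {
    awk '
        function emit(l,    f, n) {
            n = split(l, f, " ")
            # f[1]="nsds50ruv:", f[2]="{replica", f[3]=id, f[n]=max CSN
            if (f[1] == "nsds50ruv:" && f[2] == "{replica" && n >= 6)
                print f[3], f[n]
        }
        /^ / { line = line substr($0, 2); next }   # folded LDIF line
             { emit(line); line = $0 }
        END  { emit(line) }
    '
}

# Example (file names are placeholders for saved search output):
#   ruv_maxcsn < master.ruv   > master.max
#   ruv_maxcsn < consumer.ruv > consumer.max
#   diff master.max consumer.max
```

A replica ID whose maximum CSN differs between the two files, and keeps differing over repeated searches, points at the server that is not receiving updates.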

cn=config search

$ ldapsearch -h host -p port -D "cn=Directory Manager" -w password -b "cn=config" -s base "objectClass=*"

insync. Note the output from insync and whether the delay(s) are getting worse, getting better or staying the same.
The insync command indicates the state of synchronization between a master replica and one or more consumer replicas. The following command shows the state over a period of 30 seconds.
server-root/shared/bin/insync -s "cn=Directory Manager:password@hostname1:ldap-port" -c "cn=Directory Manager:password@hostname2:ldap-port" 30

repldisc. Repldisc or "Replication Discovery" will display the replication topology in a text based matrix

server-root/shared/bin/repldisc -D "cn=Directory Manager" -w password -b <replicated suffix> -s host:ldap-port

5. Crashing

NOTE: See DirTracer's Configurator Option 4
Crashing [4]

When the directory server process has unexpectedly died gather the following data. For instructions on preparing your system to produce core files or crash dumps in the event of a crash, see 1.6 Configuring the Operating System to Generate Core Files.

References:
DS5 http://docs.sun.com/app/docs/doc/820-0437/data-for-crash-problems?a=view
OS http://docs.sun.com/app/docs/doc/820-0437/6nc66m9ql?a=view

corefile (Unix) / crash dump (Windows): use pkgapp with the -i switch on Unix to "include" the corefile.
pkgapp (Unix based systems only). Execute pkgapp to see its full usage.
./pkgapp -i -c <corefile> -p <full path (path only) to process binary> -s <path to write tar file>
Pkgapp will gather the following automatically:

OS info
file corefile
pstack corefile
pmap corefile
pldd corefile
pflags corefile
df -k (Unix based systems only)

6. Memory Leaks

NOTE: See DirTracer's Configurator Option 5
Memory Leak [5]

Memory leaks are a very troublesome problem to gather data for.

References:
DS6 http://docs.sun.com/app/docs/doc/820-2768/gegyp?a=view

Unix OSes:

Before proceeding with the next set of steps use pms.sh to determine the memory growth profile of your directory server process. Sun Support can plot the data from pms.sh to show if over time there is a real issue with memory not freeing.

1) start the directory server instance
2) prime the directory server's caches by using the following search.

$ ldapsearch -h host -p port -D "cn=Directory Manager" -w password -b "<suffix>" -s sub "(&(objectClass=*))" "*" > /dev/null

3) Launch pms.sh (or perfmon) with a number of checks large enough to see the process size increase significantly.
Ex: ./pms.sh <slapd pid> 60 10000000000 >> /tmp/pms.mem.out
4) Test/use the ldap applications.

pms.sh can be found in the <DirTracer Install Location>dirtracertools for various Unix OSes
pms.sh - http://www.sun.com/bigadmin/scripts/indexSjs.html

Once it has been determined there is a leak, you can use one of the following methods for determining which function(s) are not freeing memory.


7. Server Down

NOTE: See DirTracer's Configurator for the Server Down option

Same as 0 - Basic

8. Startup issues
NOTE: See DirTracer's Configurator Option 7
Basic Capture [7]
Most startup issues can be dealt with using truss/strace/tusc/DebugView etc. and trace debugging from the directory server.
* truss etc.
* debug logging
Solaris
truss -fea -rall -wall -o /tmp/truss.out ./start-slapd
HP-UX
tusc -v -fealT -rall -wall -o /tmp/truss.out ./start-slapd
Redhat/SuSE
strace -fv -o /tmp/strace.out ./start-slapd
Windows
DebugView is available at http://www.sysinternals.com/Utilities/DebugView.html .
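
The choice of tracer per Unix platform can be wrapped in a small function keyed on the output of uname -s. This is a sketch; trace_command is a hypothetical helper that only prints the command line to run, and the truss line here uses a single -o option.

```shell
#!/bin/sh
# Sketch: print the tracing command for a given OS name (as reported by
# uname -s) and a command to trace. trace_command is a hypothetical helper.
# Usage: trace_command OS COMMAND [ARGS...]

trace_command() {
    os=$1
    shift
    case $os in
        SunOS) echo "truss -fea -rall -wall -o /tmp/truss.out $*" ;;
        HP-UX) echo "tusc -v -fealT -rall -wall -o /tmp/truss.out $*" ;;
        Linux) echo "strace -fv -o /tmp/strace.out $*" ;;
        *)
            echo "no tracer known for $os" >&2
            return 1
            ;;
    esac
}

# Example:
#   trace_command "$(uname -s)" ./start-slapd
```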
It may also be beneficial to gather directory server debug logging during the startup process. Note the current nsslapd-infolog-area value first (the attribute is nsslapd-errorlog-level in 5.1).
Once you have gathered the debug logs (after the directory server starts), change it back to the old value.
$ ldapmodify -h host -p port -D "cn=Directory Manager" -w password
dn: cn=config
changetype: modify
replace: nsslapd-infolog-area
nsslapd-infolog-area: 1


9. High IO

NOTE: See DirTracer's Configurator Option 7 for Basic Capture w/ gcores.

High IO issues can occur when the directory server is dealing with many ldap operations such as massive writes, purging, unindexed searches, group based searches etc.

Data Capture (required *):

It is good to get the following every 1-5 seconds. This helps show process movement and/or trends, i.e. whether the I/O load is steady, growing or shrinking.

pms.sh <slapd pid> <interval> <numberofchecks>
pstack <pid>
prstat -L -n <# of ds threads>,<# of ds threads> -p <pid> 0 1
netstat -an
iostat -xnMCz -T d <INTERVAL> <NUMBEROFCHECKS>
cn=monitor searches
If space is available, the best way to get the root cause of a High IO issue is to grab one or more gcore files (when available on an os version). This allows Sun Support to debug the actual process contents to see which thread(s) are using the most cpu as well as what the thread is actually processing.

gcore -o <gcore file name and date> <pid>

When you manually snap a gcore, Sun will require and request the libraries used by the slapd server process in order to debug the gcore contents. This is important due to the numerous OS variations with regard to library versions. Using Pkgapp allows the customer to quickly snapshot the server's OS information, DS libraries, pstack output etc., which speeds up the debugging process.

Pkgapp: for full usage see the pdf document that comes with Pkgapp.

Examples:
32 bit process
./pkgapp -c <gcore> -p <install root>/lib
64 bit process
./pkgapp -c <gcore> -p <install root>/lib/64

df -k
/var/adm/messages


11. SSL Cert issues

certutil -L -N -W trusted_db_passwd


12. Install Issues

References:
DS5 http://docs.sun.com/app/docs/doc/820-0437/6nc66m9qg?a=view/
DS6 http://docs.sun.com/app/docs/doc/820-2768/install?a=view

install logs
truss output
typescript or screen captures help.
Truss:

Solaris
truss -fea -rall -wall -o /tmp/truss.out <install command>

HP-UX
tusc -v -fealT -rall -wall -o /tmp/truss.out <install command>

Redhat/SuSE
strace -fv -o /tmp/strace.out <install command>

Windows
DebugView is available at http://www.sysinternals.com/Utilities/DebugView.html.

Install Logs:

For Java Enterprise System installations, collect installation error logs.
The log file is named after the date and time that the installation failed. For example, a log file for an installation that failed on December 16 at 3:32 p.m. would have a name like Java_Enterprise_System*_install.B12161532.
On Solaris systems, installation logs are located under /var/sadm/install/logs.
On Red Hat and HP-UX systems, installation logs are located under /var/opt/sun/install/logs.
On Windows systems, installation logs are located
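
The naming rule and the Unix log locations above can be combined to look for the log of a particular failed install. This is a sketch under stated assumptions: jes_log_suffix, install_log_dir and find_install_logs are hypothetical helpers, the suffix is taken to be B followed by month, day, hour and minute as in the example above, and "Linux" stands in for Red Hat.

```shell
#!/bin/sh
# Sketch: locate a Java Enterprise System install log by failure time.
# All three function names are hypothetical.

# jes_log_suffix MONTH DAY HOUR MINUTE -> B<MMDDhhmm> (24-hour clock)
jes_log_suffix() {
    awk -v mo="$1" -v d="$2" -v h="$3" -v mi="$4" \
        'BEGIN { printf "B%02d%02d%02d%02d\n", mo, d, h, mi }'
}

# install_log_dir OS -> log directory for that OS (OS as from uname -s)
install_log_dir() {
    case $1 in
        SunOS)       echo /var/sadm/install/logs ;;
        Linux|HP-UX) echo /var/opt/sun/install/logs ;;
        *)           return 1 ;;
    esac
}

# find_install_logs DIR SUFFIX -> matching log files, one per line
find_install_logs() {
    ls "$1"/Java_Enterprise_System*_install."$2" 2>/dev/null
}

# Example: an install that failed on December 16 at 3:32 p.m.
#   find_install_logs "$(install_log_dir "$(uname -s)")" \
#       "$(jes_log_suffix 12 16 15 32)"
```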

13. DSCC/Console issues

See the following for gathering data on the DSCC

http://docs.sun.com/app/docs/doc/820-2768/gexfm?a=view

14. Schema issues

See 0 - Basic
changes to the directory server schema
examples of the problem schema errors/user info etc.