
Monday, March 21, 2011

After node restart RAC cluster does not start

Today we were very busy trying to figure out why, after a system restart, the RAC nodes would no longer start the cluster stack.

The biggest problem was that there was absolutely no logging at all!
The system seemed completely unable to start.

All the checks we ran to get the cluster online were, in our opinion, successful.
We were able to create an ocrdump, and query the voting disks.

Finally we started to dig into the processes that were running.
We found a process called:

/etc/init.cssd startcheck

This process seemed to hang, waiting for something.
Reading the script, we saw that this function checks whether all required resources are available and, as long as they are not, sleeps for 60 seconds before checking again.
We found that it logs through the AIX syslog facility, but our syslog.conf was not set up to capture those messages.

After a while we decided to run a "startcheck" of our own, using the debug (trace) option of the Korn shell:

# ksh -x /etc/init.cssd startcheck

The trace showed that the script writes some logging to files in /tmp called cssxxxx, where xxxx is a numeric value.
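To find the most recent of these files quickly, a small helper like the one below could be used. It is only a sketch: the function name newest_css_log is made up for illustration, and the css[0-9]* pattern is an assumption based on the "cssxxxx" names we saw in /tmp.

```shell
#!/bin/sh
# Print the most recently modified file named css<digits> in a directory.
# ASSUMPTION: the css[0-9]* pattern is derived from the "cssxxxx" names
# observed in /tmp; adjust it if the files are named differently.
newest_css_log() {
    # $1: directory to search (defaults to /tmp)
    # ls -t sorts newest first; head keeps only the first entry.
    # Errors are discarded so that "no such file" yields empty output.
    ls -t "${1:-/tmp}"/css[0-9]* 2>/dev/null | head -n 1
}
```

Calling `newest_css_log` without arguments looks in /tmp; the printed path can then be passed to cat or more.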

The last of these files showed that the voting disks were missing.
This contradicts the output of

# $CRS_HOME/bin/crsctl query votedisk

which still showed everything was fine with the voting disks.

The real problem was that the 'crsctl' command only checks that the device files are present and readable.
Whether any disks were actually attached to these device files was not checked.
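A check that goes one step further than looking at the device file is to try to read a block through it. The sketch below does that with dd. The function names can_read_block and check_devices are made up for illustration, and it rests on the assumption that a device file without a disk behind it returns an error on open or read; we did not verify this against the situation described here, and the 4096-byte block size is likewise only a guess at something raw devices will accept.

```shell
#!/bin/sh
# Return 0 if one block can be read through the given path, non-zero otherwise.
# ASSUMPTION: a device file with no disk attached fails on open or read.
can_read_block() {
    # $1: path to the device file (for example a voting disk device)
    dd if="$1" of=/dev/null bs=4096 count=1 2>/dev/null
}

# Try every path given, print OK or FAILED for each one,
# and return non-zero if at least one of them could not be read.
check_devices() {
    failed=0
    for dev in "$@"; do
        if can_read_block "$dev"; then
            echo "OK      $dev"
        else
            echo "FAILED  $dev"
            failed=1
        fi
    done
    [ "$failed" -eq 0 ]
}
```

The device paths reported by the crsctl query can be passed to `check_devices` as arguments; a FAILED line would point at a device file that exists but cannot actually be read.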

It turned out that a major network problem two days earlier had disturbed something on the SAN, making it impossible to connect the logical drives to this server.


What I really do not like here is the complete lack of logging in the usual CRS logging location, $CRS_HOME/log/.
Maybe future releases (we are using 11.1.0.7) will handle this problem better.