Wednesday, July 14, 2010

AIX-based RAC reboots nodes by itself

Problem statement:

After a couple of weeks running without any problems, our two-node RAC running Oracle release 11.1.0.7 on AIX 5.3 TL10 started rebooting its nodes by itself.


Analysis:

Analyzing the CRS, ASM and RDBMS logfiles simply showed that the server had terminated. There were no log entries of a graceful shutdown of the cluster resources.

In contrast, the AIX errpt showed "USER SHUTDOWN" as the reason for the reboot. From the viewpoint of AIX, someone or something had issued a system shutdown.
Because the servers didn't crash but restarted instead, the UNIX admins confirmed it must have been a shutdown command with a restart option that brought the system down.
In such a case, however, one would expect the cluster resources to go down gracefully, which would have been visible in the CRS logs.
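
For reference, the AIX error report is easy to check; the summary and detailed views of the system error log are shown by errpt (this is generic AIX usage, nothing specific to this problem):

#errpt | more
#errpt -a | more

In our case the summary list showed the reboots with "USER SHUTDOWN" as the cause, even though none of us had issued one.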

Oracle support confirmed the sudden restart of the system from the time gap in the logfiles: the logging of actions suddenly stopped for all resources and only started again when the cluster resources were coming back up.


Resolving the issue:

Because we were unable to find a cause ourselves, and Oracle support was searching in the direction of a system problem (believing that it had not been a RAC process that restarted the server), the S/R we had created somehow got stuck. Nothing new was being discovered.

Only after yet another unscheduled reboot and escalating the S/R to a priority 1 did Oracle support start asking about the process status of the 'oprocd' daemon.
That triggered us to start browsing the internet for information about the 'oprocd' daemon.
Using Note 265769.1 we pinpointed that the reason for the reboot had to lie with oprocd.

Finally we found a logfile that made sense. In /etc/oracle/oprocd we found the log written just before the reboot. It clearly stated:

AlarmHandler: timeout(1892 msec) exceeds interval(1000 msec)+margin(500 msec). Rebooting NOW.
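
For completeness, this is how we located that file. The exact log file names under /etc/oracle/oprocd differ per node (on our nodes they contain the hostname), so a directory listing sorted by modification time is the easiest way in; the *.log pattern below is what we saw on our nodes, not a guaranteed naming convention:

#ls -ltr /etc/oracle/oprocd
#tail /etc/oracle/oprocd/*.log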

What does this mean, and what is 'oprocd'?

The 'oprocd' process is a process monitor daemon. It guards the node by setting a timer to 1000 ms and then sleeping for 1000 ms. As soon as it wakes up, it checks whether the sleep took as long as the timer it set. The maximum allowed deviation from that timer is the margin, which is 500 ms (half a second) by default. If oprocd wakes up beyond that margin, it decides to reboot the system.
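
To get a feel for what oprocd measures, the shell one-liner below mimics the check: it tries to sleep exactly 1000 ms, measures how long the sleep really took, and flags every wake-up that is more than 500 ms late. This is only an illustration of the mechanism (it assumes perl with Time::HiRes is available, which it was on our AIX boxes); the real oprocd of course reboots the node instead of printing a line.

perl -MTime::HiRes=time,sleep -e '
  while (1) {
    my $start = time();
    sleep(1.0);                                   # intended interval: 1000 ms
    my $late = (time() - $start) * 1000 - 1000;   # how late did we wake up?
    printf("woke up %.0f ms late%s\n", $late,
           $late > 500 ? "  <-- beyond the 500 ms margin" : "");
  }'

Run it while the system is under heavy load and it becomes clear how easily that 500 ms margin can be exceeded.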

Now that we finally had a concrete message to work with, we dumped it into Google, which came up with a really helpful page: http://www.aixmind.com/?p=845
This page clearly described all the problems we had been having, including all the symptoms and logfile entries.
It also explained a couple of steps on how to solve the problem.

We sent the page to Oracle support, asking if we could be suffering from that problem. After an hour we received an "Action plan" to solve it. The action plan Oracle support sent us matched the instructions on the page we had sent for about 90%.
Oracle support also informed us that this was a known problem, appearing as early as release 10.2.0.4.
They confirmed that the default values 'oprocd' runs with (interval 1000 ms and margin 500 ms) are too low. In the newest release - I reckon 11.2 - these defaults will be increased.
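
As far as we could tell, the interval and margin oprocd is currently running with show up as the -t and -m arguments in its process listing, so a quick way to see the active values on a node is:

#ps -ef | grep oprocd | grep -v grep

Before the change described below, our nodes showed the defaults mentioned above (-t 1000 -m 500).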

I think it is unfortunate that it took three weeks to come up with a 'known' problem, and then only after we sent Oracle support a hint.


The solution:
Oracle support sent us the following action plan to implement:

1. Execute as root
#crsctl stop crs

2. Execute as root
#<CRS_HOME>/bin/oprocd stop

3. Ensure that the Clusterware stack is down by running ps -ef | egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd". This should return no processes.

4. From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command as root:
#crsctl set css diagwait 13 -force

5. From one node of the cluster execute the following command as root:
#crsctl unset css misscount -force

6. Restart the Oracle Clusterware on all the nodes by executing:
#crsctl start crs

7. Validate that the node is running by executing:
#crsctl check crs

This solution increases the value of the margin from 500 ms to 10000 ms, so 20 times higher.

This will:
a) make it far less likely that oprocd ends up outside the total time period (11000 ms);
b) if something still pushes oprocd beyond that time, give the system an additional 9500 ms to write more informational logs to disk.
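
Once the clusterware is back up, the new setting can be verified; at least in our 11.1 CRS the value is reported by:

#crsctl get css diagwait

The same ps check shown earlier can be repeated to confirm that oprocd is now running with the larger -m value.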


Unfortunately our RAC nodes did not come up correctly after the change.

The 'crs_stat -t' showed:

# ./crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....D1.inst application ONLINE OFFLINE
ora....D2.inst application ONLINE OFFLINE
ora.WCSPRD.db application ONLINE OFFLINE
ora....RD1.srv application ONLINE OFFLINE
ora....RD2.srv application ONLINE OFFLINE
ora....sprd.cs application ONLINE OFFLINE
ora....SM1.asm application ONLINE ONLINE vsv1h171ps
ora....PS.lsnr application ONLINE ONLINE vsv1h171ps
ora....1ps.gsd application ONLINE ONLINE vsv1h171ps
ora....1ps.ons application ONLINE ONLINE vsv1h171ps
ora....1ps.vip application ONLINE ONLINE vsv1h171ps
ora....SM2.asm application ONLINE ONLINE vsv1h181ps
ora....PS.lsnr application ONLINE ONLINE vsv1h181ps
ora....1ps.gsd application ONLINE ONLINE vsv1h181ps
ora....1ps.ons application ONLINE ONLINE vsv1h181ps
ora....1ps.vip application ONLINE ONLINE vsv1h181ps

and the alert logs of the database instances kept showing:

Tue Jul 13 22:11:18 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:12:08 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:12:56 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:13:24 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:13:51 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:28:17 2010
Starting ORACLE instance (normal)

The nodes had to be completely rebooted to get everything up and running again.
This might be explained by the fact that oprocd is built into the AIX kernel, which would mean the kernel has to be reloaded after the above changes.