OCR Repair – yet another scenario

A few months back I had written about a situation where the OCR file was corrupt and we had to repair the file to fix issues.  We ran into yet another scenario in our 10gR2 RAC environment.  This time we really did not notice any accessibility issues for the CRS, because the CRS stack would start just fine; however, the database, instances and/or the database services would not be started by the CRS.

While the basic solution is to restore the OCR file from a clean backup or export file, the troubleshooting and maintenance tasks may be a bit different. Checking the status of the applications managed by the clusterware, we noticed that the database, instances and database services did not start.

HA Resource                          Target      State
-----------                          ------      -----
ora.TSTDB.TSTDB1.inst                ONLINE      OFFLINE on tstdb1
ora.TSTDB.TSTDB2.inst                OFFLINE     OFFLINE on tstdb2
ora.TSTDB.TSTDB3.inst                OFFLINE     OFFLINE on tstdb3
ora.TSTDB.EXTVEN.TSTDB1.srv          OFFLINE     OFFLINE on tstdb1
ora.TSTDB.EXTVEN.TSTDB2.srv          OFFLINE     OFFLINE on tstdb2
ora.TSTDB.EXTVEN.TSTDB3.srv          OFFLINE     OFFLINE on tstdb3
ora.TSTDB.EXTVEN.cs                  OFFLINE     OFFLINE on tstdb1
ora.TSTDB.db                         OFFLINE     OFFLINE on tstdb1
ora.tstdb1.ASM1.asm                  ONLINE      ONLINE  on tstdb1
……………………….
……………………….  

Repeated attempts to start the database and instances using srvctl did not help.  So we opted to do some troubleshooting.  

Check the database and instances manually:  using SQL*Plus, try to start a database instance. If this works, then ASM and the DB are in good health and it is just the CRS that is not able to start or display the status of the database; it is not able to retrieve the database/instance/service definitions from the OCR file and perform the required operations.
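For example, a minimal manual check on one node might look like this (assuming the 10gR2 database environment is already set, and using instance TSTDB1 from the output above):

export ORACLE_SID=TSTDB1
sqlplus / as sysdba
SQL> startup
SQL> select instance_name, status from v$instance;

If the instance opens cleanly this way, the database itself is healthy and the problem is confined to the CRS/OCR layer.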

As user 'root', shut down CRS on all servers:

 /etc/init.d/init.crs stop 

Check if the /etc/oracle/ocr.loc file has the following entries..  

[oracle@tstdb2 mvallath]$ cat /etc/oracle/ocr.loc   
ocrconfig_loc=/dev/raw/ocr1                           <== PRIMARY COPY OF THE OCR FILE  
ocrmirrorconfig_loc=/dev/raw/ocr2            <== MIRRORED COPY OF THE OCR FILE  
local_only=FALSE  

If the cluster detects that the OCR file is corrupt, it will disable the corrupt file and continue with the mirrored copy.  In that case you may see that the ocr.loc file was altered by the clusterware.  Most of the time the mirrored copy should be good and the clusterware should start all applications.  However, in this case the mirrored copy was only partially good: all components of the clusterware stack and applications, with the exception of the database, instances and database services, were up and visible to the CRS.

[oracle@tstdb1 ~]$ cat /etc/oracle/ocr.loc  
#Device/file /dev/raw/ocr1 being deleted   <==ALTERED BY CLUSTERWARE..   
ocrconfig_loc=/dev/raw/ocr2                          <== ALTERED BY CLUSTERWARE..   
local_only=false  

Please note from the output above that the clusterware not only marked the primary copy as 'deleted' but also renamed the mirrored copy as the new primary copy and changed local_only to lowercase false (it was FALSE, in upper case, earlier). Before restoring a good copy of the OCR file, the ocr.loc file needs to be changed to reflect its original state. Manually edit this file with the appropriate entries, or copy a good ocr.loc (if the other servers have not been modified) from one of the other servers to this location.
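For example, assuming tstdb3 still had an unmodified copy, something along these lines (as user 'root' on the affected node) would put the original entries back:

scp tstdb3:/etc/oracle/ocr.loc /etc/oracle/ocr.loc
cat /etc/oracle/ocr.loc     (verify it again lists ocrconfig_loc, ocrmirrorconfig_loc and local_only=FALSE)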

Normally under such circumstances, CSS could core dump.  The dump file is located in the $ORA_CRS_HOME/log/<nodename>/cssd directory.  Analyzing the core dump we also found the following string: "#Device/file %s being deleted".

Once the ocr.loc file is updated on all servers in the cluster, import the OCR from the latest export file as user 'root':

/app/oracle/product/crs/ocrconfig -import /home/oracle/mvallath/OCRexp28Apr2010.dmp  

If you do not have a good export, you can take a good copy of the backup (the clusterware performs an automatic backup of the OCR file every 4 hours) and restore the OCR file:

/app/oracle/product/crs/ocrconfig -showbackup  
tstdb1     2010/04/28 09:05:52     /app/oracle/product/crs/cdata/tstcw 

List all the backups found in the location shown in the output above:

[oracle@tstdb1 ~]$ ls -ltr /app/oracle/product/crs/cdata/tstcw
total 50328
-rw-r--r-- 1 root root 8572928 Apr 27 21:05 day.ocr
-rw-r--r-- 1 root root 8572928 Apr 28 01:05 week.ocr
-rw-r--r-- 1 root root 8572928 Apr 28 01:05 day_.ocr
-rw-r--r-- 1 root root 8572928 Apr 28 01:05 backup02.ocr
-rw-r--r-- 1 root root 8572928 Apr 28 05:05 backup01.ocr
-rw-r--r-- 1 root root 8572928 Apr 28 09:05 backup00.ocr

Find the best version of the OCR backup file based on when the clusterware was stable and restore using the following command:  
/app/oracle/product/crs/ocrconfig -restore /app/oracle/product/crs/cdata/tstcw/day.ocr  

Start the clusterware on all nodes as user ‘root’
/etc/init.d/init.crs start  

Note:  Occasionally, once the restore or import is completed, the clusterware could go into a panic mode and reboot the server.  This is normal behavior; it typically happens when CRS has not completely shut down and has a hanging daemon on one of the servers.

This should start the clusterware and all the applications managed by the CRS.  Once the clusterware components have been verified and the database has been checked, it would be good practice to export the OCR file as a known good copy.
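For example, as user 'root' (the export file name here is only an illustration; pick any safe location outside the cluster storage):

/app/oracle/product/crs/ocrconfig -export /home/oracle/mvallath/OCRexp<date>.dmp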

Thanks to Baldev Marepally for helping troubleshoot this issue.


coexist 10gR2 and 11gR2 RAC db on the same cluster – Part II

I accidentally posted this blog entry over my previous entry on this same topic.  Thanks to Google, I was able to retrieve my old post from the Google cache and post it again to my blog.

———————————–

My previous post discussed the various stumbling blocks we encountered during our 10gR2 database installation in an 11gR2 environment. We took it a step at a time to troubleshoot and install the database, documenting and fixing the issues as we went. Yesterday, browsing through Metalink, I noticed a very recent article on the same subject:
Pre 11.2 Database Issues in 11gR2 Grid Infrastructure Environment [ID 948456.1]
which recommends several patches and steps that could help ease the installation process.

coexist 10gR2 and 11gR2 RAC db on the same cluster.. stumbling blocks

Due to project/application requirements we had to create a new 10gR2 database on a 11gR2 cluster. These are the high level steps that were attempted to complete this effort.

  1. Install 11gR2 Grid Infrastructure
  2. Create all ASM diskgroups using asmca
  3. Install 11gR2 database binaries
  4. Create the 11gR2 database using dbca from the 11gR2 DB home
  5. Install 10gR2 database binaries
  6. Create 10gR2 database using dbca from the 10gR2 DB home

Once all the prerequisites are met, the 11gR2 installation is a very smooth process. Everything goes so smoothly. Some of us who have worked with some of the true clustered database solutions such as Rdb on VMS clusters (many of you don't even know that Oracle owns another database called Oracle Rdb; Oracle acquired this excellent database from Digital Equipment Corporation, a.k.a. DEC, around 1992, and surprisingly Oracle Rdb is used by many customers even today to manage their VLDB systems), Oracle Parallel Server (OPS) and then most recently 9iR2 RAC, would remember how difficult it was to complete the installation. Oracle has come a long way in streamlining this process. It is so easy, and the entire 11gR2 RAC configuration can be completed with little or no effort in less than 1 hour.

Once the 11gR2 environment was up and running, the next step was to configure the 10gR2 RAC database on the same cluster. We first installed the 10gR2 binaries. runInstaller was able to see that there was a cluster already installed. During the verification step, the installer complained about an incompatible version of clusterware on the server.  We ignored the error and moved on. The binaries installed successfully on all nodes in the cluster.  After installing 10.2.0.1 we completed the upgrade to 10.2.0.4.

Note: When the 10gR2 installer was released, 11g was not available yet, so how could it be aware of a higher version? Higher versions are almost always compatible with lower versions.  With this idea we moved on.

Stumbling Block I

The next step was to configure the database using dbca. It is important to invoke dbca from the 10gR2 home's /bin directory.  We noticed that the intro screen was different; it did not show the choices we normally see in a clustered database installation.  We did not get the choice to select between creating a 'RAC' database or a 'single instance' database.  This indicated that something was wrong. Why did the installer see that there was a clusterware already there and that this is a RAC implementation, but not dbca?  Searching through the Oracle documentation I found this note:

"When Oracle Database version 10.x or 11.x is installed on a new Oracle grid infrastructure for a cluster configuration, it is configured for dynamic cluster configuration, in which some or all IP addresses are provisionally assigned, and other cluster identification information is dynamic. This configuration is incompatible with older database releases, which require fixed addresses and configuration.

You can change the nodes where you want to run the older database to create a persistent configuration. Creating a persistent configuration for a node is called pinning a node."

We can check whether the nodes are pinned using the olsnodes command. 11gR2 adds a new switch that lists the pinned status of a node.

[prddb1] olsnodes -h
Usage: olsnodes [ [-n] [-i] [-s] [-t] [<node> | -l [-p]] | [-c] ] [-g] [-v]
        where
                -n print node number with the node name
                -p print private interconnect address for the local node
                -i print virtual IP address with the node name
                <node> print information for the specified node
                -l print information for the local node
                -s print node status - active or inactive
                -t print node type - pinned or unpinned
                -g turn on logging
                -v Run in debug mode; use at direction of Oracle Support only.
                -c print clusterware name

[prddb1] olsnodes -t
prddb1     Unpinned
prddb2     Unpinned
prddb3     Unpinned

Pinning of a node is done using the crsctl utility.  crsctl and olsnodes are both located in the $GRID_HOME/bin directory.

crsctl pin css -n prddb1

Repeat the command for prddb2 and prddb3, then check if they are pinned:

[prddb1] olsnodes -t
prddb1     Pinned
prddb2     Pinned
prddb3     Pinned
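(Side note, in case it is ever needed: the operation can be reversed with crsctl unpin css -n <nodename>, but for running a pre-11.2 database the nodes should stay pinned.)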

Stumbling Block II

dbca was now able to see the RAC cluster and we continued.  We ran into the second stumbling block after selecting ASM as the storage manager: "ASM instance not found .. press ok to configure ASM".

In 11gR2, listeners are driven by the SCAN feature, meaning there is one SCAN listener for each SCAN IP defined in the DNS server. Apart from the three SCAN listeners, there is also the node listener that listens for the various database services and connections. The listener is named LISTENER in 11gR2, whereas it was called LISTENER_<HOSTNAME> in Oracle 10gR2. The dbca log files, located in $ORACLE_HOME/cfgtoollogs/dbca/trace.log, showed the following entries:

[AWT-EventQueue-0] [11:7:53:935] [NetworkUtilsOPS.getLocalListenerProperties:912] localNode=prddb1, localNodeVIP=prddb1-vip
[AWT-EventQueue-0] [11:7:53:935] [NetworkUtilsOPS.getLocalListenerProperties:913] local listener name = LISTENER_prddb1
[AWT-EventQueue-0] [11:7:53:939] [NetworkUtilsOPS.getLocalListenerProperties:923] No endpoint found for listener name=LISTENER_prddb1
[AWT-EventQueue-0] [11:7:53:939] [ASMAttributes.getConnection:209] getting port using fallback…
[AWT-EventQueue-0] [11:7:53:940] [ASMInstanceRAC.validateASM:609] oracle.sysman.assistants.util.CommonUtils.getListenerProperties(CommonUtils.java:455)

Metalink describes this error and suggests using the LISTENER name that is currently configured during database creation.  Well, this also did not help.

The workaround was to add the qualified listener to the listener.ora file and reload the listener.
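A rough sketch of the kind of entry we added to the listener.ora (the host name, VIP name and port here are placeholders for illustration, not the exact values from this cluster), followed by a reload of the listener:

LISTENER_PRDDB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = prddb1-vip)(PORT = 1521))
  )

$GRID_HOME/bin/lsnrctl reload

With the 10g-style LISTENER_<HOSTNAME> alias resolvable, dbca was able to find an endpoint and continue.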

Stumbling Block III

The listener fix got us through the ASM configuration and database creation. Hoping those were all the issues we could potentially run into, we moved ahead and defined database services as part of the database creation process.

Our next encounter was with the instance/database/services startup process. dbca failed with errors: unable to start the database, instance and database services. The entries had made it to the OCR file, as was evident from the crsstat output; however, they would not start. As the obvious step was to check whether these resources were registered with the OCR, we checked the status of the database using srvctl:

srvctl status database -d <dbname> returned no output. How could this be? crsstat showed the entries, but srvctl gave no results for the check.

The next step was to take a dump of the OCR file and see whether the entries were really in it; we found the entries. After considerable research we determined that the entire format, structure and syntax of the srvctl utility in 11gR2 is different compared to 10gR2. We tried the srvctl utility from the 10gR2 home and it did the trick.
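Roughly, the checks looked like this (run as a privileged user; the dump file name is arbitrary and <dbname> is the database we had just created):

$GRID_HOME/bin/ocrdump /tmp/ocr_dump.txt
grep -i <dbname> /tmp/ocr_dump.txt          (the resource entries were present)

and then, with ORACLE_HOME pointing at the 10gR2 database home:

$ORACLE_HOME/bin/srvctl status database -d <dbname>
$ORACLE_HOME/bin/srvctl start database -d <dbname>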

We now have both 11gR2 and 10gR2 RAC databases on an 11gR2 clusterware/Grid Infrastructure cluster, both using the 11gR2 ASM.

Logs in 10gR2 RAC..

 

I have received several emails asking for help with why Oracle does not write information into the alert log files during database startup failures.  Out of habit we tend to look for instance or database related information in our standard log directories such as the $ORACLE_BASE/admin/…/bdump or $ORACLE_HOME/network/log directories.  This causes panic and anxiety: searching Google, opening entries on OTN forums, or opening an SR with Oracle support.  The Oracle documentation has also not done a good job in this area.

Entries are not found in the alert log because the database/instance was not started using SQL*Plus; when the startup is attempted by the clusterware and fails early, entries are not added to the db alert log. Depending on what we are trying to look for, what area of the stack is being examined, or what state the application is running under, there are different kinds, flavors and locations of logs.

Note:  This is not a complete list, but a start for beginners to navigate their way through the troubleshooting process.

1.  During installation:

  • From the time the installation is started until the screen where the oraInventory location is specified, the logs are written to the /tmp directory of the server from which the installer is executed.  Subsequently, logs are written to the $ORACLE_BASE/oraInventory/logs location.
  • When creating a database using dbca or configuring the network using netca, the logs are generated under $ORACLE_HOME/cfgtoollogs/. Depending on the configuration assistant used, the logs are created under specific directories.

[oracle@ cfgtoollogs]$ ls -ltr
total 40
…………………………

drwxr-x--- 2 oracle dba 4096 Feb 16 16:50 oui
drwxr-x--- 3 oracle dba 4096 Feb 18 13:45 emca    <===== EMCA (Enterprise Manager) log files
drwxr-xr-x 2 oracle dba 4096 Feb 22 22:22 catbundle
drwxr-x--- 3 oracle dba 4096 Feb 23 13:13 dbca    <=====  DBCA log files
drwxr-xr-x 2 oracle dba 4096 Feb 23 13:22 netca   <=====  NETCA log files
drwxr-xr-x 3 oracle dba 4096 Mar 18 16:47 opatch  <=====  opatch log files

2.  Clusterware startup

  • On server start or reboot, the information pertaining to the various devices is written to the system log directory. The location of these log files can vary depending on the OS; on Linux servers they are located in /var/log/messages.
  • When the OS starts the clusterware, messages are written to two locations: (a) the alert log located at $ORA_CRS_HOME/log/<nodename>/alert<nodename>.log, and (b) for the various clusterware daemon processes, the respective daemon directories.
  • After the clusterware daemons are up and running, the clusterware attempts to start the resources configured and managed by the CRS. This is done by reading the configuration settings from the OCR file. During this process, the CRS generates log files for each resource.
    • For the nodeapps, which include the VIP, ONS and GSD, the log files are located in the $ORA_CRS_HOME/log/<nodename>/racg directory.
    • The logs related to the other resources are located under the respective home directories. For example, the ASM startup log files are located in the $ASM_HOME/log/<nodename>/racg directory and the RDBMS database related resource logs are located in the $ORACLE_HOME/log/<nodename>/racg directory.

[oracle@prddb1]$ ls -ltr $ORA_CRS_HOME/log/prddb1
total 168
drwxr-x--- 2 oracle dba  4096 Feb 16 14:41 admin   <===
drwxr-x--- 2 root   dba  4096 Feb 16 14:44 crsd    <===  cluster ready services daemon log files
drwxr-x--- 2 oracle dba  4096 Mar 18 09:17 evmd    <===  event manager daemon log files
drwxr-x--- 4 oracle dba  4096 Mar 23 13:12 cssd    <===  cluster synchronization services daemon log files
-rw-rw-r-- 1 root   dba 48915 Mar 29 02:44 alertprddb1.log <===  clusterware alert log
drwxrwxr-t 5 oracle dba  4096 Mar 29 11:03 racg    <===  all clusterware nodeapps such as VIP, ONS, etc
drwxr-x--- 2 oracle dba 98304 Mar 31 15:57 client  <===  all clusterware client log files

[oracle@prddb1 racg]$ ls -ltr $ASM_HOME/log/prddb1/racg
total 68
drwxr-xr-t 2 oracle dba  4096 Feb 16 15:53 racgmain
drwxr-xr-t 2 oracle dba  4096 Feb 16 15:53 racgeut
drwxr-xr-t 2 oracle dba  4096 Feb 16 15:53 racgmdb
drwxr-xr-t 2 oracle dba  4096 Feb 16 15:53 racgimon
-rw-r--r-- 1 oracle dba  2256 Mar 23 13:04 imon.log
-rw-r--r-- 1 oracle dba 22423 Mar 25 10:06 ora.prddb1.LISTENER_PRDDB1.lsnr.log <== ASM listener log
-rw-r--r-- 1 oracle dba  2617 Mar 25 10:06 mdb.log
-rw-r--r-- 1 oracle dba 17696 Mar 25 10:06 ora.prddb1.ASM1.asm.log  <==  ASM log generated by CRS during resource startup

[oracle@prddb1 racg]$ ls -ltr $ORACLE_HOME/log/prddb1/racg/
total 940

-rw-r--r-- 1 oracle dba  43669 Mar 25 10:06 ora.prddb1.RACPROD_PRDDB1.lsnr.log   <==  database listener log
-rw-r--r-- 1 oracle dba   6155 Mar 29 11:00 ora.RACPROD.RMANBKUP.RACPROD1.srv.log  <== database service log
-rw-r--r-- 1 oracle dba  15474 Mar 29 11:01 ora.RACPROD.RMANBKUP.cs.log   <== database composite service log
-rw-r--r-- 1 oracle dba  11606 Mar 29 11:16 ora.RACPROD.RACPROD1.inst.log  <== database instance startup log

3. Database logs

Once the database is up and running, the log entries are written to their normal destinations. For example, the database alert log will be written to $ORACLE_BASE/admin/<database name>/bdump/alert_<db instance name>.log.

Note: Only the errors during startup are logged under the $ORACLE_HOME/log/<nodename>/racg location.  If the startup is clean, the log at this location only has a success entry and the remaining startup entries are written to the alert_<instancename>.log on the respective nodes for the respective instances.
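So when an instance fails to start under the CRS, a reasonable first pass looks something like this (the paths follow the 10gR2 layout described above; the names in angle brackets are placeholders):

ls -ltr $ORACLE_HOME/log/<nodename>/racg/
tail -100 $ORACLE_HOME/log/<nodename>/racg/ora.<DBNAME>.<INSTANCENAME>.inst.log
tail -100 $ORACLE_BASE/admin/<dbname>/bdump/alert_<instancename>.log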

In 11gR2 this is again different; we will discuss that in a different blog at a later time.

10gR2 RAC – POC – additional VIP to isolate network traffic

In a RAC configuration we traditionally have a VIP associated with the public address for every node in the cluster.  Since this is the only public network to the database server, all traffic is directed through this VIP.

Note:  Multiple networks in the cluster are not officially supported for Oracle 10.2 RAC, so this entire venture of multiple networks was at our own risk.

We had a requirement to isolate the application traffic received from the application tier from the regular public/user/third-party application traffic.  For this, an additional public IP was configured for each server in the cluster and an additional VIP was defined on this new public IP.   Since the intention was to use this new VIP only for the application tier, we called it the middle tier (mt) VIP.   This note discusses the process followed and the outcome.

Steps to add the new VIP

There are two methods of adding a VIP to the CRS: from the command prompt as discussed below, or by creating a 'cap' file that contains the profile definitions.  This is a 3-node cluster; the nodes are ebspocdb1, ebspocdb2 and ebspocdb3.

1. Create a user VIP using the following command as user 'oracle':

/app/oracle/product/crs/bin/crs_profile -create ora.ebspocdb1-mt.vip -t application -d 'Mid Tier application VIP' -a /app/oracle/product/crs/bin/usrvip -h ebspocdb1 -p favored -o as=1,ap=1,ra=0,oi=eth3,ov=112.30.1.42,on=255.255.255.0
/app/oracle/product/crs/bin/crs_profile -create ora.ebspocdb2-mt.vip -t application -d 'Mid Tier application VIP' -a /app/oracle/product/crs/bin/usrvip -h ebspocdb2 -p favored -o as=1,ap=1,ra=0,oi=eth3,ov=112.30.1.43,on=255.255.255.0
/app/oracle/product/crs/bin/crs_profile -create ora.ebspocdb3-mt.vip -t application -d 'Mid Tier application VIP' -a /app/oracle/product/crs/bin/usrvip -h ebspocdb3 -p favored -o as=1,ap=1,ra=0,oi=eth3,ov=112.30.1.44,on=255.255.255.0

2. Once the profile has been created, register the resource with CRS..

/app/oracle/product/crs/bin/crs_register ora.ebspocdb1-mt.vip
/app/oracle/product/crs/bin/crs_register ora.ebspocdb2-mt.vip
/app/oracle/product/crs/bin/crs_register ora.ebspocdb3-mt.vip

3. This resource should be owned by 'root'.  As user 'root', change the ownership of the resource and then give the 'oracle' user execute privilege on the resource.   You can execute these commands from any node in the cluster.

/app/oracle/product/crs/bin/crs_setperm ora.ebspocdb1-mt.vip -o root
/app/oracle/product/crs/bin/crs_setperm ora.ebspocdb1-mt.vip -u user:oracle:r-x

/app/oracle/product/crs/bin/crs_setperm ora.ebspocdb2-mt.vip -o root
/app/oracle/product/crs/bin/crs_setperm ora.ebspocdb2-mt.vip -u user:oracle:r-x

/app/oracle/product/crs/bin/crs_setperm ora.ebspocdb3-mt.vip -o root
/app/oracle/product/crs/bin/crs_setperm ora.ebspocdb3-mt.vip -u user:oracle:r-x

4. Start the resource as user ‘oracle’

/app/oracle/product/crs/bin/crs_start -c ebspocdb1 ora.ebspocdb1-mt.vip
/app/oracle/product/crs/bin/crs_start -c ebspocdb2 ora.ebspocdb2-mt.vip
/app/oracle/product/crs/bin/crs_start -c ebspocdb3 ora.ebspocdb3-mt.vip

5. Verify that the VIP has been configured and started:

[oracle@ebspocdb1 ~]$ crsstat | grep .vip
ora.ebspocdb1.vip           ONLINE     ONLINE on ebspocdb1
ora.ebspocdb1-mt.vip        ONLINE     ONLINE on ebspocdb1
ora.ebspocdb2.vip           ONLINE     ONLINE on ebspocdb2
ora.ebspocdb2-mt.vip        ONLINE     ONLINE on ebspocdb2
ora.ebspocdb3.vip           ONLINE     ONLINE on ebspocdb3
ora.ebspocdb3-mt.vip        ONLINE     ONLINE on ebspocdb3

NOTE:  If you need to modify this resource for any reason, you could use the crs_profile -update command.  However, you have to first stop the resource using the crs_stop command and then execute the update command. For example:
crs_profile -update ora.ebspocdb1-mt.vip -t application -a /app/oracle/product/crs/bin/usrvip -h ebspocdb1 -p favored -o oi=eth3,ov=112.30.1.42,on=255.255.255.0
After executing the update command, repeat steps 2 through 4 to implement this change.

Create a second database listener

Create a second listener on the new mt VIP using netca, on a different port.  Select the new mt VIP as the network.  This will also add the new listener to the CRS.  After configuring the listener, verify the status:

[oracle@ebspocdb1 ~]$ crsstat | grep .lsnr
ora.ebspocdb1.LISTENERMT_EBSPOCDB1.lsnr    ONLINE     ONLINE on ebspocdb1
ora.ebspocdb1.LISTENER_EBSPOCDB1.lsnr      ONLINE     ONLINE on ebspocdb1
ora.ebspocdb2.LISTENERMT_EBSPOCDB2.lsnr    ONLINE     ONLINE on ebspocdb2
ora.ebspocdb2.LISTENER_EBSPOCDB2.lsnr      ONLINE     ONLINE on ebspocdb2
ora.ebspocdb3.LISTENERMT_EBSPOCDB3.lsnr    ONLINE     ONLINE on ebspocdb3
ora.ebspocdb3.LISTENER_EBSPOCDB3.lsnr      ONLINE     ONLINE on ebspocdb3

Update the LOCAL_LISTENER and REMOTE_LISTENER definitions in the tnsnames.ora file.

Update the LOCAL_LISTENER and REMOTE_LISTENER definitions in the tnsnames.ora file to include the new LISTENERMT_EBSPOCDB1 listener definition.  Once this is complete and the listener is recycled, the database services are dynamically registered with the listener.
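As an illustrative sketch only (the MT VIP host name and port 1526 are assumptions, not the actual values used in this environment), the local listener alias for node 1 ends up looking something like this, with the REMOTE_LISTENER alias extended the same way for all nodes:

LISTENER_EBSPOCDB1 =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = ebspocdb1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = ebspocdb1-mt-vip)(PORT = 1526))
  )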

The Results

While the default VIP was defined in the DNS server, the private VIP was only visible to the app tier. During a load test, we noticed that not all connections were established successfully.  In fact, every other connection failed.

The intention of having a dedicated VIP was to isolate connections from the app tier to the database tier using this private VIP.  However, once the LOCAL_LISTENER parameter is configured, database services are registered with both listeners.  What's wrong with that? When the app tier makes a connection request to the database tier using the new VIP, and the database tier has to return the session handle back to the requestor, it cannot pin it back to the same listener that received the request; almost every other time the session handle was routed to the other listener, causing session death.

What would have been nice is if the services could be mapped to a specific listener, which would allow sessions to be pinned back from the service, through the listener, to the requestor.

Conclusion

It is obvious why Oracle does not support multiple public networks for the database tier: since VIPs are assigned to the public NIC, the additional VIP did not help with this isolation.

OCR Repair..

We are in the middle of a test cycle trying to implement FAN between BEA WebLogic and an Oracle 10gR2 3-node RAC database on OEL 5.   As part of the configuration and setup, we added the remote application servers to the ONS configuration.   After that, the clusterware did not restart on reboot.

1. Checking the daemons using ps -ef and grepping for cssd, crsd and evmd showed that all daemons were up. However, crs_stat (or the crsstat wrapper) did not give any output.

2. Checking the CSSD log file, I noticed the following message:

$ORA_CRS_HOME/log/prddb3/cssd/cssd.log

[CSSD]2009-12-24 19:30:36.042 [1274124608] >TRACE:   clssnmRcfgMgrThread: Local Join
[CSSD]2009-12-24 19:30:36.042 [1274124608] >WARNING: clssnmLocalJoinEvent: takeover aborted due to ALIVE node on Disk

Note:  This basically indicated that a node was locking the disk, not allowing other nodes to join the cluster.  The node (prddb4) was trying to read the OCR file (please note that the OCR is the first file accessed by the clusterware during startup) and was not able to.  This potentially indicates a bad OCR file.

In a similar situation before, a reboot of all servers fixed the locking and the clusterware started without any hiccups.

There may have been other reasons why this could have happened; however, due to the urgent nature of the problem and the time it could take to debug and/or troubleshoot the situation, we decided to repair the OCR file.

3. Nodes prddb3 and prddb5 were repeatedly attempting to start the CRS, which generated lots of log entries.  To avoid the logs filling up the disks, we requested the system admins to shut down the cluster.

Now, to fix the problem, only one node (prddb3) was started.  We disabled the autostart of CRS using the following commands (this requires root access):
/etc/init.d/init.crs stop  (to stop the CRS stack)
/etc/init.d/init.crs disable  (to disable auto start on reboot)

4. Based on the analysis in step 2 above, the next step was to repair the OCR file using the following commands, also as user 'root'.

[root@prddb3 bin]# ./ocrconfig -repair ocr /dev/raw/ocr1
[root@prddb3 bin]# ./ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     306968
         Used space (kbytes)      :      12852
         Available space (kbytes) :     294116
         ID                       :  658275539
         Device/File Name         : /dev/raw/ocr1
         Device/File integrity check succeeded

         Device/File not configured

Cluster registry integrity check succeeded

5.  'Device/File not configured'?  Then what about the 'check succeeded' message? Isn't it a bit confusing? We had configured two OCR files, so why is the second file missing?  We realized that the repair command will only repair one OCR file at a time, and we had only repaired the primary copy.  The next step was to repair the mirror copy:

[root@prddb3 bin]# ./ocrconfig -repair ocrmirror /dev/raw/ocr2
[root@prddb3 bin]# ./ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     306968
         Used space (kbytes)      :      12852
         Available space (kbytes) :     294116
         ID                       :  658275539
         Device/File Name         : /dev/raw/ocr1
         Device/File integrity check succeeded
         Device/File Name         : /dev/raw/ocr2
         Device/File integrity check succeeded

Cluster registry integrity check succeeded

6.  Now that both OCR files were fixed, we started the clusterware stack using the /etc/init.d/init.crs start command. This brought up the clusterware and the complete stack without any issues.
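A quick status check at this point (using crs_stat -t, or the crsstat wrapper mentioned earlier) should show all the resources ONLINE:

$ORA_CRS_HOME/bin/crs_stat -t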

7. Now we had to reset the clusterware startup process to auto restart on reboot (recollect that we disabled autostart in step 3 above).

/etc/init.d/init.crs enable
/etc/init.d/init.crs start