OCR Repair – yet another scenario

Few months back I had written about an situation when the OCR file was corrupt and we had to repair the file to fix issues.  We ran into yet another scenario in our 10gR2 RAC environment.  But this time we really did not notice any accessibility issues for the CRS, because the CRS stack would get started just fine however the database, instances and or the database services would not get started by the CRS.  

While the basic solution is to restore the OCR file from a clean backup or export file, the troubleshooting and maintenance tasks maybe a bit different. Checking the status of the applications managed by the clusterware we noticed that the database, instance and database services did not start.  

HA Resource                                        Target          State
———–                                                 ——             —–
ora.TSTDB.TSTDB1.inst                    ONLINE      OFFLINE on tstdb1
ora.TSTDB.TSTDB2.inst                    OFFLINE     OFFLINE on tstdb2
ora.TSTDB.TSTDB3.inst                    OFFLINE     OFFLINE on tstdb3
ora.TSTDB.EXTVEN.TSTDB1.srv  OFFLINE     OFFLINE on tstdb1
ora.TSTDB.EXTVEN.TSTDB2.srv  OFFLINE     OFFLINE on tstdb2
ora.TSTDB.EXTVEN.TSTDB3.srv  OFFLINE     OFFLINE on tstdb3
ora.TSTDB.EXTVEN.cs                      OFFLINE     OFFLINE on tstdb1
ora.TSTDB.db                                        OFFLINE     OFFLINE on tstdb1
ora.tstdb1.ASM1.asm                         ONLINE      ONLINE  on tstdb1
……………………….
……………………….  

Repeated attempts to start the database and instances using srvctl did not help.  So we opted to do some troubleshooting.  

Check the database and instances manually:  Using SQL Plus try to start a database instance. If this works then ASM and DB are in good health and its just the CRS that is not able to start or display the status of the database. Its not able to retrieve the database/instance/service definitions from the OCR file and perform the required operations.  

As user ‘root’, shutdown crs on all servers..   

 /etc/init.d/init.crs stop 

Check if the /etc/oracle/ocr.loc file has the following entries..  

[oracle@tstdb2 mvallath]$ cat /etc/oracle/ocr.loc   
ocrconfig_loc=/dev/raw/ocr1                           <== PRIMARY COPY OF THE OCR FILE  
ocrmirrorconfig_loc=/dev/raw/ocr2            <== MIRRORED COPY OF THE OCR FILE  
local_only=FALSE  

If the cluster detects that the OCR file is corrupt it will disable the corrupt file and continue with the mirrored copy.  In which case you may see the ocr.loc file was altered by the clusterware.  Most of the time the mirrored copy should be good and the clusterware should start all applications.  However in this case the mirrored copy was only partially good.  All components of the clusterware stack and applications with the exception of the database, instances and database services were up and visible to the CRS.  

[oracle@tstdb1 ~]$ cat /etc/oracle/ocr.loc  
#Device/file /dev/raw/ocr1 being deleted   <==ALTERED BY CLUSTERWARE..   
ocrconfig_loc=/dev/raw/ocr2                          <== ALTERED BY CLUSTERWARE..   
local_only=false  

Please take note from the output above, that clusterware not only marked the primary copy as ‘deleted’ but also renamed the mirrored copy as the new primary copy and changed the local_only=false (false was in upper case earlier). Before restoring the good copy of the OCR file, the ocr.loc file needs to be changed to reflect its original state. Manually edit this file with the appropriate entries or get a good copy of the ocr.loc (if other servers have not been modified) moved from one of the other servers to this location.  

Normally under such circumstances, CSS could coredump.  The dump file is located in $ORA_CRS_HOME/log//cssd directory.  Analyzing the coredump we also found the following string “#Device/file %s being deleted  

Once the ocr.loc file is updated on all servers in the cluster, import the OCR from a latest export file as user ‘root’  

/app/oracle/product/crs/ocrconfig -import /home/oracle/mvallath/OCRexp28Apr2010.dmp  

If you do not have an good export .. you can get a good copy of the backup.. (clusterware performs an automatic backup of the OCR file every 4 hours)  and restore the OCR file..  

/app/oracle/product/crs/ocrconfig -showbackup  
tstdb1     2010/04/28 09:05:52     /app/oracle/product/crs/cdata/tstcw 

List all the backups found in the location from the above output..      

[oracle@tstdb1 ~]$ ls -ltr /app/oracle/product/crs/cdata/tstcw
total 50328
-rw-r–r– 1 root root 8572928 Apr 27 21:05 day.ocr
-rw-r–r– 1 root root 8572928 Apr 28 01:05 week.ocr
-rw-r–r– 1 root root 8572928 Apr 28 01:05 day_.ocr
-rw-r–r– 1 root root 8572928 Apr 28 01:05 backup02.ocr
-rw-r–r– 1 root root 8572928 Apr 28 05:05 backup01.ocr

8572928 Apr 28 09:05 backup00.ocr root root–r– 1 -rw-r

  

Find the best version of the OCR backup file based on when the clusterware was stable and restore using the following command:  
/app/oracle/product/crs/ocrconfig -restore /app/oracle/product/crs/cdata/tstcw/day.ocr  

Start the clusterware on all nodes as user ‘root’
/etc/init.d/init.crs start  

Note:  Occasionally once the restore or import is completed, clusterware could go into a panic mode and reboot the server.  This is fine and is normal behavior, this normally happens when crs has not completely shutdown and has a hanging daemon on one of the servers.  

This should start the clusterware and all the applications managed by the CRS.  once the clusterware components have been verified and database has been checked.. It would be a good practice to export the OCR file..as a good copy.  

Thanks to Baldev Marepally for helping trouble shoot this issue..  

About Murali Vallath
Murali Vallath has over 20 years of experience designing and developing databases. He provides independent Oracle consulting services focusing on designing and performance tuning of Oracle databases through Summersky Enterprises (www.summersky.biz). Vallath has successfully completed over 100 successful small, medium and terabyte sized RAC implementations (Oracle 9i, Oracle 10g & Oracle 11gR2 ) for reputed corporate firms. Vallath is a regular speaker at industry conferences and user groups, including the Oracle Open World, UKOUG and IOUG on RAC and Oracle RDBMS performance tuning topics. Vallath's Publications: Author: 1. ‘Oracle Real Application Clusters’ Publisher: Digital Press 2. ‘Oracle 10g RAC, Grid, Services & Clustering’ Publisher: Digital Press. Co-Author 3. 'Automatic Storage Management Publisher: Oracle Press'

One Response to OCR Repair – yet another scenario

  1. This is exactly the 2nd blog post, of your site I read through.
    Yet I like this specific one, “OCR Repair – yet
    another scenario Summersky RAC Notebook” the very best.
    Cya -Crystle

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 189 other followers

%d bloggers like this: