Recovering from Undo Tablespace Corruption in Oracle Database

Recovering from Undo Tablespace Corruption in OracleIf you have been an Oracle DBA for any length of time, it’s quite likely that you’ve encountered an ORA-1578, ORA-7445, or ORA-600 error during your career. These are indications of a symptom that all DBAs love to hate: corruption. Fortunately, corruption within Oracle database structures doesn’t occur very often, but when it does, it must be dealt with carefully and immediately. It is also crucial to determine the true root cause (or causes) of the corruption and employ proactive measures to ensure it will not reoccur.



 

The simplest technical definition of corruption in an Oracle database is that it is an inconsistency in either a data block or buffer cache memory structure. Fortunately, there are well-known methods to prevent, detect, and repair corruption depending on the type of corruption encountered. The remainder of this blog discusses the appropriate approaches to detecting, repairing, and then proactively preventing corruption in rollback segments in the SYSTEM tablespace and undo segments in an undo tablespace.

 

Preventing, Detecting, and Repairing Corruption

First, it’s important to distinguish two broad types of corruption: memory corruption and block corruption.

An ORA-7445 error is generally indicative of memory corruption, and the root causes of this error are legion: It could be caused by a malfunction in the database server’s dynamic random-access memory (DRAM), storage network, I/O controller memory, or even the internal disk drive’s memory cache. This type of error requires careful testing and retesting to establish the root cause, and it may even require intervention by the appropriate hardware technicians to diagnose, isolate, and repair the ultimate source of the problem.

An ORA-600 or ORA-1578 error typically indicates either logical or physical corruption of a database block or memory buffer. Physical corruption means that the database cannot recognize the block or buffer correctly; perhaps the block header has been corrupted, or its actual size no longer matches its expected size. Logical corruption implies that even though the block or buffer structure is still intact, its contents are essentially gibberish and are no longer legible to the database. As these issues are mostly within an Oracle DBA’s jurisdiction, the remainder of this section focuses on how to prevent, detect, and repair this type of corruption.

Prevention. A smart Oracle DBA spends her time anticipating a problem before it happens. In the case of corruption, she follows recommended best practices, including avoiding single points of failure in the database’s hardware and network, activating appropriate monitoring of the hardware and the database using tools such as Oracle Enterprise Manager 12c Cloud Control, and activating appropriate values for specific initialization parameters to limit or eliminate database corruption.

Detection. Unfortunately, detection of database corruption often first appears when a user reports a strange error message on his screen, usually followed by a frantic search of the corresponding database’s alert log to find the ORA-1578 error that matches the database block that’s been corrupted. However, it’s also possible to detect block corruption outside of application errors in the following ways:

  • During regularly scheduled Recovery Manager (RMAN) backup processing
  • After running the RMAN BACKUP VALIDATE command against a database, tablespace, or datafile, either with or without the optional CHECK LOGICAL clause to identify logical corruption
  • Running the DBMS_REPAIR package against a specific database segment
  • Executing the database verification utility, DBVerify, against a specific datafile

In these cases, it’s likely that the V$DATABASE_BLOCK_CORRUPTION dynamic view will contain a list of the corrupted blocks and that information can be leveraged effectively for repair of the corrupted blocks.

Repair. Once corruption has been detected, it’s the job of an Oracle DBA to repair the damage as quickly as possible. The appropriate method to repair the damage depends on the type of corruption detected, the criticality of the object to the database application, the amount of downtime that the corruption is causing or may cause, the availability of RMAN backups on either disk or tape, and even the specific disaster recovery environment that has been implemented. For example, a corrupted database block on disk in the oldest historical partition of a partitioned table containing data from 20 years ago is obviously of less immediate concern than a block in the hottest partition of the same table.

 

Handling Memory Corruption

When blocks are being read into an Oracle database’s buffer cache, Oracle background processes will verify the blocks being read from disk. The background processes compare the incarnation number (INC) and sequence number (SEQ) in the header of the data block with the INCSEQ structure in the database server’s baseboard to make sure that the block versions are exactly the same. This verification should prevent an Oracle database from reading a block from disk that contains a corrupted block header.

Even so, it is still possible for memory corruption to occur. Here are several possible vectors:

  • Oracle software bugs
  • Operating system software bugs
  • Non-Oracle programs writing to the same memory address being used by Oracle
  • Hardware problems, including failing DRAM components

Prevention. One of the best ways to prevent unnecessary corruption of Oracle database blocks is to set the DB_BLOCK_CHECKSUM initialization parameter to its recommended setting of FULL. The Oracle database will then verify that a block is logically consistent before it’s written back to disk. Be aware, however, that there is a potential performance penalty of up to 10 percent if the application workload is skewed heavily toward DML because of the additional verification processing this block checking entails.

Starting with Oracle Database 11g, it’s even simpler to set the appropriate values to guard against corruption via the DB_ULTRA_SAFE initialization parameter. This parameter accepts just three possible values:

  • As might be expected, OFF sets the lowest acceptable values: DB_CLOCK_CHECKING,
  • DB_BLOCK_CHECKSUM, and DB_LOST_WRITE_PROTECT are deactivated (but can be set separately to appropriate values if desired).
  • DATA_ONLY sets DB_CLOCK_CHECKING to MEDIUM, DB_LOST_WRITE_PROTECT to TYPICAL, and DB_BLOCK_CHECKSUM to FULL.\
  • DATA_AND_INDEX sets DB_BLOCK_CHECKING and DB_CLOCK_CHECKSUM to FULL and DB_LOST_WRITE_PROTECT to TYPICAL.

Detection. Memory corruption is typically detected by monitoring the database instance alert log for ORA-600 errors; however, it’s just as likely that they will be reported when database application users unexpectedly receive an ORA-600 error or when error messages that are related to memory corruption are reported to the Oracle DBA.

Repair. Restarting a database instance is one potential method to clear at least some memory corruption errors. However, the Oracle hanganalyze utility is an excellent tool for collecting more information, and it’s likely that Oracle Support will require more detailed information when opening a service request to solve the memory corruption issues. Collecting hanganalyze information is relatively straightforward, and there are two methods to choose from: SQL*Plus and Oradebug.

 

Invoking Hanganalyze via SQL*Plus

One of the best methods to gather information about database performance problems when the database is hanging is to use the HANGANALYZE procedure as follows:

  1. Start a SQL*Plus session as SYSDBA and run the following command:
ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level 3';
  1. Wait 1 minute and run this same command again:
ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level 3';
  1. Wait 1 more minute and again run the same command:
ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level 3';
  1. Wait 1 more minute and then run this command to set another event:
ALTER SESSION SET EVENTS 'immediate trace name SYSTEMSTATE level 266';
  1. Wait 1 more minute, and run the same event command again:
ALTER SESSION SET EVENTS 'immediate trace name SYSTEMSTATE level 266';
  1. Finally, wait 1 more minute and run the same command once more:
ALTER SESSION SET EVENTS 'immediate trace name SYSTEMSTATE level 266';
  1. Exit your SQL*Plus session, gather all trace files generated in your database’s USER_DUMP directory, and send them on to Oracle Support for expert analysis.

 

Using Hanganalyze without a Persistent SQL*Plus Connection

If you are unable to log in to the affected database with SQL*Plus at the exact moment that the hang occurs, this method will work instead:

 

1. Log in to SQLPLUS using the following command:

sqlplus -prelim / as sysdba

2. Set the pid of your session using the following command:

oradebug setmypid

3. Set the ulimit of this session:

oradebug unlimit

4. Run the HANGANALYZE command:

oradebug hanganalyze 3

5. Wait 3 minutes and run it again:

oradebug hanganalyze 3

6. Wait 3 more minutes and run it once more:

oradebug hanganalyze 3

7. Wait 3 more minutes, then run this command:

oradebug dump systemstate 10

8. Now wait 7 minutes and run the same command again:

oradebug dump systemstate 10

9. Wait 7 minutes and run the same command once more:

oradebug dump systemstate 10

10. Cancel the tracing, stop the session, and send the appropriate files from USER_DUMP_DEST to

Oracle Support.


 

Note 

Handling Logical Corruption

Poorly designed applications can cause logical corruptions because they do not check integrity of data. Some bugs related to the Oracle optimizer could also lead to logical corruptions.

Prevention. Excessively test application code before deploying it to a production environment. Also, be sure to keep your Oracle database patched to the most recent patch set release and strive to upgrade databases to the most stable version for your current release.

Detection. Logical corruptions caused by application errors are difficult to detect; the best professionals to find errors of this kind are the developers of the application in question.

Repair. The same professional who finds the error—usually, the application developer—will be the one to fix it.

 

Overcoming Media Corruption

Media corruption has several potential vectors, including (but not limited to) the following:

  • Hardware problems in the database server, storage area network, or physical storage components
  • Misconfigured parameters in the operating system storage area network
  • I/O controller issues
  • Logical volume issues
  • User errors (e.g. a user inadvertently replaced or deleted a datafile)
  • Unexpected bugs in Oracle database or Automatic Storage Management (ASM) software
  • Bugs in the database server’s operating system software

Prevention. Depending on the cause, here are some recommended best practices to prevent media corruption:

  • Practice the principle of least privilege to ensure that no one can inadvertently overwrite or delete a datafile.
  • Have a good backup policy, and always test a full restore at regular intervals to ensure full database recoverability can be performed when it is least expected.
  • Install a monitoring tool such as Oracle Grid Control 11g or Cloud Control 12c so alerts can be raised whenever an ORA error occurs, or deploy shell scripts to monitor the alert log.
  • Always make changes to your database in a test environment before deploying any application changes that might cause corruption.

Detection. ORA errors will be found in alert log files, and sometimes users accessing the database will report these errors. In addition, DBVerify can be used to find physical corrupted blocks in datafiles.

Repair. Datafile block corruption can be repaired in many different ways, and the appropriate method depends on the type of block that needs recovery.

 

File Header Block Corruption

As the name implies, the header block keeps track of information about the datafile as well as the status of the file and the last time that a RESETLOG operation was performed. Also, every time a checkpoint occurs, the file header block is updated with checkpoint information.

Detection. Fortunately, there are several ways to detect file header block corruption:

  • Review the database’s alert logs for any ORA errors.
  • Run DBVerify against datafiles.
  • Issue the ALTER SYSTEM CHECK DATAFILES command, and then check the database’s alert log afterwards to see if any errors were generated.
  • Issue SELECT FILE#, ONLINE_STATUS from V$RECOVER_FILE to check for any datafile that needs recovery, and then review the output to find any information about corrupted blocks.
  • The error ORA-1578 is always generated when a file header becomes corrupted, and this error message will display the block number and data file number of the affected datafile.

Repair. The only repair option in this situation is to restore the block from a valid datafile backup.

 

Data Dictionary Object Block Corruption

Data dictionary object blocks are crucial because they belong to objects that reside in the SYSTEM tablespace and contain information that the database cannot live without—the objects that comprise the database itself.

Detection. Again, a couple of methods are available to detect block corruption:

  • Running DBVerify against the SYSTEM tablespace’s datafiles will identify which blocks are corrupted.
  • The database’s alert log file will show ORA-1578 errors related to the SYSTEM tablespace’s datafiles.

Repair. As the SYSTEM tablespace is absolutely crucial to the database’s operation, this situation must be dealt with immediately upon detection of corruption.

The best method is to restore the SYSTEM tablespace from a valid backup and recover the block/datafile. However, remember that this recovery will require shutting down the database instance and restarting it in MOUNT mode before recovery can continue.

If you have a disaster recovery site provided by either Data Guard physical replication or GoldenGate logical replication, you may consider switching over to that system until the data dictionary corruption is repaired.

 

Undo Header and Undo Block Corruption

Transactions utilize undo segments in two cases:

  • If the database detects that a transaction has failed, it will automatically roll back that transaction to maintain data integrity.
  • When an application user explicitly cancels the transaction through a web browser or other interface, the application issues the ROLLBACK command to undo everything that was committed within the current transaction.

Undo segments are therefore crucial for transaction consistency, so any undo tablespace corruption issues must be handled with extreme urgency.

Detection. Again, several methods are available to detect block corruption:

  • Run DBVerify on the undo tablespace’s datafile to discover any corruption.
  • The database’s alert log will register ORA errors when undo corruption is detected. If the corrupted undo segment is not offline, an ORA-1545 is typically raised, while a query that is unable to locate the appropriate version of a block from an undo segment for read consistency will typically raise an ORA-1578 error.
  • Query the V$ROLLSTAT dynamic view; if any undo segment shows a status of NEEDS RECOVERY, then undo corruption is likely.

Repair. Recovering from undo segment corruption is a bit trickier than other corruption recovery scenarios.

If an undo segment is found to contain a corrupted block, a good first attempt to solve the problem is to simply take the undo segment offline and then drop the segment.

If that method doesn’t work, or if it is impossible to drop the undo segment, then it will be necessary to restore and recover the undo tablespace from a valid backup. However, as with SYSTEM tablespace recovery, doing so will require shutting down the database instance and restarting it in mount mode before recovery of the undo tablespace can continue.

One unsupported repair method involves the following procedure:

  1. If the corrupted undo block is part of a table, index, or cluster, then activate the following event. It will write information about the object number into a trace file in the USER_DUMP_DEST directory for the current session:
ALTER SYSTEM SET EVENT ”10015 trace name context forever, level 10”;
  1. Review the trace file to find the object number, and then run the following query:

 

SQL> SELECT owner, object_name, object_type, status
FROM dba_objects
WHERE object_id = <object# from trace file>;

  1. Drop the object and restore only this object from a valid backup.

If the block belongs to the undo header, then the repair is much more tedious:

  1. List the corrupted undo segment by setting unsupported parameter _CORRUPTED_ROLLBACK_SEGMENTS to a value of rollback segments’ names.
  2. Run a full database export.
  3. Create an empty database.
  4. Run a full database import.

 

Summary

The undo tablespace is a vital part of any Oracle database after Oracle Database 10g Release 1. It’s therefore crucial for an Oracle DBA to be able to detect, diagnose, and repair any problem related to the undo tablespace and its corresponding undo segments and extents. If the DBA ignores any of the potential issues has described, database applications may encounter unacceptable response times, the database instance may unexpectedly hang, and the database may even stop operating completely or be impossible to open or reopen. Following are the best practices:

  • Be sure that you understand the intricacies of how an Oracle database uses its undo tablespace, including how the UNDO_RETENTION parameter setting determines the potential size of an undo tablespace as well as how long an unexpired undo extent will be retained for purposes of read consistency.
  • Use the Undo Advisor tool. Whether accessed via Oracle Enterprise Manager or the undocumented DBMS_UNDO_ADVISOR package, it can be invaluable for tuning undo tablespace usage and related initialization parameter settings.
  • Leverage the hanganalyze utility to accurately ascertain the cause of an apparent database hang and decide which approach to overcome the hang is most warranted.
  • Learn how to distinguish the types and potential causes of block media corruption, and be prepared to detect, diagnose, and correct corruption in any circumstance because this type of corruption can cause an Oracle database to hang and may even require database downtime to repair the damage.

 

Вас заинтересует / Intresting for you:

Oracle Database and Security: ...
Oracle Database and Security: ... 870 views sepia Mon, 26 Nov 2018, 12:20:58
Oracle Database and Instance d...
Oracle Database and Instance d... 507 views Masha Thu, 21 Jun 2018, 18:23:39
Basic Oracle Database 12c Secu...
Basic Oracle Database 12c Secu... 799 views Zero Cool Sat, 06 Oct 2018, 13:53:13
Selecting Table Data Types App...
Selecting Table Data Types App... 673 views Ded Sun, 19 Aug 2018, 12:33:49

Comments on Recovering from Undo Tablespace Corruption in Oracle Database

Be the first to comment
Please login to comment