Understanding and fixing ORACLE ASM errors ORA-600 [kfcChkAio01] and ORA-15196.

Posted by PDSERVICE on Mar 15, 2017

Symptoms

Errors ORA-600 [kfcChkAio01] and ORA-15196 can be reported after a non-clean dismount of the diskgroup, normally caused by a crash of the ASM instance.

When the ASM instance is restarted and the diskgroup is mounted, the following messages are reported in the alert.log of the ASM instance:

* Messages indicating recovery:

NOTE: starting recovery of thread=1 ckpt=201.9904 group=2
NOTE: starting recovery of thread=2 ckpt=139.4186 group=2

* The messages about the error ORA-600 and ORA-15196:

Tue Dec 16 03:00:51 2008
Errors in file /u01/app/oracle/product/10.2.0/asm/admin/+ASM/udump/+asm2_ora_15305.trc:
ORA-00600: internal error code, arguments: [kfcChkAio01], [], [], [], [], [], [], []
ORA-15196: invalid ASM block header [kfc.c:5552] [endian_kfbh] [2079] [2147483648] [1 != 0]
Abort recovery for domain 2
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup FLASH

As a result, the diskgroup is dismounted. Subsequent mount attempts will report the same set of errors.

Bug 7589862 was created for this case.

Cause

To diagnose and identify the problem, several important pieces of information are dumped into the trace file generated by the errors.

The call stack on the trace

kfcChkAio <- kfcGet0 <- kfcGet1Priv <- kfcRcvGet <- kfcema <- kfrPass2 <- kfrcrv <- kfcMount <- kfgInitCache <- kfgFinalizeMount <-
kfgscFinalize <- kfgForEachKfgsc <- kfgsoFinalize <- kfgFinalize <- kfxdrvMount <- kfxdrvEntry

The functions on the call stack indicate operations such as mounting the diskgroup (kfxdrvMount) and recovery (kfrcrv).

Description of the errors

ORA-00600: internal error code, arguments: [kfcChkAio01], [], [], [], [], [], [], []

kfcChkAio01 is signaled if the I/O operation failed because of an invalid block.

ORA-15196: invalid ASM block header [kfc.c:5552] [endian_kfbh] [2079] [2147483648] [1 != 0]

This error is reported when a block fails validation. The arguments are:

endian_kfbh	The first field of the block header. This is the field that failed validation.
2079	The ASM file number. This value will be different in each case.
2147483648	The block number found in kfbh.block.blk, another field of the block header. In hex this is 0x80000000; the rightmost bytes hold the block number (0 here).
1 != 0	1 was the value found in the field named by the endian_kfbh argument, but 0 was the expected value.
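
The decimal-to-hex relationship described for kfbh.block.blk can be checked with a few lines of shell. This is a minimal sketch: decode_blk is a hypothetical helper, and the split into a top flag bit (T) and a 31-bit block number (NUMB) is an assumption inferred from the "T=... NUMB=..." annotations that kfed prints in this note, not a documented format.

```shell
# decode_blk is a hypothetical helper, not part of kfed.
# Assumption: bit 31 is the T flag and the low 31 bits are the block number,
# as suggested by the "T=... NUMB=..." annotation in the kfed output.
decode_blk() {
    blk=$1
    t=$(( (blk >> 31) & 1 ))          # top bit of the 32-bit value
    numb=$(( blk & 0x7FFFFFFF ))      # remaining 31 bits
    printf 'hex=0x%X T=%d NUMB=0x%x\n' "$blk" "$t" "$numb"
}

decode_blk 2147483648   # the value reported in ORA-15196
decode_blk 89088        # the value kfed shows in the unformatted block
```

The first call corresponds to the value in the error message, the second to the kfbh.block.blk value that kfed reports for the damaged block later in this note.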

The trace file will have the information about the Cache Element and Buffer header affected by the error:

Start recovery for domain 2, valid = 0, flags = 0x4
NOTE: starting recovery of thread=1 ckpt=201.9904 group=2
NOTE: starting recovery of thread=2 ckpt=139.4186 group=2
CE: (0xc0000000153d0bb8) group=2 (FLASH) obj=2079 blk=0 (indirect)
hashFlags=0x0100 lid=0x0002 lruFlags=0x0000 bastCount=1
redundancy=0x11 fileExtent=0 AUindex=0 blockIndex=0
copy #0: disk=0 au=7492
BH: (0xc0000000153a54d0) bnum=322 type=rcv reading state=rcvRead chgSt=not modifying
flags=0x00000000 pinmode=excl lockmode=null bf=0xc000000015141000
kfbh_kfcbh.fcn_kfbh = -1.-1826817 lowAba=0.0 highAba=0.0
last kfcbInitSlot return code=null cpkt lnk is null

From the Cache Element, it is possible to identify the disk and allocation unit involved with the error:

copy #0: disk=0 au=7492
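
Pulling the disk and allocation unit out of a large trace file by eye is error prone, so a small filter may help. The sketch below assumes the "copy #N: disk=D au=A" layout shown above; ce_copies is a hypothetical helper name.

```shell
# ce_copies is a hypothetical helper: it prints one "disk au" pair for every
# "copy #N: disk=D au=A" line found in the trace file given as $1.
ce_copies() {
    sed -n 's/^ *copy #[0-9]*: disk=\([0-9]*\) au=\([0-9]*\).*/\1 \2/p' "$1"
}
```

For the trace in this example, running ce_copies against +asm2_ora_15305.trc would be expected to print the pair 0 7492.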

From the alert.log it is possible to identify the path of the disk. Go back through the file to the last time the diskgroup was mounted without errors and check for messages like:

NOTE: cache opening disk 0 of grp 2: FLASH_0000 path:/dev/rdsk/c29t1d4
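
The same lookup can be scripted. The following is only a sketch, assuming the alert.log message has exactly the layout shown above; disk_path is a hypothetical helper, and it simply returns the most recent matching message, so it is still worth confirming that the mount it belongs to completed without errors.

```shell
# disk_path is a hypothetical helper.
# $1 = alert.log file, $2 = diskgroup number, $3 = disk number
# Prints the path from the last "cache opening disk" message for that disk.
disk_path() {
    sed -n "s/^NOTE: cache opening disk $3 of grp $2: .* path:\(.*\)\$/\1/p" "$1" | tail -n 1
}
```

With the values from the Cache Element in this example (group 2, disk 0), the call would be disk_path alert_+ASM2.log 2 0, where the alert log file name is an assumption.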

* The file number argument of error ORA-15196 (2079 in this example) indicates the ASM file involved in the problem. This can also be validated against the information printed in the trace file by searching for the words "KSTDUMP: In-memory trace dump":

KSTDUMP: In-memory trace dump
TIME(usecs):SEQ# ORAPID SID EVENT OP DATA
========================================================================
88894E39:000E0839 16 255 10495 20 kfcMoveLRU: gn=2 fn=2079 indblk=218 src=5 dest=2 line=3201
88894E39:000E083A 16 255 10495 3 kfcAddPin: pin=267 kfc.c 3289 excl bnum=189 class=0
88894E3B:000E083B 16 255 10495 10 kfcbpInit: gn=2 fn=2079 indblk=219 pin=268 excl rcvRead kfr.c 5524
88894E3C:000E083C 16 255 10495 12 kfcFlush: bnum=190 kfc.c 3179
88894E3C:000E083D 16 255 10495 11 kfcMakeFree: bnum=190 flags=00000000 kfc.c 3180
88894E3D:000E083E 16 255 10495 19 kfcMoveBucket: [ gn=2 fn=2079 indblk=26 ] --> [ gn=2 fn=2079 indblk=219 ]

From this line:

88894E39:000E0839 16 255 10495 20 kfcMoveLRU: gn=2 fn=2079 indblk=218 src=5 dest=2 line=3201

gn=2 is the diskgroup number
fn=2079 is the ASM file Number
indblk=218 is the block where the indirect extent is stored


All references in the in-memory trace dump will be to the 256 blocks of the same file, in this case 2079.
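
To confirm that the dump really refers to a single file, the gn/fn pairs can be extracted and de-duplicated. This is a sketch that assumes the "gn=N fn=N" layout of the trace lines shown above; kst_files is a hypothetical helper name.

```shell
# kst_files is a hypothetical helper: it prints each distinct
# "diskgroup-number file-number" pair found in the trace file given as $1.
kst_files() {
    sed -n 's/.*gn=\([0-9]*\) fn=\([0-9]*\).*/\1 \2/p' "$1" | sort -u
}
```

If more than one line is printed, the trace references more than one ASM file and the case deserves a closer look before any block is patched.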

Validating the content of Allocation Unit, using kfed

Using kfed to dump the blocks on the Allocation Unit referenced on the Cache Element will show invalid data:

$ kfed read /dev/rdsk/c29t1d4 aunum=7492 blknum=0 ausz=1048576 | more

kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 66 ; 0x001: 0x42
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 89088 ; 0x004: T=0 NUMB=0x15c00
kfbh.block.obj: 11626 ; 0x008: TYPE=0x0 NUMB=0x2d6a
kfbh.check: 2182659237 ; 0x00c: 0x8218bca5
kfbh.fcn.base: 4293140479 ; 0x010: 0xffe41fff
kfbh.fcn.wrap: 4294967295 ; 0x014: 0xffffffff
kfbh.spare1: 4294967247 ; 0x018: 0xffffffcf
kfbh.spare2: 4294967295 ; 0x01c: 0xffffffff

All 256 blocks (0 through 255) will have similar content. The type KFBTYP_INVALID indicates that the content/type of the block is incorrect.
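
Reading 256 dumps by hand is tedious, so the check can be looped. The sketch below uses only the kfed read arguments that appear in this note (ausz, blksz, aunum, blknum, dev); block_type and count_invalid are hypothetical helper names, and the script only reads from the disk.

```shell
# block_type: reads a kfed dump on stdin and prints the symbolic block type,
# for example KFBTYP_INVALID or KFBTYP_INDIRECT.
block_type() {
    awk '$1 == "kfbh.type:" { print $NF; exit }'
}

# count_invalid: $1 = device, $2 = AU number, $3 = AU size, $4 = block size
# Reads every block of the AU and counts those reported as KFBTYP_INVALID.
count_invalid() {
    n=$(( $3 / $4 ))
    i=0
    bad=0
    while [ "$i" -lt "$n" ]; do
        t=$(kfed read ausz="$3" blksz="$4" aunum="$2" blknum="$i" dev="$1" | block_type)
        if [ "$t" = "KFBTYP_INVALID" ]; then
            bad=$(( bad + 1 ))
        fi
        i=$(( i + 1 ))
    done
    echo "$bad of $n blocks are KFBTYP_INVALID"
}
```

For this example the call would be count_invalid /dev/rdsk/c29t1d4 7492 1048576 4096.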

These errors occur because, during file creation, ASM incorrectly commits the allocation of an indirect extent before pre-formatting the extent to contain valid blocks. If a crash occurs in the middle of this operation, recovery finds the blocks of the indirect extent unformatted (kfbh.type: 0 ; 0x002: KFBTYP_INVALID) and signals the errors already mentioned.

Solution

If the patch for Bug 7589862 is not available, the block has to be modified manually. Please follow the procedure described next carefully.

1. Download file patch.zip and copy to any directory on the server running ASM.

(If the downloaded patch.sh gives an error for any reason, copy and paste the patch.sh script shown later in this document and run it after making the necessary modifications.)

The zip file contains two files:

empty_indirect.txt: the valid format of an indirect block.
patch.sh: a shell script used to patch the Allocation Unit containing the blocks with the incorrect format.

2. Edit file empty_indirect.txt to make the following changes:

The modifications apply to a few fields of the block header.

kfbh.endian: 1 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 12 ; 0x002: KFBTYP_INDIRECT
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 2147483648 ; 0x004: T=1 NUMB=0x0
kfbh.block.obj: 2901 ; 0x008: TYPE=0x0 NUMB=0xb55

kfbh.endian:

Possible values are:

1 for little endian processors
0 for big endian processors

Here is a list of the platforms:

PLATFORM_ID	PLATFORM_NAME	ENDIAN_FORMAT
4	HP-UX IA (64-bit)	Big
1	Solaris[tm] OE (32-bit)	Big
16	Apple Mac OS	Big
3	HP-UX (64-bit)	Big
9	IBM zSeries Based Linux	Big
6	AIX-Based Systems (64-bit)	Big
2	Solaris[tm] OE (64-bit)	Big
18	IBM Power Based Linux	Big
17	Solaris Operating System (x86)	Little
12	Microsoft Windows 64-bit for AMD	Little
13	Linux 64-bit for AMD	Little
8	Microsoft Windows IA (64-bit)	Little
15	HP Open VMS	Little
5	HP Tru64 UNIX	Little
10	Linux IA (32-bit)	Little
7	Microsoft Windows IA (32-bit)	Little
11	Linux IA (64-bit)	Little
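
If the platform is not in the list, the byte order of the server can be probed directly. This is a sketch under two assumptions: that it is run on the server that hosts the ASM instance, and that the mapping is the one stated above (1 for little endian, 0 for big endian). host_kfbh_endian is a hypothetical helper name.

```shell
# host_kfbh_endian prints the kfbh.endian value matching this host's byte order.
# It writes the two bytes 0x01 0x00 and lets od read them as one 16-bit word
# in host byte order: little endian reads 0001, big endian reads 0100.
host_kfbh_endian() {
    word=$(printf '\001\000' | od -An -tx2 | tr -d ' \n')
    case "$word" in
        0001) echo 1 ;;   # little endian
        0100) echo 0 ;;   # big endian
        *)    echo "unknown byte order: $word" >&2; return 1 ;;
    esac
}
```

The value printed should agree with the expected value reported in the last argument of ORA-15196 (0 in this example); if it does not, stop and investigate before patching.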

kfbh.block.obj:

This is the ASM file number that was being created at the time of the failure. It is the third argument of error ORA-15196.

Because this example was on HP Itanium, with ASM file number 2079, the block header in file empty_indirect.txt should look like this:

kfbh.endian: 0 ; 0x000: 0x01
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 12 ; 0x002: KFBTYP_INDIRECT
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 2147483648 ; 0x004: T=1 NUMB=0x0
kfbh.block.obj: 2079 ; 0x008: TYPE=0x0 NUMB=0xb55

When modifying files generated by kfed, only the value to the left of the ';' needs to be changed. The annotations to the right (such as 0x01 and NUMB=0xb55 above) can be left as they are.
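
The edit can also be scripted rather than done in an editor. The following is a sketch only: set_field is a hypothetical helper that rewrites the value between the field name and the ';' and leaves the annotation on the right untouched, as described above.

```shell
# set_field is a hypothetical helper.
# $1 = field name (for example kfbh.block.obj), $2 = new decimal value
# Reads kfed text on stdin and writes the modified text to stdout.
set_field() {
    awk -v f="$1:" -v v="$2" '
        # replace only the text between the first ":" and the first ";"
        $1 == f { sub(/:[^;]*;/, ": " v " ;") }
        { print }
    '
}
```

For this example it could be used as: set_field kfbh.block.obj 2079 < empty_indirect.txt | set_field kfbh.endian 0 > /tmp/empty_indirect.txt, where the input location is an assumption and /tmp/empty_indirect.txt is the path that patch.sh reads.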

3. Modify script patch.sh

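# Loop 1: overwrite every block of the AU with the empty indirect block template.
i=0
while [ $i -le 255 ]
do
    echo "write block $i"
    kfed write ausz=1048576 blksz=4096 aunum=<AU#> blknum=$i dev=<path for ASM disk> text=/tmp/empty_indirect.txt
    i=`expr $i + 1`
done

# Loop 2: set kfbh.block.blk of blocks 1 through 255 to 2147483648 + block index.
# Block 0 already carries 2147483648 from the template.
i=1
while [ $i -le 255 ]
do
    echo "merge block $i"
    blk=`expr 2147483648 + $i`
    echo "kfbh.block.blk: $blk" > /tmp/merge
    kfed merge ausz=1048576 blksz=4096 aunum=<AU#> blknum=$i dev=<path for ASM disk> text=/tmp/merge
    i=`expr $i + 1`
done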

The code in file patch.sh makes two changes:

The first loop replaces every block in the allocation unit with the valid format for an indirect block.
The second loop sets the correct value of field kfbh.block.blk in each block, which includes the block number.

This script needs to be adapted for every particular case. The changes required are:

aunum=<AU#>.

The Allocation Unit number is reported in the trace file generated by errors ORA-600 and ORA-15196, in the CE and BH area. It is on the last line of the CE dump, just before the BH.

CE: (0xc0000000153d0bb8) group=2 (FLASH) obj=2079 blk=0 (indirect)
hashFlags=0x0100 lid=0x0002 lruFlags=0x0000 bastCount=1
redundancy=0x11 fileExtent=0 AUindex=0 blockIndex=0
copy #0: disk=0 au=7492

In this example it is Allocation Unit 7492.

dev=<path for ASM disk>

This is the full path of the ASM disk. The CE dump shows the disk number together with the Allocation Unit number. The earlier part of this note explains how to find the complete path of the disk by reviewing the alert.log of the ASM instance. Using the v$asm* views is not an option because the diskgroup is dismounted.

ausz=1048576.

It is essential to specify the correct Allocation Unit size of the diskgroup.
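
The 256 blocks handled by patch.sh follow from the two sizes passed to kfed: 1048576 / 4096 = 256. A short sketch of that arithmetic is shown below; blocks_per_au is a hypothetical helper. That a different Allocation Unit size would require a different loop bound in patch.sh is an inference from this arithmetic, not something the note states.

```shell
# blocks_per_au is a hypothetical helper.
# $1 = AU size in bytes (ausz), $2 = metadata block size in bytes (blksz)
blocks_per_au() {
    if [ $(( $1 % $2 )) -ne 0 ]; then
        echo "AU size $1 is not a multiple of block size $2" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

n=$(blocks_per_au 1048576 4096)
echo "patch.sh covers blocks 0 through $(( n - 1 ))"
```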

For this example, the version of patch.sh will be:

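# Loop 1: overwrite every block of AU 7492 with the empty indirect block template.
i=0
while [ $i -le 255 ]
do
    echo "write block $i"
    kfed write ausz=1048576 blksz=4096 aunum=7492 blknum=$i dev=/dev/rdsk/c29t1d4 text=/tmp/empty_indirect.txt
    i=`expr $i + 1`
done

# Loop 2: set kfbh.block.blk of blocks 1 through 255 to 2147483648 + block index.
# Block 0 already carries 2147483648 from the template.
i=1
while [ $i -le 255 ]
do
    echo "merge block $i"
    blk=`expr 2147483648 + $i`
    echo "kfbh.block.blk: $blk" > /tmp/merge
    kfed merge ausz=1048576 blksz=4096 aunum=7492 blknum=$i dev=/dev/rdsk/c29t1d4 text=/tmp/merge
    i=`expr $i + 1`
done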
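
Because kfed write and kfed merge change the disk directly, it can be useful to review the exact commands before running anything. The sketch below is a dry run only: patch_plan is a hypothetical helper that prints the same commands as patch.sh, built from the arguments it is given, and executes none of them.

```shell
# patch_plan is a hypothetical dry-run helper: it prints commands, runs nothing.
# $1 = AU number, $2 = device path, $3 = AU size, $4 = block size
patch_plan() {
    last=$(( $3 / $4 - 1 ))
    i=0
    while [ "$i" -le "$last" ]; do
        echo "kfed write ausz=$3 blksz=$4 aunum=$1 blknum=$i dev=$2 text=/tmp/empty_indirect.txt"
        i=$(( i + 1 ))
    done
    i=1
    while [ "$i" -le "$last" ]; do
        # the value patch.sh would place in /tmp/merge for this block
        echo "# /tmp/merge: kfbh.block.blk: $(( 2147483648 + i ))"
        echo "kfed merge ausz=$3 blksz=$4 aunum=$1 blknum=$i dev=$2 text=/tmp/merge"
        i=$(( i + 1 ))
    done
}
```

For this example, patch_plan 7492 /dev/rdsk/c29t1d4 1048576 4096 | more lists the write and merge commands so that the AU number, device path and sizes can be checked against the trace file and alert.log first.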

4. Execute script patch.sh

5. Validate that the blocks in the Allocation Unit now have the format of indirect extent blocks

Continuing with the example used in this note:

kfed read ausz=1048576 blksz=4096 aunum=7492 blknum=0 dev=/dev/rdsk/c29t1d4 |more

The output should look like this:

kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 130 ; 0x001: 0x82
kfbh.type: 12 ; 0x002: KFBTYP_INDIRECT
kfbh.datfmt: 1 ; 0x003: 0x01
kfbh.block.blk: 2147483648 ; 0x004: T=1 NUMB=0x0
kfbh.block.obj: 2079 ; 0x008: TYPE=0x0 NUMB=0x81f
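
The same comparison can be automated for any block of the Allocation Unit. This is a sketch: check_indirect is a hypothetical helper that reads a kfed dump on stdin and succeeds only if the type is KFBTYP_INDIRECT, the file number matches, and kfbh.block.blk equals 2147483648 plus the block index, which is the value patch.sh writes.

```shell
# check_indirect is a hypothetical helper.
# $1 = expected ASM file number, $2 = block index within the AU
# Reads a kfed dump on stdin; exit status 0 means the header looks as expected.
check_indirect() {
    awk -v obj="$1" -v blk="$(( 2147483648 + $2 ))" '
        $1 == "kfbh.type:"      { type = $NF }
        $1 == "kfbh.block.blk:" { gotblk = $2 }
        $1 == "kfbh.block.obj:" { gotobj = $2 }
        END {
            ok = (type == "KFBTYP_INDIRECT" && gotblk == blk && gotobj == obj)
            exit(ok ? 0 : 1)
        }
    '
}
```

For block 0 of this example it could be used as: kfed read ausz=1048576 blksz=4096 aunum=7492 blknum=0 dev=/dev/rdsk/c29t1d4 | check_indirect 2079 0 && echo "block 0 OK".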

6. After this, the diskgroup should operate without problems.
