Email: service@parnassusdata.com 7 x 24 online support!

    You are here

Steps to troubleshoot Oracle RAC clusterware GIPCHA connection failure

Steps to troubleshoot Oracle RAC clusterware GIPCHA connection failure

PURPOSE
How to troubleshoot when a local process (non bootstrap client) fails to CONNECT to a peer process using the GIPCHA communication protocol?
 
 
 
TROUBLESHOOTING STEPS
Lets assume that P1 process running on node-1 and P2 process running on node-2 and P1 process made attempt to connect P2 process (assume the connection string is 
gipcha://node-2:foo) using the gipcha communication protocol at timestamp t1 and this CONNECTION was not succeeded . Root cause the CONNECT failure reason as below.
 
1. Check the aliveness of P2 process and GIPCDaemon process on node-2 at timestamp t1. If they are not alive, then it means P1 tried to CONNECT to a non existent process and because of that the CONNECTION request failed.
 
2. If P2 process is alive but GIPCD-2 (GIPCDaemon which runs on node-2) was not alive, then we need to investigate why GIPCDaemon was not spawned by AGENT on node-2. 
 
3. If GIPCDaemon was not running on node-1, then we need to investigate why GIPCDaemon was not spawned by AGENT on node-1.
 
4. If P1, P2, GIPCD-1 (GIPCDaemon which runs on node-1) and GIPCD-2 (GIPCDaemon which runs on node-2) all alive then try the below steps.
 
(a) Investigate whether P2 process is listening at connection string "gipcha://node-2:foo or not at timestamp t1. The log files of GIPCD-2 will help us to identifying this info.
 
 When P2 process creates listen endpoint gipcha://node-2:foo  then you will see log messages in GIPCD-2 which looks as something as below.
 
Example log of GIPCDaemon:
 
2016-05-13 03:15:22.887 :GIPCDCLT:889784064:  gipcdClientThread: req from local client of type gipcdmsgtypeCreateName, endp 0000000000000105
2016-05-13 03:15:22.887 :GIPCDCLT:889784064:  gipcdClientCreateName: Received type(gipcdmsgtypeCreateName), endp(0000000000000105), len(1008), buf(0x7fcd2c0ea1b8):[hostname(node-2), portstr: (foo), haname(d470-975b-4d7b-5095), retStatus(gipcretSuccess)]
2016-05-13 03:15:22.887 :GIPCDCLT:889784064:  gipcdInitPortEntry: port foo entry initialized. port memid : 0000050800000000
2016-05-13 03:15:22.887 :GIPCDCLT:889784064:  gipcdAddPortEntry: added port foo entry to shared memory. port mid: 0000050800000000client memid 000003c800000000, client 5983 incarnation 2
 
 
If node-2 deletes/closed the same listen endpoint, then you will see the below log messages.
 
 
Example log of GIPCDaemon:
 
2016-05-13 03:16:42.962 :GIPCDCLT:889784064:  gipcdClientDeleteName: Received type(gipcdmsgtypeDeleteName), endp(0000000000000105), len(1008), buf(0x7fcd2c0fb728):[hostname(node-2), portstr: (foo), haname(d470-975b-4d7b-5095), retStatus(gipcretSuccess)]
2016-05-13 03:16:42.962 :GIPCDCLT:889784064:  gipcdClientDeleteName: Name deleted(foo)
2016-05-13 03:16:42.962 :GIPCDCLT:889784064:  gipcdDelPortEntry: port foo entry deleted from shared memory. port memid: 0000050800000000
 
In the GIPCDaemon logs , if you see only LISTEN endpoint creation logs but no deletion/close  logs  before timestamp t1, then it means P2 process successfully listening at time stamp t1.
 
(b) If P2 process SUCCESSFULLY listening at the LISTEN endpoint, then we need to investigate whether GIPCD-1 SUCCESSFULLY resolved the P2 connection string or not.
 
 When P1 process makes attempt to CONNECT P2 process, then it needs to resolve the P2 connection string (gipcha://node-2:foo) and for that it sends a LOOKUP request to GIPCD-1.
 
Example log GIPCDaemon:
 
2016-05-13 03:15:52.204 :GIPCDCLT:889784064:  gipcdClientThread: req from local client of type gipcdmsgtypeLookupName, endp 0000000000000105
2016-05-13 03:15:52.204 :GIPCDCLT:889784064:  gipcdClientLookupName: Received type(gipcdmsgtypeLookupName), endp(0000000000000105), len(1008), buf(0x7fcd2c0e9898):[hostname(node2), portstr: (foo), haname(), retStatus(gipcretSuccess)]
 
If GIPCD-1 fails to resolve it locally, then it forwards the lookup request to GIPCD-2. And for forwarding the same it needs to CONNECT GIPCD-2. If GIPCD-1 already connected to GIPCD-2, then immediately it sends the LOOKUP request, if it is not connected, then it first connects to GIPCD-2 and it sends the lookup request as below.
 
2016-05-13 03:15:52.204 :GIPCDCLT:889784064:  gipcdEnqueueMsgForNode: Enqueuing for NodeThread (gipcdReqTypeLookupName)
2016-05-13 03:15:52.204 :GIPCDNDE:887682816:  gipcdProcessClientRequest: Dequeued req for host (node-2), type(gipcdReqTypeLookupName), id (0000000000000105, 0000000000000000), cookie 0x7fcd2c2c5738
2016-05-13 03:15:52.204 :GIPCDNDE:887682816:  gipcdSendReq: recvd msg clnt header: (req: 0x7fcd2c0ea240 [hostname(node-2), id (0000000000000105, 0000000000000000), len(392), req cookie(00007fcd2c2c5738), type(gipcdReqTypeLookupName)])
 
2016-05-13 03:15:52.206 :GIPCDNDE:887682816:  gipcdEnqueueSendReq: Enqueuing the msg in the pending send request for host node-2, msg 0x7fcd2c0ea240
2016-05-13 03:15:52.206 :GIPCDNDE:887682816:  gipcdSendReq: Enqueued the request and waiting for connection to complete with host node-2
 
>>>once GIPCD-1 successfully connects to GIPCD-2, then it forwards the LOOKUP request to GIPCD-2. If GIPCD-1 fails to CONNECT GIPCD-2, then please ask GIPC team to investigate it.
 
2016-05-13 03:15:52.219 :GIPCDNDE:887682816:  gipcdNodeThread: Connection established with hostname node-2, endp: 0000000000001b3a
2016-05-13 03:15:52.220 :GIPCDNDE:887682816:  gipcdNodeSendReq: Sending using id 0000000000001b3a, (nodehdr: req:0x7fcd200b43d0  [version(4107), len(3087073280), type((uknown)), req cookie(0000000000000002)] flags 335544320 clnthdr: req: 0x7fcd2c0ea240 [hostname(node-2), id (0000000000000105, 0000000000000000), len(392), req cookie(00007fcd200b4148), type(gipcdReqTypeLookupName)])
 
>>>After sending the LOOKUP request to GIPCD-2, GIPCD-2 mostly comes back with LOOKUPACK request as below. If LOOKUP ACK is not received then please ask GIPC team to investigate it.
 
2016-05-13 03:15:52.222 :GIPCDNDE:887682816:  gipcdNodeThread: Msg received from endp 0000000000001b3a, req:0x7fcd1c09f438  [version(185597952), len(600), type(gipcdReqTypeLookupNameAck), req cookie(0000000000000002)] flags 20
 
>>> P2 is SUCCESSFULLY listening at LISTEN endpoint but still GIPCD-2 failed to resolve the connection string, then please ask GIPC team to investigate it.
 
(c) If P1 process SUCCESSFULLY resolved the P2 connection string, then you will a message in P1 process which is looks as below.
 
2016-05-13 03:22:13.893 :GIPCHDEM:3670411008: gipchaDaemonCreateeesolveResponse: creating resolveResponse for host: node-2, port:foo, haname:475b-13b9-0f5b-aa6e, ret:0
 
>>>If P1 process SUCCESSFULLY resolved the connection string, then "RET" should be ZERO
 
(d) If P1 process SUCCESSFULLY resolved the P2 connection string but still P1 failed to CONNECT P2 , then we need to investigate whether P1 successfully fetched the P2 interfaces or not. Find the UDP interfaces of P1 and UDP interfaces of P2. When a process creates UDP interfaces for GIPCHA communication, then you will see a log message which looks as below.
 
2016-05-13 03:22:13.828 :GIPCHTHR:3671987968: gipchaWorkerCreateInterface: created local interface for node 'node-1', haName 'CSS_aime1adc00rqh', inf 'udp://10.232.130.185:54140' inf 0x7fcea02cffd0
 
   
If P1 Process fetches the interfaces of P2 process then you will see a log message which looks as below
 
2016-05-13 03:22:13.897 :GIPCHTHR:3671987968: gipchaWorkerCreateInterface: created remote interface for node 'node-2', haName '475b-13b9-0f5b-aa6e', inf 'udp://10.232.130.185:64368' inf 0x7fcea02bc540
2016-05-13 03:22:13.897 :GIPCHGEN:3671987968: gipchaWorkerAttachInterface: Interface attached inf 0x7fcea02bc540 { host 'node-2', haName '475b-13b9-0f5b-aa6e', local 0x7fcea02cffd0, ip '10.232.130.185:64368', subnet '10.232.128.0', mask '255.255.248.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 }
 
(e) If P1 process SUCCESSFULLY resolved the P2 process connection string and it successfully fetches the correct interface of P2 process, but still connection is FAILED. Then in that case, please investigate whether P1 process SENT the CONNECT request or not, similarly check whether P2 process ACCEPTED the CONNECT request of P2 or not.
 
If P1 process SUCCESSFULLY sent the CONNECT request then you will see a log message which looks as below.
 
2016-05-13 03:29:08.591 :GIPCHAUP:1253738240: gipchaUpperConnect: initiated connect for umsg 0x7f3f2008cca0 { msg 0x7f3f2005a260, ret gipcretRequestPending (15), flags 0x6 }, msg 0x7f3f2005a260 { type gipchaMsgTypeConnect (3), srcPort '43a4-4456-3ec9-7c86', dstPort 'foo', srcCid 00000000-0000047e, cookie 00007f3f-2008cca0 } dataLen 0, endp 0x1b51380 [000000000000047e] { gipchaEndpoint : port '43a4-4456-3ec9-7c86', peer ':', srcCid 00000000-0000047e,  dstCid 00000000-00000000, numSend 0, maxSend 100, groupListType 1, hagroup 0x17ea3d0, priority 0, forceAckCount 0, usrFlags 0x4000, flags 0x0 } node 0x7f3f143029e0 { host 'aime1-adc00rqh-0', haName '64fb-4631-033f-fb7a', srcLuid dc93ae6c-9bd26476, dstLuid 00000000-00000000 numInf 0, sentRegister 1, localMonitor 0, baseStream 0x7f3f14302e20 type gipchaNodeType12001 (20), nodeIncarnation 395ca385-154924a4, incarnation 2, cssIncarnation 0, roundTripTime 4294967295 lastSeenPingAck 0 nextPingId 1 latencySrc 0 latencyDst 0 flags 0x80000}
 
>>Please check the status of CONNECT umsg "0x7f3f2008cca0"  in P1 process log
 
 
In P2 process logs check whether P2 process accepted the P1 connect or not. If P2 Process accepted the CONNECT request then you will see a log message which looks as below.
 
 
2016-05-13 03:38:45.986 :GIPCHAUP:2620307200: gipchaUpperProcessAccept: completed new hastream 0x7f7c44062a80 { host 'node-1', haName '6304-ac42-e1c3-579e' srcStreamId 00000000-00008526 dstStreamId 00000000-000006e2 , hendp (nil) haNode 0x7f7c440642c0 numInf 1, contigSeq 1, lastAck 0, lastValidAck 0, sendSeq [1 : 1], priority 0,  duplicate recv 0, completed recv 0, completed send 0, total send 0, total recv 1, flags 0x1}  for hendp 0x7f7c44065070 [0000000000008526] { gipchaEndpoint : port 'foo/1910-9a85-f387-c449', peer ':', srcCid 00000000-00008526,  dstCid 00000000-00000000, numSend 0, maxSend 100, groupListType 1, hagroup 0x20731e0, priority 0, forceAckCount 0, usrFlags 0x4000, flags 0x0 }
 
 
(f) Even after trying all the above steps still if it is not clear why CONNECT failed, then please ask GIPC team to investigate it.