I created in total 11345 tables, each with 128 columns, and then hit this bug.
When all tables were created, I stopped one node and measured how long time it takes for the other node to perform the node failure handling. Here is what was written into the cluster log (look at the bold lines):
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Node 3 Disconnected
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Network partitioning - arbitration required
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: President restarts arbitration thread [state=7]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 closed
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Arbitration won - positive reply from node 1
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: GCP Take over started
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: Node 4 taking over as DICT master
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: GCP Take over completed
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: kk: 9670/4 0 0
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: LCP Take over started
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: LCP Take over completed (state = 5)
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 3: Node shutdown completed.
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-11-27 13:39:22 [MgmtSrvr] INFO -- Node 4: Started arbitrator node 1 [ticket=2a74000295047d39]
2009-11-27 13:40:20 [MgmtSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 1 min. - state = 6
2009-11-27 13:41:03 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 opened
Now, the node failure handling was completed in 1 minute 42 seconds! This is quite some improvement from the 17 minutes (although measured with 16000 tables and I will comeback when the bug mentioned above is fixed) measured before Jonas made his magic.
What does it mean? It means that we can start to recover a failed data nodes much earlier now!
No comments:
Post a Comment