Wednesday, March 07, 2007

Benchmarking node restart time in 5.1

When you have a node failure, the best practice is to restart the failed ndbd node as soon as possible.
This post illustrates what happens with restart times:
  • if you have configured TimeBetweenLocalCheckpoints incorrectly and have a high load
  • if you don't restart the failed ndbd node immediately
  • how long it takes to resync 1GB of data, with and without ongoing transactions.

5.0 and 5.1 differences in node recovery protocol

In 5.0, the node recovery protocol copies all data from the other node in its node group.
In 5.1, there is a new node recovery protocol, called "Optimized Node Recovery" (ONR below).
When a node has failed and is restarted, it will recover from its local logs (Local Checkpoint (LCP) + redo log) and only copy the changes that have been made on the running node since the failure.
This is faster than copying all the information, as is done in 5.0, but a full copy can also happen in 5.1, as we will see.


Current limitations of Optimized Node Recovery

However, if the ndbd node is down for three consecutive Local Checkpoints (LCPs), then it will not be possible to run the ONR protocol, and the restart falls back to copying all information. This is because both ndbd nodes must have roughly the same view of the redo log, and the redo log stored locally on the failed node is deemed completely obsolete after three LCPs have completed on the running node.


How does this affect configuration?


The start of LCPs depends on the load. With the default configuration, 4MB of changes triggers an LCP. If you have a node failure during high load, the likelihood of three LCPs starting before the failed ndbd node has recovered increases. The failed node's local redo log is then invalidated, and the recovering node has to copy all data from the other node.
Therefore it is important to set TimeBetweenLocalCheckpoints to a higher value than the default on highly loaded systems.
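
As a quick sanity check of the numbers used in this post (based on how the parameter is documented: an LCP is triggered after roughly 4 * 2^TimeBetweenLocalCheckpoints bytes of write operations):

4 * 2^20 bytes = 4MB   (the default, TimeBetweenLocalCheckpoints=20)
4 * 2^24 bytes = 64MB  (TimeBetweenLocalCheckpoints=24, used later in this post)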


Verifying that TimeBetweenLocalCheckpoints is set correctly

This can easily be benchmarked and verified by running:
ndb_mgm> ALL CLUSTERLOG CHECKPOINT=8
which will print out when an LCP is started and completed in the cluster log, and:
ndb_mgm> ALL CLUSTERLOG NODERESTART=15
which will print out restart info in the cluster log.

When you then have a node failure (or shut down a node yourself), verify that three LCPs are _not_ started before the failed ndbd node has recovered. If three LCPs are started before the failed ndbd node has recovered, then you should increase TimeBetweenLocalCheckpoints to suit the load you have.
I found TimeBetweenLocalCheckpoints=24 (64MB of changes to trigger an LCP start) to be a good value on a decently loaded system.
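
As a sketch of a simple way to run this check (node id 3 is just an example; use a data node id from your own cluster):

ndb_mgm> ALL CLUSTERLOG CHECKPOINT=8
ndb_mgm> ALL CLUSTERLOG NODERESTART=15
ndb_mgm> 3 RESTART

Then follow the cluster log (by default ndb_<mgm_node_id>_cluster.log in the management server's DataDir) and count how many LCPs start before node 3 reports that it has started.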

You can also verify, by looking at the ndb_X_out.log of the restarting node, that it is using the RESTORE block, which reads in its local data instead of copying all data from the other ndbd node. If you see RESTORE printouts there while the node is recovering, then you are in good shape.
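
A quick way to check this (assuming the restarting node has node id 3; adjust the file name and path for your setup):

shell> grep -i restore ndb_3_out.log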

If you are in the population phase of the database, I also suggest that you run:
ndb_mgm> all dump 7099
when the population is done. This triggers an LCP, just to be on the safe side, so that all ndbd nodes have completed at least one LCP. Please note that all dump commands carry a certain risk and are not yet productized.


Benchmark results of node restart time

I used the following values in config.ini for these tests:

TimeBetweenLocalCheckpoints=24
DiskCheckpointSpeed=[default]
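
For reference, a minimal sketch of how this looks in config.ini (NoOfReplicas=2 is assumed here for the two-node setup; the rest of the configuration is unchanged):

[ndbd default]
NoOfReplicas=2
TimeBetweenLocalCheckpoints=24
# DiskCheckpointSpeed left at its default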


Test 1:
  • 1GB of data in database,
  • 2 NDB nodes,
  • no ongoing transactions
Optimized node recovery = 3 min 3 sec (~2 min spent writing the initial LCP; the speed of this can be tuned with DiskCheckpointSpeed)

Node recovery = 5 min 27 sec (~2 min spent writing the initial LCP; the speed of this can be tuned with DiskCheckpointSpeed)

Test 2:
  • 1GB of data in database
  • 2 NDB nodes,
  • ~2000 updates/sec, each updating 512B (~1MB/s)
Optimized node recovery = 4 min 42 sec (~2 min spent writing the initial LCP; the speed of this can be tuned with DiskCheckpointSpeed)

Node recovery = 7 min 24 sec (~2 min spent writing the initial LCP; the speed of this can be tuned with DiskCheckpointSpeed)


Conclusion

Even when you take out the cost of the initial LCP that must be written after all data has been resynced on the recovering ndbd node, there is a big difference between Optimized Node Recovery and Node Recovery.

Thus, by having good disks so that checkpoints can be written faster, and by carefully tuning:
  • DiskCheckpointSpeed
  • TimeBetweenLocalCheckpoints
it is possible to reduce node restart times quite a bit.


Test setup

Tests were carried out on a dual-CPU, dual-core AMD Opteron 275 at 1.8GHz, using the following tables:

create table t1(a char(32), b integer unsigned not null, c char(32), d integer unsigned not null, e char(220), f char(220), primary key(a,b), index (d)) engine=ndb;

create table t2(a char(32), b integer unsigned not null, c char(32), d integer unsigned not null, e char(220), f char(220), primary key(a,b), index (d)) engine=ndb;

Each table was filled with 1M records using hugoLoad, and the updating transactions were generated with hugoPkUpdate.
These two programs can be found in the 5.1 source distribution.
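
As a rough sanity check of the 1GB figure (ignoring per-row storage overhead), the fixed-size columns of each row add up to about 512 bytes, which also matches the 512B update size used in Test 2:

32 + 4 + 32 + 4 + 220 + 220 = 512 bytes per row
512 bytes * 1M rows * 2 tables ≈ 1GB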