Wednesday, September 30, 2009

Upgrade to 7.0.8 (with the Configurator)

MySQL Cluster 7.0.8 was released as a source distribution on the 30th of September 2009. You should upgrade if you can build from source or use the Configurator.

See below how to upgrade when using the Configurator.

7.0.8 contains a number of fixes in almost every area, and you are recommended to upgrade if you have experienced problems with earlier versions, especially if you have encountered the "error 306" problem.

The Configurator v2.9 has been updated to use this version.

If you are already using the Configurator and build from source, you can upgrade from MySQL Cluster 7.0.7 to 7.0.8 in four steps (and you are recommended to do so because of the fixes mentioned above). Here is how:

1. Run the upgrade-707-to-708-src.sh script (put it in install/ and make it executable with chmod u+x):
[cluster01]# pwd
/root/mysqlcluster-70-master/cluster/scripts/install
[cluster01]# chmod u+x ./upgrade-707-to-708-src.sh
[cluster01]# ./upgrade-707-to-708-src.sh
Upgrading scripts
done - now run ./download-and-compile.sh
2. Run the download-and-compile.sh script:
[cluster01]# ./download-and-compile.sh
After some time it will finish compiling:

mysql-7.0.8-linux-x86_64/sql-bench/test-alter-table
mysql-7.0.8-linux-x86_64.tar.gz created
Removing temporary directory
Downloaded binary distribution to ../../repo
(the 'Downloaded binary' message should really read 'Copied binary').

3. When you are prompted with the following question, answer 'y' (yes):
Do you want to install the binaries on the hosts now
(else run 'install-cluster.sh' later)? (y/n):y
Important! Do not run the bootstrap.sh script!
4. Change directory to the scripts/ directory and run the ./rolling-restart.sh script:
[cluster01]# pwd
/root/mysqlcluster-70-master/cluster/scripts/
[cluster01]# ./rolling-restart.sh
This is an online procedure, so it requires no downtime for the Cluster.
If you are using binary distributions, you have to wait a bit longer; they are not ready yet.
Good luck!

PS: Configurator 2.9 (released 2nd of September 2009) contains some minor fixes:
  • An extra [mysqld] slot has been added for administration purposes
  • ./start-backup now backs up the configuration files (config.ini, my.cnf and some internal files) as well.

Wednesday, September 16, 2009

Node failure handling

When a data node fails, the other data nodes must perform clean-up of the failed node.

For example, transactions in the PREPARE phase (Cluster uses a variant of the two-phase commit protocol) that were started on the failed data node must be aborted.
Various protocols must also fail over if the failed data node was the master.

These things are usually carried out quite swiftly: a couple of seconds by default (the timeouts are configurable), but it also depends on the size of the transactions in the prepare phase that need to be aborted.
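
To give an idea of the knobs involved, failure detection between data nodes is heartbeat driven. A minimal config.ini sketch follows; the value shown is the documented default as far as I know, and is only meant as an illustration:

[ndbd default]
# A data node is declared dead after about four missed heartbeats,
# i.e. roughly 4 x HeartbeatIntervalDbDb milliseconds, after which
# the surviving nodes start node failure handling.
HeartbeatIntervalDbDb=1500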

I wanted to check how long node failure handling takes with many tables (16000+) in the cluster, because in a "normal" database with a couple of hundred tables, the time spent on node failure handling is not a factor.

In this case I have two data nodes:
  • Node 3 (master)
  • Node 4
I created 16000+ tables, each with 128 columns, just to see what happens with failure handling and recovery.

Node 3 was killed, and node 4 then has to do node failure handling:
2009-09-16 14:23:41 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 15 min. - state = 6

2009-09-16 14:24:41 [MgmSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 16 min. - state = 6

2009-09-16 14:25:41 [MgmSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 17 min. - state = 6

Eventually, we get the message:
2009-09-16 14:26:13 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
We cannot start recovering node 3 until this message has been written to the cluster log!

If you try to connect the data node before this, the management server will say something like "No free slot found for node 3", which in this case basically means that node failure handling is still ongoing.

Now that we have got the message, we can start data node 3 again and it will recover, as sketched below.
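
A rough sketch of the procedure (the connect string and log path are examples, not from my setup):

# on any host, watch the cluster log and the node status:
tail -f /path/to/ndb_1_cluster.log      # wait for "Communication to Node 3 opened"
ndb_mgm -c cluster01 -e "show"          # node 3 should now show up as a free, not-connected slot
# on the failed host, once the message has appeared:
ndbd -c cluster01                       # a plain restart; no --initial is needed here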

Why does it take such a long time?
  • The surviving data nodes that handle the failure must record, for each table, that node 3 has crashed, and this is written to disk. This is what takes time when you have many tables: about 17 minutes with >16000 tables, i.e. roughly 1000 tables per minute. So in most databases it does not take that much time. Thanks Jonas for your explanation.
However, it is important to understand that during this time it is STILL POSSIBLE to READ from and WRITE to the Cluster, i.e., the database is online.

Thursday, September 03, 2009

MySQL Cluster on two hosts - options and implications

Consider that you have the following two hosts, with MySQL Cluster deployed like this:

A: ndb_mgmd, ndbd, mysqld
B: ndb_mgmd, ndbd, mysqld

Now you have a couple of options for handling the potential network partitioning/split brain that could occur if hosts A and B lose contact with each other. Let's look at the implications of these options, starting from the baseline configuration sketched below.
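
As a baseline, a minimal config.ini for this two-host layout could look something like this (host names are examples; the options below mainly differ in how ArbitrationRank is set and where the management servers run):

[ndbd default]
NoOfReplicas=2

[ndb_mgmd]
HostName=A
ArbitrationRank=1     # varied per option below

[ndb_mgmd]
HostName=B
ArbitrationRank=1     # varied per option below

[ndbd]
HostName=A

[ndbd]
HostName=B

[mysqld]
HostName=A

[mysqld]
HostName=B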

Option 1: ArbitrationRank=1, no NIC bonding
  • If ndb_mgmd on A is elected as the arbitrator and host A crashes, then the data node on host B will also die (it cannot reach the arbitrator).
  • If ndb_mgmd on A is elected as the arbitrator and host B loses contact with host A, then the data node on B will shut down since it cannot reach the arbitrator. For further discussion, call this situation X.
  • If you are in situation X and you restart the data node on B (and you have not fixed the link), then it will start isolated after some timeouts have kicked in (StartPartitionedTimeout and StartPartialTimeout). Now you are in situation Y.
  • If you had an application writing data (and assume this application crashed when the connection from B to A was lost), then the following can happen in situation Y:
  • A "select count(*) from t1" on the mysqld on host A can differ from a "select count(*) from t1" on the mysqld on host B, because the data node on B never got the message to flush the redo buffer to the redo log.
    Now you have ended up in a very bad situation. Kill the data node on host B, fix the network, and restart the data node with --initial.
  • However, this could be combined with STONITH to kill off one of the data nodes.
Option 2: ArbitrationRank=0, no NIC bonding
  • no arbitrators are used - no protection from split brain.
  • if host A loses contact with host B, then the data nodes will be split
  • Applications can write to both A and B, but A will _never_ see the changes made on B, and vice versa
  • The data nodes will drift apart and become out of sync (inconsistent). Very bad.
  • When the link comes up again, the data nodes on A and B will not reconnect to each other.
  • The system administrator must kill one of the data nodes and restart it with --initial (see the sketch below).
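
A minimal sketch of that recovery (host names and the connect string are examples):

# on the host whose data node you decide to sacrifice, say B:
killall ndbd                  # or stop it via ndb_mgm
# ... fix the network between A and B ...
# restart it with --initial so it throws away its diverged data and
# copies everything back from the data node on A:
ndbd -c A,B --initial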

Option 3: ArbitrationRank=0, NIC bonding + redundant switch.
  • no arbitrators are used - no protection from split brain.
  • Same problems as Option 2) if a) both network links are broken or b) both switches crash. However, the likelihood of this happening is very small.
Option 4: Arbitration=WaitExternal (new in MySQL Cluster 7.0.7)
  • An external arbitrator is used: you, a sysadmin, a DBA, or a process.
  • When the data nodes lose contact with each other, they are read-only for ArbitrationDelay before unlocking (and becoming writable again).
  • This means that you have ArbitrationDelay (configurable) to decide which data node should live and which should die.
  • The data node that you killed can then be restarted with "ndbd", i.e., without --initial (see the sketch below).
  • "You" in this case can be an external program, a sysadmin, etc.
  • If you use an external program, it must run on a third computer.
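
A minimal sketch of the manual decision (host names are examples; note that, unlike Option 2, no --initial is needed):

# within the ArbitrationDelay window, sacrifice one side, say the data node on B:
killall ndbd                  # on host B
# ... fix the network between A and B ...
ndbd -c A,B                   # restart on host B; a plain node restart is enough
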
Option 5: ArbitrationRank=1 and have ndb_mgmd on remote site
  • If you use asynchronous replication between SITE_A and SITE_B, then it could be possible to have the management server on the remote site with ArbitrationRank=1, and then potentially local management servers on hosts A and B that only hold configuration data (i.e., ArbitrationRank=0 for those).
  • This must be tested to see if the link between SITE_A and SITE_B is "good enough"; the link must also be reliable.
Option 6: ArbitrationRank=1 and have a third host (called C)
  • This is the best and recommended approach (and better still if you also have a computer D, similar to C, for redundancy).
  • Run the management server on host C. The setup would look like this (a config.ini sketch follows below):
    A: ndbd + mysqld
    B: ndbd + mysqld
    C: ndb_mgmd (ArbitrationRank=1) + mysqld
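
A minimal config.ini sketch of that setup (host names are examples):

[ndbd default]
NoOfReplicas=2

[ndb_mgmd]
HostName=C
ArbitrationRank=1     # C arbitrates if A and B lose contact with each other

[ndbd]
HostName=A

[ndbd]
HostName=B

[mysqld]
HostName=A

[mysqld]
HostName=B

[mysqld]
HostName=C
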
And the winner is...

What is the "best" option:
  1. Option 6) - get a third computer - simple
  2. Option 3), Option 4) - "you do it, but be careful"
  3. Option 5) - must be tested
  4. Option 1) (at least the cluster will crash, avoiding inconsistencies)
  5. Option 2)
I am arguing with myself whether Option 1) should be in 2nd or 3rd place; at least it is simple and easy to understand.

Of course, redundant NICs (bonding) and redundant switches can be used in all of the options above, even if it is not explicitly said for some of them.

Upgrade to 7.0.7 (with the Configurator)

MySQL Cluster 7.0.7 was released as a source distribution on the 1st of September 2009. You should upgrade if you can build from source or use the Configurator.

The Configurator v2.9 has been updated to use this version.

If you are already using the Configurator and build from source, you can upgrade from MySQL Cluster 7.0.6 to 7.0.7 in four steps (and you are recommended to do so). Here is how:

1. Run the upgrade-706-to-707-src.sh script (put it in install/ and make it executable with chmod u+x):
[cluster01]# pwd
/root/mysqlcluster-70-master/cluster/scripts/install
[cluster01]# chmod u+x ./upgrade-706-to-707-src.sh
[cluster01]# ./upgrade-706-to-707-src.sh
Upgrading scripts
done - now run ./download-and-compile.sh
2. Run the download-and-compile.sh script:
[cluster01]# ./download-and-compile.sh
After some time it will finish compiling:

mysql-7.0.7-linux-x86_64/sql-bench/test-alter-table
mysql-7.0.7-linux-x86_64.tar.gz created
Removing temporary directory
Downloaded binary distribution to ../../repo
(the 'Downloaded binary' message should really read 'Copied binary').

3. When you are prompted with the following question, answer 'y' (yes):
Do you want to install the binaries on the hosts now
(else run 'install-cluster.sh' later)? (y/n):y
Important! Do not run the bootstrap.sh script!
4. Change directory to the scripts/ directory and run the ./rolling-restart.sh script:
[cluster01]# pwd
/root/mysqlcluster-70-master/cluster/scripts/
[cluster01]# ./rolling-restart.sh
This is an online procedure, so it requires no downtime for the Cluster.
If you are using binary distributions, you have to wait a bit longer; they are not ready yet.
Good luck!

PS: Configurator 2.9 (released 2nd of September 2009) contains some minor fixes:
  • An extra [mysqld] slot has been added for administration purposes
  • ./start-backup now backs up the configuration files (config.ini, my.cnf and some internal files) as well.