Thursday, September 03, 2009

MySQL Cluster on two hosts - options and implications

Considering that you have the following hosts and MySQL Cluster deployed in the following ways:

A: ndb_mgmd, ndbd , mysqld
B: ndb_mgmd, ndbd , mysqld

Now you have a couple of options on how to handle potential network partitioning/split brain which could occur if the computer A and B lose contact with each other. Let's see the implications of these options.

Option 1: ArbitrationRank=1, no NIC bonding
  • If ndb_mgmd on A is elected as the arbitrator and host A crashes, then the data node on host B will also die (it cannot reach the arbitrator).
  • If ndb_mgmd on A is elected as the arbitrator and host B loses contact with host A, then the data node on B will shut down since it cannot reach the arbitrator. For further discussion, call this situation X.
  • If you are in situation X and you restart the data node on B (and you have not fixed the link), then it will start isolated (after some timeouts have kicked in, StartPartitionedTimeout and StartPartialTimeout). Now you are in situation Y.
  • If you had an application writing data (and assume this application crashed when the connection from B to A was lost), then the following can happen in situation Y:
  • A "select count(*) from t1" on the mysqld on host A can differ from a "select count(*) from t1" on the mysqld on host B, because the data node on B never got the message to flush the redo buffer to the redo log.
    Now you have ended up in a very bad situation. Kill the data node on host B, fix the network, and restart the data node with --initial.
  • However, this could be combined with STONITH to kill off one of the data nodes.
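To make the options concrete, here is a rough sketch of a config.ini for this two-host layout, with both management servers as arbitrator candidates. Host names and NodeIds are placeholders, not values from the post:

```ini
# Sketch only: two-host layout with ArbitrationRank=1 (Option 1).
# One of the two management servers is elected arbitrator at runtime.
[ndbd default]
NoOfReplicas=2

[ndb_mgmd]
NodeId=1
HostName=host_a
ArbitrationRank=1

[ndb_mgmd]
NodeId=2
HostName=host_b
ArbitrationRank=1

[ndbd]
NodeId=3
HostName=host_a

[ndbd]
NodeId=4
HostName=host_b

[mysqld]
HostName=host_a

[mysqld]
HostName=host_b
```

For Option 2 and Option 3 the only change would be ArbitrationRank=0 on both [ndb_mgmd] sections.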
Option 2: ArbitrationRank=0, no NIC bonding
  • no arbitrators are used - no protection from split brain.
  • if host A loses contact with host B, then the data nodes will be split
  • Applications can write to A and B, but e.g. A will _never_ see the changes from B, and vice versa
  • The data nodes will drift apart and be out of sync (inconsistent). Very bad.
  • When the link comes up again the data nodes A and B will not reconnect to each other.
  • System administrator must kill one of the data nodes and restart it with --initial.

Option 3: ArbitrationRank=0, NIC bonding + redundant switch.
  • no arbitrators are used - no protection from split brain.
  • Same problems as Option 2) if a) both network links are broken or b) both switches crash. However, the likelihood of this happening is very small.
Option 4: Arbitration=WaitExternal (new in MySQL Cluster 7.0.7)
  • External arbitrator is used: You, a sys admin, a dba, or a process.
  • When the data nodes lose contact with each other, they stay read-only for ArbitrationDelay before unlocking (and becoming writable again).
  • This means that you have ArbitrationDelay (configurable) to decide which data node should live and which should die.
  • The data node that you killed can then be restarted with "ndbd", i.e, without --initial.
  • "You" in this case can be an external program, sysadmin etc..
  • If you use an external program, it must run on a third computer.
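A sketch of how Option 4 could be configured. Arbitration=WaitExternal is set in the [ndbd default] section; I am assuming here that the wait window is the one controlled by the arbitration timeout setting in the same section, so check the 7.0.7 documentation for the exact parameter name and unit:

```ini
# Sketch only: external arbitration (Option 4, MySQL Cluster 7.0.7+).
[ndbd default]
NoOfReplicas=2
Arbitration=WaitExternal
# Assumed: window (in ms) during which the nodes stay read-only,
# giving you or your script time to kill one side.
ArbitrationTimeout=5000
```

During that window an external script (or you) must decide which data node survives and kill the other; the killed node can later be restarted with plain "ndbd", without --initial.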
Option 5: ArbitrationRank=1 and have ndb_mgmd on remote site
  • If you use asynchronous replication between SITE_A and SITE_B, then it could be possible to have the management server there with ArbitrationRank=1, and then potentially local management servers on hosts A and B that only have configuration data (i.e., ArbitrationRank=0 for those).
  • This must be tested to see whether the link between SITE_A and SITE_B is good enough; the link must also be reliable.
Option 6: ArbitrationRank=1 and have a third host (called C)
  • This is the best and recommended approach (even better if you also have a computer D that mirrors C, for redundancy).
  • Run the management server on host C. Setup would look like:
    A: ndbd + mysqld
    B: ndbd + mysqld
    C: ndb_mgmd (ArbitrationRank=1) + mysqld
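The setup above could be sketched in config.ini roughly as follows (host names are placeholders):

```ini
# Sketch only: recommended three-host setup (Option 6).
# The single management server on host C acts as arbitrator.
[ndbd default]
NoOfReplicas=2

[ndb_mgmd]
HostName=host_c
ArbitrationRank=1

[ndbd]
HostName=host_a

[ndbd]
HostName=host_b

[mysqld]
HostName=host_a

[mysqld]
HostName=host_b

[mysqld]
HostName=host_c
```

If A and B lose contact with each other, whichever data node can still reach C wins the arbitration and the other shuts down, so no split brain can occur.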
And the winner is...

What is the "best" option:
  1. Option 6) - get a third computer - simple
  2. Option 3), Option 4) - "you do it, but be careful"
  3. Option 5) - must be tested
  4. Option 1) (at least the cluster will crash, avoiding inconsistencies)
  5. Option 2)
I am arguing with myself whether Option 1) should be in 2nd or 3rd place; at least it is simple and easy to understand.

Of course redundant NICs (bonding) and redundant switches can be used in all of the options above, even if it is not explicitly stated for some of them.
