Monday, September 24, 2012

garbd - How to avoid network partitioning or split brain in a Galera Cluster

Network partitioning (aka split brain) is something you don’t want to happen, but what is the problem really?

Sometimes, users set up a Galera Cluster (or Percona XtraDB Cluster) containing an even number of nodes, e.g., four or six nodes, and then place half of the nodes in one data center and the other half in another data center. If the network link goes down between the two data centers, then the Galera cluster is partitioned and half of the nodes cannot communicate with the other half. Galera’s quorum algorithm will not be able to select the ‘primary component’, since both sets have the same number of nodes and neither holds a majority. You then end up with two sets of nodes that are ‘non-primary’, and effectively, none of the nodes would be available for database transactions.
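To make the arithmetic concrete, here is a minimal sketch of a strict-majority check. It assumes every node carries the same vote weight and that nothing left the cluster gracefully before the split; `has_quorum` is a hypothetical helper for illustration, not a Galera API.

```python
def has_quorum(component_size: int, previous_size: int) -> bool:
    """Return True if a component holds a strict majority of the previous membership.

    Simplifying assumption: all members have equal vote weight.
    """
    return 2 * component_size > previous_size


# Four nodes split evenly across two data centers.
cluster_size = 4
dc1_nodes, dc2_nodes = 2, 2

print(has_quorum(dc1_nodes, cluster_size))  # False: 2 is not more than half of 4
print(has_quorum(dc2_nodes, cluster_size))  # False: same on the other side
```

With an even split, exactly half is never "more than half", so both sides end up non-primary.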

Below is a screenshot of ClusterControl for Galera. I have simulated a network partition by simply killing two of the nodes at the same time.

Network partitioning / split brain
Half of the nodes got disconnected from the other nodes. The nodes staying up cannot determine the state of the other nodes, e.g., whether they are really down (dead process) or whether only the network link went down and the processes are still up. Continuing would then be unsafe: the two separated halves could drift apart, and you would have an inconsistent database.

At this point, our 2 nodes are pretty useless; at least one node should be in state ‘synced’ before it can be accessed. ClusterControl will address this situation by first recovering the first two nodes in ‘non-Primary’ state and setting them to Primary. It will then resync the two crashed nodes when they are back.

In order to avoid this, you can install an arbitrator (garbd). garbd is a stateless daemon which acts as a lightweight group member. When half of the nodes become unavailable, garbd's extra vote gives the surviving half a majority, which avoids a split brain.
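If you start garbd by hand rather than through ClusterControl, the invocation looks roughly like the command assembled below. This is only a sketch: the cluster name and node addresses are made-up placeholders, `garbd_command` is a hypothetical helper, and the group name has to match the `wsrep_cluster_name` of your cluster.

```python
import shlex


def garbd_command(group: str, node_addresses: list[str], daemonize: bool = True) -> list[str]:
    """Build an argv list for garbd from a cluster name and a list of node addresses."""
    address = "gcomm://" + ",".join(node_addresses)
    argv = ["garbd", "--group", group, "--address", address]
    if daemonize:
        argv.append("--daemon")  # run in the background
    return argv


# Placeholder values; 4567 is the default Galera replication port.
argv = garbd_command("my_galera_cluster", ["10.0.0.1:4567", "10.0.0.2:4567"])
print(shlex.join(argv))
```

The script only prints the command; it does not start anything.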

In ClusterControl, garbd is a managed process (it will be restarted automatically in case of failure) and can easily be installed from the deployment package:


Now that we have installed garbd in our 4-node test cluster, the cluster consists of 1+4 members: garbd plus the four database nodes. Killing two nodes will not bring down the whole cluster, as the remaining members (three out of five) will have the majority; see the picture below. The minority will be excluded from the cluster, and when they are up and ready to rejoin, they will go through a recovery protocol before being accepted into the cluster.

Majority win - garbd + two survivors > two failed nodes
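The same comparison can be sketched in a few lines of code. This assumes equal vote weights for all members, garbd included; the member names and the `primary_component` helper are made up for illustration and are not part of Galera.

```python
from typing import Optional


def primary_component(partitions: list[list[str]]) -> Optional[list[str]]:
    """Return the partition holding a strict majority of all members, or None."""
    total = sum(len(members) for members in partitions)
    for members in partitions:
        if 2 * len(members) > total:
            return members
    return None


without_garbd = [["node1", "node2"], ["node3", "node4"]]
with_garbd = [["garbd", "node1", "node2"], ["node3", "node4"]]

print(primary_component(without_garbd))  # None
print(primary_component(with_garbd))     # ['garbd', 'node1', 'node2']
```

Without garbd, no side reaches a majority; with garbd, the side that can still reach it has three of five votes.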
Hopefully this gives an idea of how and why garbd comes in handy.


Hector said...

Thanks for your great blog post. I just have a question: Where do you install garbd? Datacenter 1, Datacenter 2 or in a 3rd Datacenter?

Johan Andersson said...

Hi Hector,

It depends on...

There is a majority-win rule.
Ideally, you should put it on a site where no Galera node runs, e.g., on DC4 if you already have one Galera node on each of DC1, DC2, and DC3, but that may be hard.

However, many of our users and customers have one data center that they are physically close to, which is effectively their "primary" data center. They then put X garbds in that "primary" data center, so that if DC2 _and_ DC3 go, DC1 still has the majority of nodes. In most cases the problem is that network connectivity goes down between the data centers; it is rare that two DCs (in a three-DC setup) break down at the same time due to power failure.

I hope this helps a little bit.
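As a rough illustration of how large X needs to be, the sketch below computes the smallest number of garbds that lets the "primary" data center keep a strict majority on its own. It assumes equal vote weights and made-up node counts; `min_arbitrators` is a hypothetical helper, not something shipped with Galera.

```python
def min_arbitrators(nodes_per_dc: dict[str, int], primary_dc: str) -> int:
    """Smallest number of garbds in primary_dc so that it alone holds a strict majority."""
    local = nodes_per_dc[primary_dc]
    remote = sum(nodes_per_dc.values()) - local
    # Strict majority: 2 * (local + x) > local + remote + x, i.e. x > remote - local.
    return max(0, remote - local + 1)


# One Galera node in each of three data centers.
print(min_arbitrators({"DC1": 1, "DC2": 1, "DC3": 1}, "DC1"))  # 2
```

With two garbds in DC1 there are five votes in total and DC1 holds three of them. By the same arithmetic, DC2 and DC3 together hold only two of five, so they cannot form a primary component if DC1 itself is lost.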