Wednesday, August 29, 2012

MySQL Cluster: Troubleshooting Error 157 / 4009 Cluster Failure


Suddenly your application starts throwing "error 157" and performance degrades or is non-existent. It is easy to panic at that point and try all sorts of actions to get past the problem. We have seen several users resort to:
  • rolling restart
  • stop cluster / start cluster
because they also see this in the error logs:
120828 13:15:11 [Warning] NDB: Could not acquire global schema lock (4009)Cluster Failure

That is not really a nice error message. To begin with, it is logged as a WARNING when something is obviously wrong; IMHO, it should be CRITICAL. Secondly, the message ‘Cluster Failure’ is misleading: the cluster may not have failed at all, so there is no point in restarting it before we know more.
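As a first step, it is worth checking which SQL nodes are actually logging this warning, for example with a quick grep. The log path below is an assumption; check the log_error setting in your my.cnf:

# Count how often the schema-lock warning appears on this SQL node
grep -c "Could not acquire global schema lock" /var/log/mysql/error.log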

So what does error 157 mean and what can we do about it?

By using perror we can get a hint of what it means:

$ perror 157
MySQL error code 157: Could not connect to storage engine

This means exactly what it says: the SQL node is not connected to the NDB storage engine. It may be that the ndb-connectstring in the my.cnf of that SQL node is wrong.
It is often the case that someone has been running ALTER TABLEs and, for some reason, the schema files have become corrupted or are of the wrong version on one or more SQL nodes. It makes sense to check the SQL nodes for potential problems there.
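Two quick checks on a suspect SQL node are sketched below: verify the connectstring it was given, and ask the management node whether it sees the SQL node as connected. The config path and the management host mgmhost:1186 are assumptions; adjust for your setup:

# Does this SQL node point at the right management server?
grep -i ndb-connectstring /etc/my.cnf

# List connected data, management and API/SQL nodes
ndb_mgm -c mgmhost:1186 -e show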

If you have multiple SQL nodes, they probably sit behind a load balancer. Which MySQL server is really causing the problem? One of them, or all of them?
Let’s take a look from one of the SQL nodes:

mysql> show global status like 'ndb_number_of%';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| Ndb_number_of_data_nodes       | 2     |
| Ndb_number_of_ready_data_nodes | 2     |
+--------------------------------+-------+
2 rows in set (0.00 sec)

This is perfect: we have two data nodes, and both of them are ready. This means that this particular SQL node is connected to the NDB storage engine. Now, let's look at another one:

mysql> show global status like 'ndb_number_of%';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| Ndb_number_of_data_nodes       | 0     |
| Ndb_number_of_ready_data_nodes | 0     |
+--------------------------------+-------+
2 rows in set (0.00 sec)

This does not look good: this SQL node is not connected to the data nodes. Now iterate over the remaining 62 MySQL servers, or however many you have, to find the bad apples (a small shell loop can help; see the sketch after the screenshot). Personally, I prefer to spot the faulty ones like this:

Screenshot from Severalnines ClusterControl
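If you do not have such a dashboard, a small shell loop does the triage just as well. This is only a sketch: the hostnames sql1 through sql3 and the credentials are placeholders, and it assumes the mysql client can reach every SQL node:

# Placeholder hosts and credentials -- substitute your own.
for h in sql1 sql2 sql3; do
  ready=$(mysql -h "$h" -u root -psecret -N -B \
    -e "SHOW GLOBAL STATUS LIKE 'Ndb_number_of_ready_data_nodes'" | awk '{print $2}')
  echo "$h: $ready ready data node(s)"
done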

Now, we can fix the SQL nodes that are marked as "red":
1. Stop the MySQL server:
service mysql stop
2. Go into the database directory(ies) in the datadir of the MySQL server:
cd /var/lib/mysql/mydatabase/
3. Remove the .frm files and .ndb files:
for x in *.ndb; do rm "${x%.ndb}.frm" "$x"; done
4. Start the MySQL server:
service mysql start
Is step 3) safe to do?
Yes. The schema of NDB tables is stored in the data dictionary inside the data nodes. By removing the local copies and restarting the MySQL server, the MySQL server will fetch the missing table definitions from the data nodes, and you will end up with fresh .frm files.
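To verify the fix on each node you cleaned, you can re-run the status check and list the tables; the database name mydatabase is just the example from above:

# The SQL node should report ready data nodes again
mysql -e "SHOW GLOBAL STATUS LIKE 'ndb_number_of%'"

# The NDB tables should be visible again with fresh .frm files
mysql -e "SHOW TABLES IN mydatabase"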

Hope this helps someone,
Johan




