Suddenly your application starts throwing "error 157" and performance degrades or disappears entirely. It is easy to panic then and try all sorts of actions to get past the problem. We have seen several users doing:
- rolling restart
- stop cluster / start cluster
because they also see this in the error logs:
120828 13:15:11 [Warning] NDB: Could not acquire global schema lock (4009)Cluster Failure
That is not really a nice error message. To begin with, it is a WARNING when something is obviously wrong; IMHO, it should be CRITICAL. Secondly, the message ‘Cluster Failure’ is misleading. The cluster may not really have failed, so there is no point trying to restart it before we know more.
So what does error 157 mean and what can we do about it?
By using perror we can get a hint what it means:
$ perror 157
MySQL error code 157: Could not connect to storage engine
What this means is simply what it says: the SQL node is not connected to the NDB storage engine. It can be that you have messed up the ndb-connectstring in the my.cnf of that SQL node.
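If so, the fix is simply to correct the connect string. Here is a minimal sketch of the relevant my.cnf section on the SQL node; the management server hostname and port are placeholders for your own setup:

[mysqld]
ndbcluster
# must point at your management node(s); 1186 is the default management port
ndb-connectstring=mgmt-host:1186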
It is often the case that someone has been doing ALTER TABLEs, and for some reason, the schema files have been corrupted or are of the wrong version on one or more SQL nodes. It would make sense to check the SQL nodes for potential problems there.
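One way to sanity-check a suspect SQL node is to compare the schema files it has on disk with the tables the data nodes actually know about. This sketch assumes the default datadir /var/lib/mysql, a database called mydatabase, and mgmt-host standing in for your management server:

cd /var/lib/mysql/mydatabase/
ls -l *.frm *.ndb               # schema files this SQL node has locally
ndb_show_tables -c mgmt-host    # tables according to the data nodes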
If you have multiple SQL nodes, you probably have the SQL nodes behind a load balancer. Which MySQL server is really causing this problem? One of them, or all of them?
Let’s take a look from one of the SQL nodes:
mysql> show global status like 'ndb_number_of%';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| Ndb_number_of_data_nodes       | 2     |
| Ndb_number_of_ready_data_nodes | 2     |
+--------------------------------+-------+
2 rows in set (0.00 sec)
This is perfect: we have two Data nodes, and two of them are ready. This means that this particular SQL node is connected to the NDB storage engine. Now, let's look at another one:
mysql> show global status like 'ndb_number_of%';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| Ndb_number_of_data_nodes       | 0     |
| Ndb_number_of_ready_data_nodes | 0     |
+--------------------------------+-------+
2 rows in set (0.00 sec)
This does not look good: this SQL node is not connected to the Data nodes. Now iterate over the remaining 62 MySQL servers, or however many you have, to find the bad apples. However, I personally prefer to spot the faulty ones like this:
[Screenshot from Severalnines ClusterControl]
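If you do not have ClusterControl at hand, a rough shell loop over the SQL nodes gives a similar overview. The hostnames and credentials below are placeholders; adjust them for your own environment:

for host in sql1 sql2 sql3; do
  echo -n "$host: "
  # a node reporting 0 ready data nodes is disconnected from NDB
  mysql -h "$host" -u root -p -N \
    -e "show global status like 'Ndb_number_of_ready_data_nodes';"
done

Any node that reports 0 ready data nodes is a candidate for the fix below.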
Now, we can stop the SQL nodes that are marked as "red":
1. Stop the MySQL server:
service mysql stop
2. Go into the database(s) you have in the datadir of the MySQL server:
cd /var/lib/mysql/mydatabase/
3. Remove the .frm files and .ndb files:
for x in `ls *.ndb`; do rm `basename $x .ndb`.frm; rm `basename $x`; done
4. Start the MySQL server:
service mysql start
Is step 3) safe to do?
Yes, the .frm files of the NDB tables are stored in the data dictionary inside the data nodes. By removing these files and restarting the MySQL server, the MySQL server will fetch the missing .frm files from the data nodes, and you will end up with fresh ones.
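To verify that a repaired node is healthy again after step 4, run the same status check as above and list the tables; mydatabase is the example database name from step 2:

mysql> show global status like 'ndb_number_of%';
mysql> show tables in mydatabase;

Both counters should now match the number of data nodes in your cluster, and fresh .frm files for the NDB tables will have been written back to the datadir.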
Hope this helps someone,
Johan