Thursday, December 31, 2009

Cluster Monitoring

A new version of CMON - the monitor for MySQL Cluster - has been released. Many thanks to everyone who helped me with bug fixing and beta testing!

CMON monitors all events and information that you can get from MySQL Cluster.

Some of the things that CMON monitors:
  • Cluster status (starting, started, stopped, recovering, degraded etc)
  • Data node status (started, stopped, starting etc)
  • Application (mysqld) node status (connected/disconnected)
  • DataMemory/IndexMemory/Tablespace utilization
  • Statistics from data nodes
  • Statistics from mysql servers
  • Replication support (basic, currently) - picks up if a mysql server is a slave and checks whether replication is running
  • Alarm generation
There are pre-built binaries for Linux 64-bit (I have not yet gotten around to building Debian and RPM packages), as well as source code. It has been tested on Fedora 9, Ubuntu 9.04, and CentOS 4/5, but I have not had the opportunity to test on RedHat/SLES/Debian yet, so it would be interesting to know if there are any issues there. There are also installation instructions and a deployment guide (PDF).

CMON uses the Management API to collect information from MySQL Cluster and puts this information in a MySQL database, so it can be queried using SQL. This makes it very easy to write scripts around it and there are some sample scripts on the CMON web page.
# Get the status of Cluster for cluster with clusterid=1:
[root@centos4-1 bin]# ./get_cluster_status.sh --clusterid=1
STARTED

# Get the list of node ids for type NDBD so that we can iterate over them easily.
[root@centos4-1 bin]# ./get_node_ids.sh --clusterid=1 -t NDBD
3 4

# Get the status of the node 3
[root@centos4-1 bin]# ./get_node_status.sh --clusterid=1 --nodeid=3
STARTED
If you want you can also view the Cluster from a web interface (requires PHP), and you can get graphs using RRD.

Installation is very simple: just run the cmon_install.sh script that is part of the package; it will ask you a number of questions and write a configuration file.
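Since the sample scripts print plain values, they are easy to compose. Here is a small sketch of a wrapper around them (the script names and flags are taken from the examples above; the wrapper itself is my sketch, not part of the CMON package):

```python
# Hypothetical wrapper around the CMON sample scripts shown above.
import subprocess

def parse_node_ids(output: str) -> list[int]:
    """Turn the space-separated output of get_node_ids.sh into a list."""
    return [int(tok) for tok in output.split()]

def node_statuses(clusterid: int = 1, node_type: str = "NDBD") -> dict[int, str]:
    """Ask the sample scripts for the status of every node of a given type."""
    out = subprocess.check_output(
        ["./get_node_ids.sh", f"--clusterid={clusterid}", "-t", node_type],
        text=True)
    return {nid: subprocess.check_output(
                ["./get_node_status.sh", f"--clusterid={clusterid}",
                 f"--nodeid={nid}"], text=True).strip()
            for nid in parse_node_ids(out)}

print(parse_node_ids("3 4"))  # the example output above -> [3, 4]
```

The same iteration could of course be done in plain shell with a `for` loop over the output of get_node_ids.sh.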

Friday, November 27, 2009

Node failure handling - take 2

Jonas improved the node failure handling in MySQL Cluster 7.0.9 (and 6.3.29), so here comes a re-run of a previous blog post. Node failure handling time has improved a lot - see below.

I created 11345 tables in total, each with 128 columns, and then hit this bug.

When all tables were created, I stopped one node and measured how long the other node took to perform the node failure handling. Here is what was written to the cluster log (note the first and last timestamps):

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Node 3 Disconnected

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Network partitioning - arbitration required

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: President restarts arbitration thread [state=7]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 closed

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Arbitration won - positive reply from node 1

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: GCP Take over started

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: Node 4 taking over as DICT master

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: GCP Take over completed

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: kk: 9670/4 0 0

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: LCP Take over started

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: LCP Take over completed (state = 5)

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 3: Node shutdown completed.

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected

2009-11-27 13:39:22 [MgmtSrvr] INFO -- Node 4: Started arbitrator node 1 [ticket=2a74000295047d39]

2009-11-27 13:40:20 [MgmtSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 1 min. - state = 6

2009-11-27 13:41:03 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 opened

Now the node failure handling completed in 1 minute and 42 seconds! That is quite an improvement over the 17 minutes measured before Jonas made his magic (although that was measured with 16000 tables; I will come back with new numbers when the bug mentioned above is fixed).

What does it mean? It means that we can start to recover a failed data node much earlier now!
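Since the two measurements used different table counts, a fair comparison normalizes to tables handled per second (rough arithmetic only; the hardware and the remaining bug make this an estimate, not a benchmark):

```python
# Before the fix: ~16000 tables handled in ~17 minutes (earlier post).
# After the fix (7.0.9): 11345 tables handled in 1 minute 42 seconds.
old_rate = 16000 / (17 * 60)   # ~15.7 tables/second
new_rate = 11345 / 102         # ~111 tables/second
print(round(new_rate / old_rate, 1))  # roughly 7x faster per table
```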

Thursday, November 26, 2009

Scaling - going from 2 to X data nodes

When moving from two data nodes to a bigger cluster, it is not necessarily true that you will get better performance. In fact, performance can get worse.

Here are some things to think about:
  • Database load (traffic to the Cluster) - if a two-node cluster can handle query load X and you move the same load X to a four-data-node cluster, you will likely see new_load_X=0.8X, i.e., a performance degradation. This has to do with 1) buffers not filling up fast enough, so the data nodes do "timeout"-based sending, or 2) access patterns that do not scale. To correct 1) you need to increase the load on the cluster so that the internal communication buffers fill up faster.
Access pattern related "problems":

  • For primary key operations (reads, updates, deletes) you always go straight to the node holding the data, no matter how many nodes you have. So there are no problems with these types of ops.

  • For index scans on a key you will always scan all data nodes. This can be expensive if the number of records you retrieve is "small" (about <=64 records), since setting up the index scan on every data node is expensive, while sending the actual data is relatively cheap. This can be the cause of performance degradation when moving from 2 data nodes to a bigger cluster. There is not much you can do about it, unless you can redesign your schema so the most important requests fit the "index scan on part of the PK" way of accessing data.

  • For index scans on part of the PK there are two options - index scan all data nodes, or index scan only one data node. This can also be the cause of performance degradation when moving from 2 data nodes to a bigger cluster.

Example with index scans on part of PK:
CREATE TABLE t1 (
userid integer,
serviceid integer,
data varbinary(512),
INDEX (serviceid),
PRIMARY KEY(userid, serviceid)
);
  • This table is by default partitioned on the primary key (every table in Cluster is by default partitioned on the primary key). In a four-data-node cluster, this means your user can be spread out over 4 different partitions.
  • You may have this data
<userid, serviceid, data>
<1,1,aaa>
<1,2,aaa>
<1,3,aaa>
<1,4,aaa>

The record <1,1,aaa> may be on one data node. <1,2,aaa> may be on another and so on. It is random.

If you now do a SELECT * FROM t1 WHERE userid=<value>,
then every data node will be scanned for data, and the results are then merged in the NDB API (the mysqld).

Good or bad? It depends: if your user has >64 services (CPU and network dependent), then it is better to scan all data nodes.

If your user has <=64 services then it is better to involve only one data node.

You can then create the table as follows:
CREATE TABLE t1 (
userid integer,
serviceid integer,
data varbinary(512),
INDEX (serviceid),
PRIMARY KEY(userid, serviceid)
) PARTITION BY KEY (userid);
- You may still have the same data:
<userid, serviceid, data>
<1,1,aaa>
<1,2,aaa>
<1,3,aaa>
<1,4,aaa>
All records with userid=1 are now located in one data node (one partition), no matter how many data nodes you have!

Now, the primary key is still <userid, serviceid>, but the data is distributed over the partitions by looking at <userid> only! This ensures that a user will be in only one partition, irrespective of how many services the user has. If you query on <userid>, then only one partition will be involved in the search, since for a particular <userid> in the table t1, all data for that user is in one partition.

If you now do a SELECT * FROM t1 WHERE userid=<value> then only one data node will be scanned.
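The effect of the two CREATE TABLE statements can be sketched in a few lines. This is illustrative only: md5 stands in for NDB's real hash function and the partition count is hypothetical, but the point carries over - hashing the full PK can scatter one user's rows, while hashing userid alone always co-locates them:

```python
# Sketch: why PARTITION BY KEY (userid) keeps a user's rows together.
import hashlib

NUM_PARTITIONS = 4  # e.g. a four-data-node cluster

def partition(*key_parts) -> int:
    digest = hashlib.md5("|".join(map(str, key_parts)).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

rows = [(1, serviceid) for serviceid in (1, 2, 3, 4)]  # <userid, serviceid>

partitions_full_pk = {partition(u, s) for u, s in rows}  # may span several
partitions_by_userid = {partition(u) for u, s in rows}   # always exactly one

print(len(partitions_by_userid))  # 1
```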

For completeness of the example: SELECT * FROM t1 WHERE serviceid=<value> will cause all data nodes to perform an index scan on the index (serviceid), since the data is not partitioned on serviceid and serviceid is not the leftmost column of the primary key (the indexed columns are used left to right).

The main point is that the fixed cost of setting up an ordered index scan is high, but retrieving the records is then cheap. So if you return many rows, read from all data nodes.

If you later plan to increase the number of data nodes, say from 4 to 8, then using PARTITION BY KEY where possible makes such upgrades easier and less susceptible to performance decreases.

This is how it looks on my system: as you can see, partitioned searches (using PARTITION BY KEY) are faster up to ~64 retrieved records.

MySQL Cluster 7.0.9b in Configurator and Sandbox

Configurator and Sandbox scripts are updated to use 7.0.9b.

You are recommended to upgrade!

The upgrade can be performed using these scripts:
Binary distribution - Upgrade script from 7.0.* -> 7.0.9b is here.
Source distribution - Upgrade script from 7.0.* -> 7.0.9b is here.
  1. Copy the scripts to "install/"
  2. chmod u+x upgrade-script.sh
  3. ./upgrade-script.sh
After step 3) you need to run either:
  • download-binary.sh
  • download-and-compile.sh
and then
  • rolling-restart.sh

Friday, October 23, 2009

Cluster Performance Tuning

On the 3rd of November I will hold a webinar on Cluster Performance Tuning.

For more information:

http://www.mysql.com/news-and-events/web-seminars/display-453.html

In addition to what is mentioned in the agenda, I will also present numbers on typical operations, how to design requests, and explain how the optimizer works with MySQL Cluster.

See you there!

-johan

Wednesday, October 07, 2009

MySQL Cluster 7.0.8a and 6.3.27a released (source)

Fixes regression - bug 47844.

Configurator and Sandbox scripts are updated to use this version.

If you build from source then you can upgrade using these scripts:
Upgrade script from 7.0.7 -> 7.0.8a is here.
Upgrade script from 7.0.8 -> 7.0.8a is here.
Upgrade script from 6.3.26-> 6.3.27a is here.

Follow the instructions in this blog post!

Tuesday, October 06, 2009

MySQL Cluster 7.0.8/6.3.27 - serious regression!!!

A regression in MySQL Cluster 7.0.8 and 6.3.27 has caused them to be removed from download.

Problem:
If you have a mysql server with a node id higher than 49, it will cause the cluster to crash.

An updated version of MySQL Cluster 7.0.8, called 7.0.8a will be released very soon.

The Configurator has been reverted back to use 7.0.7 and 6.3.26.

Friday, October 02, 2009

Configuring for large databases in MySQL Cluster

If you need to create a big database into MySQL Cluster with:
  • A lot of tables, indexes, and columns
  • A lot of records
there are a few things to think about:
  • If a table has > ~90M records, you have to create the table with MAX_ROWS=<amount of records in table anticipating growth>:
    CREATE TABLE t1(...) ENGINE=ndbcluster MAX_ROWS=200000000;
    This way the data nodes will allocate more partitions for the table, since there is a limit on how many records can be stored in one partition (the limit is around 90M records).
  • Many tables / table objects --> Make sure you increase MaxNoOfTables (kernel limit is 20320 tables). This creates a table object pool of size MaxNoOfTables.
    • Each table you create will use one table object.
    • Each unique index you create will use one table object
    • Each ordered index you create will use one table object
    • Each BLOB/TEXT attribute will use one table object.

    For example if you create the following table:
    CREATE TABLE `t1` (
    `a` int(11) NOT NULL DEFAULT '0',
    `b` char(32) DEFAULT NULL,
    `c` blob,
    `d` bigint(20) DEFAULT NULL,
    PRIMARY KEY (`a`),
    UNIQUE KEY `b` (`b`),
    KEY `d` (`d`)
    ) ENGINE=ndbcluster;

    The table will require table objects as follows:
    • PRIMARY KEY (there is an ordered index on the PK): 1 table object
    • UNIQUE KEY `b` (unique hash index + ordered index): 1 + 1 table objects
    • KEY `d` (ordered index): 1 table object
    • BLOB column: 1 table object
    • The actual table: 1 table object
    • IN TOTAL = 6 table objects

    If you had specified PRIMARY KEY USING HASH and UNIQUE KEY USING HASH, then you could save two table objects, one for each ordered index.
  • Many UNIQUE indexes --> Increase MaxNoOfUniqueHashIndexes
  • Many ordered indexes --> Increase MaxNoOfOrderedIndexes
  • Many attributes --> Increase MaxNoOfAttributes (should be the sum of the number of columns over all tables, plus 3-10% spare).
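The per-table counting rules above can be turned into a back-of-the-envelope helper for sizing MaxNoOfTables (the function and parameter names are mine, not NDB configuration parameters):

```python
# Rough helper following the counting rules listed above.
def table_objects(unique_keys=0, unique_keys_using_hash=0,
                  ordered_indexes=0, blob_columns=0, pk_using_hash=False):
    count = 1                            # the table itself
    count += 0 if pk_using_hash else 1   # ordered index on the primary key
    count += 2 * unique_keys             # unique hash index + ordered index
    count += unique_keys_using_hash      # unique hash index only
    count += ordered_indexes             # each plain KEY(...) ordered index
    count += blob_columns                # one hidden table per BLOB/TEXT
    return count

# A table with one UNIQUE KEY and one BLOB column (plus the PK ordered index)
print(table_objects(unique_keys=1, blob_columns=1))  # 5

# USING HASH on both the PK and the unique key saves the two ordered indexes
saved = (table_objects(unique_keys=1)
         - table_objects(unique_keys_using_hash=1, pk_using_hash=True))
print(saved)  # 2
```

Summing this over all your tables gives a floor for MaxNoOfTables.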
Also you would probably want to increase:
  • DataMemory
  • IndexMemory
If you have a lot of BLOB/TEXT columns or have many client apps/users connecting to one MySQL Server:
  • SendBufferMemory=8M
  • ReceiveBufferMemory=8M
If you get "error 306" then you should increase
  • StringMemory=25 to StringMemory=50
  • IndexMemory=
(Note: with 7.0.8 I can create 16000+ tables with 128 columns each with StringMemory=25.)
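Pulled together, the parameters discussed in this post might look like this in config.ini (the values are illustrative starting points, not tuned recommendations for your workload):

```ini
[ndbd default]
# Object pools for many tables/indexes/columns
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=1024
MaxNoOfAttributes=100000      # sum of columns over all tables + 3-10% spare
# Memory
DataMemory=4096M
IndexMemory=1024M
StringMemory=50               # bump from 25 if you hit "error 306"

[tcp default]
# For many BLOB/TEXT columns or many clients per MySQL Server
SendBufferMemory=8M
ReceiveBufferMemory=8M
```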

This page will be updated if I find other things.

If you still have problems - let us know, e.g., on the cluster mailing list!

The Configurator has these parameters specified already in the config.ini so it is a small thing to change (just run the ./rolling-restart.sh script).

Wednesday, September 30, 2009

Upgrade to 7.0.8 (with the Configurator)

MySQL Cluster 7.0.8 was released as a source distribution 30th of Sept 2009. You should upgrade if you can build from source or use the Configurator.

See below how to upgrade when using the Configurator.

7.0.8 contains a number of fixes in almost every area, and you are recommended to upgrade if you experience problems with earlier versions - especially if you have encountered the "error 306" problem.

The Configurator v2.9 has been updated to use this version.

If you are already using the Configurator and build from source you can upgrade from MySQL Cluster 7.0.7 to 7.0.8 (and you are recommended to upgrade because of the following changes/fixes) in four steps. Here is how:

1. run the upgrade-707-to-708-src.sh script (put it in install/ and chmod u+x ./upgrade-707-to-708-src.sh):
[cluster01]# pwd
/root/mysqlcluster-70-master/cluster/scripts/install
[cluster01]# chmod u+x ./upgrade-707-to-708-src.sh
[cluster01]# ./upgrade-707-to-708-src.sh
Upgrading scripts
done - now run ./download-and-compile.sh
2. run the script download-and-compile.sh
[cluster01]# ./download-and-compile.sh
After some time it will finish compiling:

mysql-7.0.8-linux-x86_64/sql-bench/test-alter-table
mysql-7.0.8-linux-x86_64.tar.gz created
Removing temporary directory
Downloaded binary distribution to ../../repo
(the 'downloaded binary' should really read 'Copied binary').

3. When you get prompted with the following question answer 'y' (yes):
Do you want to install the binaries on the hosts now
(else run 'install-cluster.sh' later)? (y/n):y
Important! Do not run the bootstrap.sh script!!
4. change directory to the scripts/ directory and run the ./rolling-restart.sh script
[cluster01]# pwd
/root/mysqlcluster-70-master/cluster/scripts/
[cluster01]# ./rolling-restart.sh
This is an online procedure, so it requires no downtime for the Cluster.
If you are using binary distributions then you have to wait, because they are not ready yet.
Good luck!

Ps - Configurator 2.9 (released 2nd of September 2009) contains some minor fixes:
  • Adding of [mysqld] slot for administration purposes
  • ./start-backup now backs up the configuration files (config.ini, my.cnf and some internal files) as well.

Wednesday, September 16, 2009

Node failure handling

When a data node fails, the other data nodes must perform clean-up of the failed node.

For example, transactions in the PREPARE phase (Cluster uses a variant of the two-phase commit protocol) that were started on the failed data node must be aborted.
Various protocols must also fail over (if the failed data node was the master).

These things are carried out quite swiftly (configurable, but a couple of seconds by default; it also depends on the size of the transactions in the prepare phase that need to be aborted).

I wanted to check how long node failure handling takes with many tables (16000+) in the cluster, because in a "normal" database with a couple of hundred tables, the time spent on node failure handling is not a factor.

In this case I have two data nodes:
  • Node 3 (master)
  • Node 4
I have created 16000+ tables each with 128 columns just to see what happens with failure handling and recovery.

Node 3 was killed, and the Node 4 then has to do node failure handling:
2009-09-16 14:23:41 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 15 min. - state = 6

2009-09-16 14:24:41 [MgmSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 16 min. - state = 6

2009-09-16 14:25:41 [MgmSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 17 min. - state = 6

2009-09-16 14:26:13 [MgmSrvr] INFO -- Node 4: Communication to Node 3 opened
Eventually, we get the message :
2009-09-16 14:26:13 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
We cannot start to recover Node 3 until this message has been written out in the cluster log!

If you try to connect the data node before this, then the management server will say something like "No free slot found for node 3", which basically means in this case that the node failure handling is ongoing.

Now that we have got the message, we can start data node 3 again and it will recover.

Why does it take so long?
  • The surviving data nodes handling the failure must record, for each table, that node 3 has crashed, and this is written to disk. This is what takes time when you have many tables: about 17 minutes with >16000 tables, i.e., about 1000 tables per minute. So in most databases it does not take that much time. Thanks Jonas for your explanation.
However, it is important for you to understand that during this time it is STILL POSSIBLE to WRITE / READ to the Cluster, i.e., the database is online.

Thursday, September 03, 2009

MySQL Cluster on two hosts - options and implications

Consider that you have the following hosts, with MySQL Cluster deployed in the following way:

A: ndb_mgmd, ndbd , mysqld
B: ndb_mgmd, ndbd , mysqld

Now you have a couple of options for how to handle the potential network partitioning/split brain that could occur if computers A and B lose contact with each other. Let's see the implications of these options.

Option 1: ArbitrationRank=1, no NIC bonding
  • If ndb_mgmd on A is elected as the arbitrator and host A crashes, then the data node on host B will also die (it cannot reach the arbitrator).
  • If ndb_mgmd on A is elected as the arbitrator and host B loses contact with host A, then the data node on B will shut down since it cannot reach the arbitrator. For further discussion, call this situation X.
  • If you are in situation X and you restart the data node on B (and you have not fixed the link), then it will start isolated (after some timeouts have kicked in, StartPartitionedTimeout and StartPartialTimeout). Now you are in situation Y.
  • If you had an application writing data (and assume this application crashed when the connection from B -> A was lost), then the following can happen when you are in situation Y:
  • If you do "select count(*) from t1" on the mysqld on host A, the result can differ from "select count(*) from t1" on the mysqld on host B, because the data node on B never got the message to flush the redo buffer to the redo log.
    You have now ended up in a very bad situation. Kill the data node on host B, fix the network, and restart the data node with --initial.
  • However, this could be combined with STONITH to kill off one of the data nodes.
Option 2: ArbitrationRank=0, no NIC bonding
  • No arbitrator is used - no protection from split brain.
  • If host A loses contact with host B, then the data nodes will be split.
  • Applications can write to A and B, but A will _never_ see the changes from B, and vice versa.
  • The data nodes will drift apart and be out of sync (inconsistent). Very bad.
  • When the link comes up again the data nodes A and B will not reconnect to each other.
  • System administrator must kill one of the data nodes and restart it with --initial.

Option 3: ArbitrationRank=0, NIC bonding + redundant switch.
  • No arbitrator is used - no protection from split brain.
  • Same problems as Option 2) if a) both network links are broken or b) both switches crash. However, the likelihood of this happening is very small.
Option 4: Arbitration=WaitExternal (new in MySQL Cluster 7.0.7)
  • External arbitrator is used: You, a sys admin, a dba, or a process.
  • When the data nodes lose contact with each other, they are read-only for ArbitrationDelay before unlocking (and becoming writable).
  • This means that you have ArbitrationDelay (configurable) time to decide which data node should live and which should die.
  • The data node that you killed can then be restarted with "ndbd", i.e, without --initial.
  • "You" in this case can be an external program, sysadmin etc..
  • If you use an external program, it must run on a third computer.
Option 5: ArbitrationRank=1 and have ndb_mgmd on remote site
  • If you use asynchronous replication between SITE_A and SITE_B, then it could be possible to have the management server there with ArbitrationRank=1, and potentially local management servers on hosts A and B that only hold configuration data (i.e., ArbitrationRank=0 for those).
  • This must be tested to see if the link between SITE_A and SITE_B is "good enough". The link should also be reliable.
Option 6: ArbitrationRank=1 and have a third host (called C)
  • This is the best and recommended approach (and even better if you have a computer D that mirrors C, for redundancy).
  • Run the management server on host C. Setup would look like:
    A: ndbd + mysqld
    B: ndbd + mysqld
    C: ndb_mgmd (ArbitrationRank=1) + mysqld
And the winner is...

What is the "best" option:
  1. Option 6) - get a third computer - simple
  2. Option 3), Option 4) - "you do it, but be careful"
  3. Option 5) - must be tested
  4. Option 1) (at least the cluster will crash, avoiding inconsistencies)
  5. Option 2)
I am arguing with myself whether Option 1) should be in 2nd or 3rd place; at least it is simple and easy to understand.

Of course, redundant NICs (bonding) and redundant switches can be used in all of the options above, even if it is not explicitly said for some of them.
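A sketch of what Option 6 could look like in config.ini (hostnames are placeholders; only the layout-relevant settings are shown):

```ini
# Host C is the arbitrator; A and B run the data nodes and mysqlds.
[ndb_mgmd]
HostName=hostC
ArbitrationRank=1

[ndbd]
HostName=hostA

[ndbd]
HostName=hostB

[mysqld]
HostName=hostA

[mysqld]
HostName=hostB

[mysqld]
HostName=hostC
```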

Upgrade to 7.0.7 (with the Configurator)

MySQL Cluster 7.0.7 was released as a source distribution 1st of Sept 2009. You should upgrade if you can build from source or use the Configurator.

The Configurator v2.9 has been updated to use this version.

If you are already using the Configurator and build from source you can upgrade from MySQL Cluster 7.0.6 to 7.0.7 (and you are recommended to upgrade because of the following changes/fixes) in four steps. Here is how:

1. run the upgrade-706-to-707-src.sh script (put it in install/ and chmod u+x ./upgrade-706-to-707-src.sh):
[cluster01]# pwd
/root/mysqlcluster-70-master/cluster/scripts/install
[cluster01]# chmod u+x ./upgrade-706-to-707-src.sh
[cluster01]# ./upgrade-706-to-707-src.sh
Upgrading scripts
done - now run ./download-and-compile.sh
2. run the script download-and-compile.sh
[cluster01]# ./download-and-compile.sh
After some time it will finish compiling:

mysql-7.0.7-linux-x86_64/sql-bench/test-alter-table
mysql-7.0.7-linux-x86_64.tar.gz created
Removing temporary directory
Downloaded binary distribution to ../../repo
(the 'downloaded binary' should really read 'Copied binary').

3. When you get prompted with the following question answer 'y' (yes):
Do you want to install the binaries on the hosts now
(else run 'install-cluster.sh' later)? (y/n):y
Important! Do not run the bootstrap.sh script!!
4. change directory to the scripts/ directory and run the ./rolling-restart.sh script
[cluster01]# pwd
/root/mysqlcluster-70-master/cluster/scripts/
[cluster01]# ./rolling-restart.sh
This is an online procedure, so it requires no downtime for the Cluster.
If you are using binary distributions then you have to wait, because they are not ready yet.
Good luck!

Ps - Configurator 2.9 (released 2nd of September 2009) contains some minor fixes:
  • Adding of [mysqld] slot for administration purposes
  • ./start-backup now backs up the configuration files (config.ini, my.cnf and some internal files) as well.

Monday, August 31, 2009

Data node stuck in phase 101 - what to do?

Sometimes I have seen (and heard about) a data node getting stuck in start phase 101.

Unfortunately it is difficult to reproduce this (I have found no way yet), so no bug fix is in the pipe yet.

What happens is that in start phase 101, the starting data node (actually a block called SUMA) should reconnect to the mysql servers and take over the event handling (sending events to the mysql servers), but the reconnect to the mysql server(s) never happens. A better explanation is here :)

If you see your data node stuck here then try the following:
1) Restart the mysql servers (one by one), the data node should now start
or
2) Restart the mysql servers and restart the data node.

Only do 2) if 1) does not work.

And if you know how to reproduce - let us know!

Tuesday, August 04, 2009

OPTIMIZE TABLE on Cluster (revisited)

Jonas just wrote a patch for this bug on OPTIMIZE TABLE, an issue that was also discussed in this blog post. Jonas also fixed this bug while he was at it.

Before, OPTIMIZE TABLE hardly freed up any pages and to defragment you had to do a rolling restart of the data nodes.

Now, there is only a 2% discrepancy between OPTIMIZE TABLE and doing a rolling restart. This is great stuff.
This fix will make it into 6.3.26 and 7.0.7.

See below for details:

Creating two tables, t5 and t6:

CREATE TABLE `t5` (
`id` varchar(32) CHARACTER SET utf8 NOT NULL,
`name` varchar(32) NOT NULL DEFAULT '',
`unit` int(11) DEFAULT NULL,
PRIMARY KEY (`id`,`name`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1

And then the following table:

CREATE TABLE `t6` (
`id` varchar(32) CHARACTER SET utf8 NOT NULL,
`name` varchar(32) NOT NULL DEFAULT '',
`c0` int unsigned NOT NULL,
`c1` int unsigned NOT NULL,
`data2` varchar(255),
`unit` int(11) DEFAULT NULL,
PRIMARY KEY (`id`,`name`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1;


Then we use hugoLoad (a test program that is not built by default in the source distribution of Cluster) to load each table with 1M records.

./hugoLoad -r 1000000 -d test t5
./hugoLoad -r 1000000 -d test t6


Memory usage after population:

Node 2: Data usage is 45%(14890 32K pages of total 32768)
Node 2: Index usage is 22%(3722 8K pages of total 16416)
Node 3: Data usage is 45%(14890 32K pages of total 32768)
Node 3: Index usage is 22%(3722 8K pages of total 16416)


Delete 500K records from t5.


mysql> set ndb_use_transactions=0;
Query OK, 0 rows affected (0.00 sec)

mysql> delete from t5 limit 100000;
Query OK, 100000 rows affected (2.73 sec)

mysql> delete from t5 limit 100000;
Query OK, 100000 rows affected (3.13 sec)

mysql> delete from t5 limit 100000;
Query OK, 100000 rows affected (3.11 sec)

mysql> delete from t5 limit 100000;
Query OK, 100000 rows affected (3.13 sec)

mysql> delete from t5 limit 100000;
Query OK, 100000 rows affected (2.97 sec)


Memory usage after deletion:


Node 2: Data usage is 45%(14890 32K pages of total 32768)
Node 2: Index usage is 19%(3120 8K pages of total 16416)
ndb_mgm> Node 3: Data usage is 45%(14890 32K pages of total 32768)
Node 3: Index usage is 19%(3121 8K pages of total 16416)


Run OPTIMIZE TABLE:


mysql> optimize table t5;
+---------+----------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+---------+----------+----------+----------+
| test.t5 | optimize | status | OK |
+---------+----------+----------+----------+
1 row in set (1 min 41.94 sec)



CPU UTIL during OPTIMIZE TABLE:
User:23.8% Sys: 4.9% ndbd

Memory usage after OPTIMIZE TABLE:


ndb_mgm> Node 2: Data usage is 39%(13090 32K pages of total 32768)
Node 2: Index usage is 19%(3120 8K pages of total 16416)
Node 3: Data usage is 39%(13090 32K pages of total 32768)
Node 3: Index usage is 19%(3121 8K pages of total 16416)


Delete 500K (50%) of the records from t6:

mysql> set ndb_use_transactions=0;
Query OK, 0 rows affected (0.00 sec)

mysql> delete from t6 limit 100000;
Query OK, 100000 rows affected (2.77 sec)

mysql> delete from t6 limit 100000;
Query OK, 100000 rows affected (3.05 sec)

mysql> delete from t6 limit 100000;
Query OK, 100000 rows affected (3.12 sec)

mysql> delete from t6 limit 100000;
Query OK, 100000 rows affected (3.19 sec)

mysql> delete from t6 limit 100000;
Query OK, 100000 rows affected (3.24 sec)


Memory usage after deletion:


Node 2: Data usage is 39%(13090 32K pages of total 32768)
Node 2: Index usage is 15%(2521 8K pages of total 16416)
Node 3: Data usage is 39%(13090 32K pages of total 32768)
Node 3: Index usage is 15%(2522 8K pages of total 16416)


Run OPTIMIZE TABLE:


mysql> optimize table t6;
+---------+----------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+---------+----------+----------+----------+
| test.t6 | optimize | status | OK |
+---------+----------+----------+----------+
1 row in set (1 min 45.83 sec)


Memory usage after OPTIMIZE TABLE:


Node 2: Data usage is 28%(9249 32K pages of total 32768)
Node 2: Index usage is 15%(2521 8K pages of total 16416)
Node 3: Data usage is 28%(9249 32K pages of total 32768)
Node 3: Index usage is 15%(2522 8K pages of total 16416)


After restart of node 3:


Node 2: Data usage is 28%(9249 32K pages of total 32768)
Node 2: Index usage is 15%(2521 8K pages of total 16416)
Node 3: Data usage is 26%(8568 32K pages of total 32768)
Node 3: Index usage is 13%(2236 8K pages of total 16416)


Only a two percent difference between a rolling restart and OPTIMIZE TABLE. Excellent. Thank you Jonas for fixing this.
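The "two percent" follows directly from the DataMemory numbers above (simple arithmetic on the reported page counts):

```python
# Checking the claim against the DataMemory figures above.
TOTAL_PAGES = 32768            # 32K pages of DataMemory per node
after_optimize = 9249          # node 2, after OPTIMIZE TABLE
after_restart = 8568           # node 3, after a node restart
diff = (after_optimize - after_restart) / TOTAL_PAGES
print(f"{diff:.1%}")
```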

Upgrade 6.3.x to 7.0.6

Currently a number of users have reported upgrade issues going from 6.3.x to 7.0.6 (below is an example of what might happen to you).

Thus, we do not yet recommend upgrading from 6.3.x to 7.0.6 in a production system.
However, we are very interested in bug reports and help with testing.
We are, of course, working on fixing these issues and will keep you posted when things have improved. Sorry for the inconvenience.

Currently the "best practice" to upgrade is to:
1) backup / mysqldump
2) start Cluster 7.0.6 with --initial so it is completely blank
3) restore backup / load dump


ERROR 1005 (HY000): Can't create table 'test.#sql-3d25_aa905' (Errno: 140)

mysql> show warnings;
+-------+------+--------------------------------------------------------------------------+
| Level | Code | Message |
+-------+------+--------------------------------------------------------------------------+
| Error | 1296 | Error 755 'Invalid tablespace' from NDB |
| Error | 1005 | Can't create table 'test.#sql-3d25_aa905' (Errno: 140) |
+-------+------+--------------------------------------------------------------------------+

Monday, August 03, 2009

Cluster + Dolphin DX = Sweet!

At a customer I have had the opportunity to work with some really state of the art equipment.

The customer is in the online gaming industry and we are building a platform based on MySQL Cluster where we are consolidating a number of applications. Workload is mainly inserts, updates, pk reads, and index scans (quite frequent), and a few joins. The cluster handles the transient data (current working set), and is replicating to a long term data store (following this approach), and we are also feeding other back-end support systems from this.

The cluster (>8 computers) is interconnected with Dolphin DX adapters. The computers are Sun Nehalem-based servers (Intel Xeon E5540). The cluster (7.0.6) consists of 4 data nodes, each with 32GB of RAM and plenty of RAID 1+0 storage, and >4 mysql servers, all running on the Nehalems. It is so sweet :)

How much did things improve with the DX adapters over Gig-E? Just by plugging them in we got an overall 2.5-3x performance improvement (throughput up, response times down). I love it. Now we will continue optimizing the setup, schemas, queries etc to really find out.

I can really recommend the investment in DX adapters (remote memory access is faster than local disk) if you are after top-notch performance, and the company behind it, Dolphin ICS, is really fantastic to work with. If the DX switch fails, the cluster automatically falls back on the Gig-E adapters, transparently and with no service interruption (you and the applications don't notice a thing, except that performance drops).

Note: Dolphin ICS' adapters offer a socket interface. Thus, any TCP/IP application can benefit from the low latency and high bandwidth.

Thursday, July 30, 2009

ndb_restore - best practice

A really simple best practice:

Make sure there are NO mysql servers connected to the Cluster while you do the restore!
It can cause the restore to fail badly and potentially even take down your cluster.

Also, if you use disk data, make sure to remove all data files in the datadirs of the data nodes, as they are not removed when doing an --initial ... I have written a bug report about that...

(I will update the severalnines/config scripts to include this best practice!)

Wednesday, July 01, 2009

Problems with .FRM files, auto-discovery and MySQL Cluster

There are some bug reports on the auto-discovery protocol in MySQL Cluster.
The idea of the auto-discovery protocol is to fetch the .frm files for the NDB tables stored in the data dictionary of the data nodes, and put them in the data directory of the mysql server.

However, sometimes (not always, which makes it more difficult to reproduce and hence fix), the auto-discovery seems to do strange things (from this bug report):
After shutting down and restoring my cluster I get the following error.

090211 9:59:26 [Note] NDB: mismatch in frm for panel.gatewayquestions, discovering...
090211 9:59:26 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$panel/gatewayquestions
090211 9:59:26 [Note] NDB Binlog: logging ./panel/gatewayquestions (UPDATED,USE_WRITE)

This is due to the files already being in the mysql data directory. After the error, the
frm does not match the data in memory; this causes the following.
When running select count(*) from tablename;
You will get an accurate count.
When running select * from table name;
You get an error Can't find record in tablename.
I have seen it as well at some customers usually with bigger installations and many tables.

My current recommendation (work around) is to delete the FRM files associated with the NDB tables in the mysql server data directory before you start the mysql server(s).

So this is what I always include in my MySQL server startup scripts (it is also included in the Configurator scripts):
files=`find $mysql_datadir -name "*.ndb"`
for f in $files
do
  x=`basename $f .ndb`
  # make sure we leave out ndb_binlog_index and ndb_schema since they are myisam tables
  if [ "$x" = "ndb_binlog_index" ] || [ "$x" = "ndb_schema" ]
  then
    echo "Ignoring $x"
  else
    # derive the .frm filename by replacing the .ndb suffix (safer than
    # sed on the whole path, which could hit 'ndb' elsewhere in the name)
    y=${f%.ndb}.frm
    rm -f $f
    rm -f $y
  fi
done
#start the mysqld here
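To see what the loop does without touching a real installation, you can dry-run it against a scratch directory (the table name t1 is made up for the demo):

```shell
# scratch "datadir" with a fake NDB table and the myisam system table
mysql_datadir=$(mktemp -d)
touch $mysql_datadir/t1.ndb $mysql_datadir/t1.frm $mysql_datadir/ndb_binlog_index.ndb
for f in `find $mysql_datadir -name "*.ndb"`
do
  x=`basename $f .ndb`
  if [ "$x" = "ndb_binlog_index" ] || [ "$x" = "ndb_schema" ]
  then
    echo "Ignoring $x"
  else
    rm -f $f ${f%.ndb}.frm    # drop the .ndb file and its matching .frm
  fi
done
ls $mysql_datadir             # only ndb_binlog_index.ndb is left
```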
If I want to restore data I usually:
  1. stop the cluster
  2. start the data nodes with --initial
  3. stop the mysql servers (make sure they are not started)
  4. restore the cluster data
  5. start the mysql servers (clearing out whatever .frm files coming from the ndb tables)

Monday, June 29, 2009

How to upgrade from 6.3 to 7.0

In order to upgrade from 6.3 to 7.0 you must follow these rules:
  1. IT IS ONLY POSSIBLE TO UPGRADE FROM 6.3 TO 7.0.6 OR LATER!
  2. I would recommend upgrading from 6.3.x to 6.3.latest before doing the upgrade. If the upgrade still does not work and you have a good config.ini, you have probably hit a bug.
  3. You cannot upgrade from 6.3 to the multi-threaded binary of 7.0.6 in one go.
    You have to upgrade from 'ndbd' (6.3) --> 'ndbd' (7.0.6)
    Then you can do 'ndbd' (7.0.6) --> 'ndbmtd' (7.0.6)
So don't try to upgrade to 7.0.5 - it will fail.
As usual when upgrading (version, configuration variables etc) you must do a rolling restart, restarting in this order:
  1. ndb_mgmd (management servers)
  2. ndbd (data nodes)
  3. mysqld (mysql servers / direct api applications)
Distribute the 7.0.6 binaries
Copy the binaries to each host (i.e., replace existing 6.3 binaries) in the cluster.

Upgrading the Management Server(s)

In 6.3 you started the management server as (e.g):
ndb_mgmd -f /etc/mysql/config.ini
In 7.0.6 you have to start them slightly differently, but first you have to stop both management servers (if you have two, which you really should) because of this bug:
killall ndb_mgmd
And.. I never use the management client to start/stop nodes, because it only works sometimes, which makes it hard to rely on and difficult to file bugs against.

When you have killed both you can then do:
ndb_mgmd -f /etc/mysql/config.ini --configdir=/etc/mysql --reload
The management server in 7.0.6 writes a binary config file in --configdir, which is a configuration cache. --reload will reload the config.ini and update that cache.

Upgrading the Data Nodes

Restart the data nodes on each machine one at a time:
killall ndbd
ndbd --ndb-nodeid=X --ndb-connectstring="managementhostA;managementhostB"
ndb_waiter --ndb-connectstring="managementhostA;managementhostB"
# when ndb_waiter exits, you can restart the next data node.

Upgrading the MySQL Servers/Direct APIs

Stop and start the mysql servers:
killall mysqld  # (or mysqladmin shutdown; both send signal 15)
mysqld_safe --defaults-file=/etc/config/my.cnf
Upgrading to multithreaded 7.0.6 data nodes

When the Cluster has been restarted with 7.0.6 you can upgrade to multithreaded data nodes.
The binary for the multithreaded data node is called 'ndbmtd'.

You need to set in config.ini (on both management servers):
[ndbd default]
..
MaxNoOfExecutionThreads=#number of cores you have (up to 8)
..
Then stop both management servers
killall ndb_mgmd
When you have killed both you can then start both by doing:
ndb_mgmd -f /etc/mysql/config.ini --configdir=/etc/mysql --reload
and then restart the data nodes one at a time:
killall ndbd
ndbmtd --ndb-nodeid=X --ndb-connectstring="managementhostA;managementhostB"
ndb_waiter --ndb-connectstring="managementhostA;managementhostB"
# when ndb_waiter exits, you can restart the next data node.
Using Severalnines Configurator

If you have a Severalnines installation of MySQL Cluster 6.3, then you can upgrade in the following way:
  1. Make sure you have the files:
    mysqlcluster-xy/cluster/scripts/install/.s9s/hostnames
    mysqlcluster-xy/cluster/scripts/install/.s9s/ndb_hostnames
    mysqlcluster-xy/cluster/scripts/install/.s9s/mysql_hostnames
    mysqlcluster-xy/cluster/scripts/install/.s9s/mgm_hostnames
    Otherwise you have too old a version of the Configurator package and need to generate a new one.
  2. Generate a new Config using 7.0.6 with the same nodes, data memory, data dirs etc.
    Select "no MT" when asked for "Number of cores:"


  3. Install the package
    Let's assume you have 6.3 in:
    /root/mysqlcluster-63/
    Install mysqlcluster-70.tar.gz
    cd /root/
    tar xvfz mysqlcluster-70.tar.gz
    cd mysqlcluster-70/cluster/scripts/install
    Copy the .s9s catalog from 6.3 so you get the hostnames
    cp  -r /root/mysqlcluster-63/cluster/scripts/install/.s9s  .
  4. Download the binary or build the source:
    ./download-binary.sh
    or
    ./download-and-compile.sh
  5. Install and perform a rolling restart
    ./install-cluster.sh
    cd ..
    ./rolling-restart.sh
  6. Change from ndbd --> ndbmtd
    vi ../config/config.ini
    Locate MaxNoOfExecutionThreads, un-comment it if needed, and set it to as many cores as you have:
    [ndbd default]
    ...
    MaxNoOfExecutionThreads=8
    ...
    Then change in the scripts from ndbd -> ndbmtd
    sed -i 's#libexec/ndbd#libexec/ndbmtd#g' *.sh
    Important! Don't leave out 'libexec' above!
    And do a rolling restart:
    ./rolling-restart.sh
  7. Luckily it is much easier to upgrade from e.g. 7.0.6 to 7.0.x, and when 7.0.7 is out I will show how to do that in two lines.

Monday, June 15, 2009

Webinar - MySQL Cluster: Architectural Deep Dive

There will be a webinar presented by Tomas Ulin (Director MySQL Server Technologies) and Matthew Keep (MySQL Cluster Product Management). The webinar covers:
  • how MySQL Cluster achieves predictable and consistent real time performance
  • how MySQL Cluster achieves continuous availability with sub-second fail-over, automated recovery and geographic replication
  • how MySQL Cluster enables users to dynamically scale throughput and data capacity
  • how to get started with MySQL Cluster
and I will be on the panel answering your questions!

When: Wednesday 17th of June 2009, 0930PST/1230EST/1730GMT/1830CET
See details and how to register here.

MySQL Cluster - versions NOT to use!

If you are using the following versions of MySQL Cluster:

MySQL Cluster 6.3.17 - SHOW TABLES can drop tables...
MySQL Cluster 6.3.18 - the above wasn't really fixed (fixed in 6.3.20)
MySQL Cluster 6.3.19 - the above wasn't really fixed (fixed in 6.3.20)
MySQL Cluster 6.3.21 - regression, pulled back because of this.

If you are on any of these versions - then upgrade!
The latest source and binary release is MySQL Cluster 6.3.24.

Friday, May 15, 2009

HA MySQL, write scaling using Cluster to non-cluster replication

There are various setups and solutions to solve the problem of having redundant mysql masters and scaling writes on them. MySQL Cluster can be an alternative. Here is why and how:
Pros:
  • No need to worry about DRBD failover and recovery times
  • No need to worry about potential data loss when failing over from the active master to the standby master (have all binlogs really been replicated over from the failed master?)
  • Writes go to any mysql server connected to the Cluster - thus writes are easy to scale
    The data nodes will then send the committed writes to every mysql server with --log-bin enabled.
  • MySQL Cluster can handle a lot of writes (but the slaves can become the bottleneck)
  • Data nodes can be added online (MySQL Cluster 7.0) - thus write scaling is easy and can be done on-demand
  • Add new, consistent, slaves online (thanks to the online backup in MySQL Cluster)
  • Simple and very fast master failover policy - all masters connected to the cluster will receive the same events, thus (when both are working) they will have the same content in the binary log - none is lagging behind.
  • No additional hardware compared to a normal HA setup of the masters (two computers are enough)
Worried that NDB stores data in RAM? Well, use disk-based tables!

Setting up MySQL Cluster
You need to have at least:
  • 2 computers (HA with dual switches, nics and bonding, split brain only if both links are broken)
    - dual switches, dual nics (bonded).
    - each computer runs:
    1 data node (ndbd/ndbmtd),
    1 mysql server (master),
    1 ndb_mgmd
    or
  • 4 computers (better HA and no risk for network partitioning/split brain)
    - dual switches, dual nics (bonded).
    - two computers each runs:
    1 data node (ndbd/ndbmtd),
    - two computers each runs:
    1 mysql server (master), 1 ndb_mgmd.
Then it is of course possible to have more data nodes, more mysql servers etc, but we recommend two mysql servers that are masters in the replication so it is possible to failover between them.

With www.severalnines.com/config you can easily generate either of these setups, and have a running Cluster in a couple of minutes.

The idea here (just as in a normal HA MySQL setup) is to have two masters, master_a and master_b. The slaves will replicate from master_a and failover to master_b if master_a fails!

Note that you can write to _all_ mysql servers, and the data nodes will send the changes to the mysql servers that have --log-bin enabled.

Preparation of Slaves
You must choose between 1) and 2) below, else the replication won't work:

1. Enable IDEMPOTENT mode in my.cnf of the slaves using slave-exec-mode:
--slave-exec-mode=IDEMPOTENT
2. If you want to use STRICT (the default), then you have to set on the master mysql servers (my.cnf):
--ndb-log-update-as-write=0
Meaning that updates will be logged as updates, and not as writes. The downside with this is that twice as much data will be sent (there is a before image and an after image for each update).

Each slave must have the table mysql.ndb_apply_status:
mysql_slave> use mysql;
mysql_slave> CREATE TABLE `ndb_apply_status` (
`server_id` int(10) unsigned NOT NULL,
`epoch` bigint(20) unsigned NOT NULL,
`log_name` varchar(255) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
`start_pos` bigint(20) unsigned NOT NULL,
`end_pos` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`server_id`) USING HASH)
ENGINE=INNODB DEFAULT CHARSET=latin1;

This table is used to know what data has been applied on the slave and from which master (see below for a longer explanation).

Preparation of Masters (masterA, masterB)
Ensure you have the following in the my.cnf of the master mysql servers:
[mysqld]
...
#if you didn't set --slave-exec-mode=IDEMPOTENT on the slaves you must set:
ndb-log-update-as-write=0
#else
ndb-log-update-as-write=1
#endif -- so pick one of the above!!

#enable cluster storage engine:
ndbcluster

#connect to cluster:
ndb-connectstring=<hostname of ndb_mgmd:port>

#sync after each epoch (10 ms of transactions) - here is a disk tuning possibility.
sync-binlog=1

#you may also want to set binlog_cache_size (if Binlog_cache_disk_use is growing):
binlog_cache_size=1048576

GRANT the replication user, e.g.:
mysql> GRANT REPLICATION SLAVE ON *.* TO repl@'%' IDENTIFIED BY 'password';
Creating the tables

On MySQL Cluster
Create all the tables for the application(s) using engine=NDBCLUSTER.
On each slave
Create all the tables for the application(s) using e.g. engine=MYISAM or engine=INNODB.
Staging a slave - with traffic on Cluster:
In order to get a consistent slave you have to do as follows:

Backup the Cluster
Using the management client:
ndb_mgm>start backup
Copy all the backup files to the same location.
If you use the Configurator scripts:
./start-backup.sh /path/mycentral_backupdir/
all backup files will be copied from the nodes to /path/mycentral_backupdir/

Remember the stopGCP=<some number> that is printed at the end of the backup.
If you use ndb_mgm -e "start backup" then it will look like this:
ndb_mgm -e "start backup"
Connected to Management Server at: localhost:1186
Waiting for completed, this may take several minutes
Node 3: Backup 14 started from node 1
Node 3: Backup 14 started from node 1 completed
StartGCP: 98278 StopGCP: 98281
#Records: 2058 #LogRecords: 0
Data: 51628 bytes Log: 0 bytes
If you use ./start-backup /tmp then it will look like this:
Backup has completed on data nodes.
Copying backup files to /tmp//BACKUP-2009May09-150743
BACKUP-15.3.log 100% 52 0.1KB/s 00:00
BACKUP-15-0.3.Data 100% 26KB 26.1KB/s 00:00
BACKUP-15.3.ctl 100% 11KB 10.6KB/s 00:00
BACKUP-15.4.log 100% 52 0.1KB/s 00:00
BACKUP-15-0.4.Data 100% 26KB 25.7KB/s 00:00
BACKUP-15.4.ctl 100% 11KB 10.6KB/s 00:00
Backup completed and backup files can be found in /tmp//BACKUP-2009May09-150743
Use the StopGCP to setup Cluster to non-cluster replication:
StopGCP: 98482
This is important as this tells where the backup ended, and from what point the slave should get the changes. This number will be used later when the replication is setup.
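If you script the backup step, the StopGCP can be picked out of the ndb_mgm output, for example (the log line below is copied from the output above):

```shell
# extract the number after "StopGCP:" from the backup completion line
line="StartGCP: 98278 StopGCP: 98281"
stopgcp=$(echo "$line" | sed 's#.*StopGCP: \([0-9]*\).*#\1#')
echo $stopgcp    # -> 98281
```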

Convert Cluster backup data files to CSV files
For each backup file (for each data node in the master cluster) run:
ndb_restore -c <connectstring> \
  -b <backupid> \
  -n <nodeid> \
  --print-data \
  --print-log \
  --append \
  --tab=<output directory> \
  --fields-enclosed-by="'" \
  --fields-separated-by="," \
  --lines-terminated-by="\n"


or if you are using the Configurator scripts:
./restore-backup.sh --csv --backupdir=/path/mycentral_backupdir/ --csvdir=/mycsv/
This will create one txt file for each table and put it in --csvdir (Configurator) or --tab=<outputdir> for the vanilla ndb_restore.

It will also tell you (if you used start-backup.sh) the stop gcp:
./restore-backup.sh --csv
--backupdir=/cluster/backups/BACKUP-2009May10-165004
--csvdir=/cluster/backups/csv

...

To setup replication use StopGCP: 1511
Data is appended to the txt files, so running the command more than once will duplicate data; delete the old txt files first.

Loading the .txt (CSV) into Slave
For all tables:
mysql_slave> LOAD DATA INFILE 'tablename.TXT'
INTO table <tablename>
FIELDS TERMINATED BY ','
ENCLOSED BY '\''
LINES TERMINATED BY '\n';


Now the slave is in the same state as the Cluster was when the backup finished.

Staging a slave - no traffic on Cluster:

You can use mysqldump for this.
host> mysqldump -u<user> -p<password> <database> > database_dump.sql
Then you copy database_dump.sql and change engine=ndbcluster to engine=innodb:
host> sed -i -e 's#ndbcluster#innodb#' database_dump.sql
Then you have to find the good starting position from master_a:
master_a> show master status;
Remember the binlog filename and position, you will need it in the next step.

Starting the replication
Create and grant the replication user (not shown here)

If you have staged the slave using mysqldump, then you go directly to CHANGE MASTER.. using the BINLOG and POS from the previous step.

Find out where the slave should start replicating from using the stopGCP from the previous step "Backup the Cluster".
mysql_masterA> SELECT @file:=SUBSTRING_INDEX(File, '/',-1),
@pos:=Position
FROM mysql.ndb_binlog_index
WHERE gci>@stopGCP ORDER BY gci ASC LIMIT 1;
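The SUBSTRING_INDEX(File, '/', -1) in the query keeps only the part after the last '/', since ndb_binlog_index stores the binlog file with a path prefix while CHANGE MASTER wants just the filename. The shell equivalent (the example filename is made up):

```shell
file="./mysqld-bin.000003"    # as stored in mysql.ndb_binlog_index.File
echo "${file##*/}"            # strip everything up to the last '/' -> mysqld-bin.000003
```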

If the query returns nothing (this means that no changes have happened on the cluster after the backup was completed) run:
mysql_masterA> SHOW MASTER STATUS;
Then change master
mysql_slave>   CHANGE MASTER TO MASTER_HOST='masterA',
MASTER_PORT=3306,
MASTER_USER='repl',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='@file',
MASTER_LOG_POS=@pos;
mysql_slave> START SLAVE IO_THREAD;
## verify slave connected to masterA:
mysql_slave> SHOW SLAVE STATUS\G
if above is ok:
mysql_slave>   START SLAVE SQL_THREAD;
mysql_slave> SHOW SLAVE STATUS\G

Master Failover
Failover from masterA to masterB

mysql_slave> STOP SLAVE;
mysql_slave> SELECT @epoch:=max(epoch) FROM mysql.ndb_apply_status;
mysql_masterB> SELECT @file:=SUBSTRING_INDEX(File, '/', -1),
@pos:=Position
FROM mysql.ndb_binlog_index
WHERE epoch>@epoch ORDER BY epoch ASC LIMIT 1;
If the query returns nothing (meaning nothing has been written to the master since the failure of masterA) do:
mysql_masterB>  SHOW MASTER STATUS;
#use the position and file in the query below:
Then change master

mysql_slave> CHANGE MASTER TO MASTER_HOST='masterB',
MASTER_PORT=3306,
MASTER_USER='repl',
MASTER_PASSWORD='password',
MASTER_LOG_FILE='@file',
MASTER_LOG_POS=@pos;
mysql_slave> START SLAVE;
Master crashes
When the master restarts after a crash, it will write a LOST_EVENTS event into its binary log.
If a LOST_EVENTS event is received on a slave, the slave will stop, and you have to do a Master Failover.

Master connection glitch between data node and a master mysql server
If the master loses contact with the cluster even for a very short period of time, a LOST_EVENTS event will be written into the binary log. See above.

ndb_apply_status, epochs and ndb_binlog_index
The mysql.ndb_apply_status table is very important, as it stores which epochs from the Cluster the slave has applied.
The epochs are consistent units, each storing 10ms (controlled by TimeBetweenEpochs in config.ini) of committed transactions.
On the master side, the epochs are mapped to binary log filename and position in the mysql.ndb_binlog_index table.
The mysql.ndb_binlog_index table is a MYISAM table local to each master, and is the glue between which epoch has been applied on the slave and the binary log and position it corresponds to on the master.

Monitoring
Replication - you can use Enterprise Monitor or whatever monitoring tool you prefer to monitor the replication and the mysql servers. Scripts for automatic failover must be implemented.
Cluster - you can use cmon to monitor the data nodes and the other components connected to the Cluster. The web interface will soon look better :)

Possible Extensions
You can of course give the master mysql servers one virtual IP and use Heartbeat to fail over the IP. This simplifies things for the applications.

Bugs and Issues - TIMESTAMP
Be careful with TIMESTAMP. Currently the ndb_restore program does not generate the timestamp in the MySQL TIMESTAMP format, but rather as a UNIX_TIMESTAMP.
I have written a bug report on this. In the meantime, you have to convert the TIMESTAMP columns in the .txt files to the MySQL TIMESTAMP format YYYY-MM-DD HH:MM:SS.
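A sketch of such a conversion with GNU date, assuming the .txt files contain plain UNIX_TIMESTAMP values (the example value is made up; use the server's timezone instead of UTC in a real conversion):

```shell
# turn a UNIX_TIMESTAMP value into the MySQL TIMESTAMP format (UTC here)
ts=1234567890
date -u -d @$ts +"%Y-%m-%d %H:%M:%S"    # -> 2009-02-13 23:31:30
```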

Please note that this is not a problem if you use mysqldump

I will update this part if you find any other problems - just let me know.

Thanks to Magnus Blåudd, Cluster Team, for your input on this post!

Friday, May 08, 2009

Cluster fails to start - self diagnosis (sanity check mysql cluster)

If the MySQL Cluster fails to start, what can you do and what to check?
Here are some sanity checks.

Initial system start
If it is the first time you start up the cluster and it fails, then check the following:
  • Network - Check /etc/hosts
    You should have (on the line for localhost)
    127.0.0.1 localhost
    and nothing else! Then usually, the normal IP address host mapping follows:
    10.0.1.10 hostA
    10.0.1.11 hostB
    ...
    REDHAT and others can add a lot of other things to the "localhost" line(s), and this results in the nodes not being able to connect to each other (they will be stuck in phase 0 or phase 1)
  • Network - Check if you can ping the machines
  • Network - Check if you have any firewalls enabled (e.g. check with /sbin/iptables -L)
    Disable the firewall in that case. Exactly how depends on the OS and Linux distribution.
    On Redhat systems, SELinux might also be enabled. Googling "disable firewall <your distro>" should give answers. A firewall is the most common culprit preventing the nodes in the cluster from talking to each other.
  • RAM - Check if you have enough RAM to start the data nodes
    Check using 'top' on the computers where the data nodes are running, while you start the data nodes. So always have 'top -d1' running on the data nodes while they are starting up.
  • RAM - If you are allocating a lot of DataMemory, then you may also need to increase the parameter TimeBetweenWatchdogCheckInitial in [NDBD DEFAULT] of your config.ini. Set it to 60000 if you have >16GB of RAM.
  • Disk space - check using 'df -h' that you have enough space where the data nodes have their data directories.
  • CPU - if you use 7.0, enable multi-threading (8 cores) but only have a 4 core system or less, then there is a chance that the Cluster won't come up, due to competition for resources. I have seen this happen but have no conclusive evidence yet.
  • OS - if you have a mix of OSs where the data nodes run, then it can be a problem. E.g., I have seen problems even when Fedora was used on all machines but one machine had a slightly older linux kernel. Also, it won't work if one of the nodes is RH4 and another is RH5 (at least mixing RH3 and RH4 doesn't work).
So for the "initial start" it is mainly environmental factors preventing the cluster from starting.
If you still have problems, ask on the Cluster Forum or MySQL Support if you have Support for advice.

Also, disable NUMA (Cluster is not NUMA aware) and make sure you don't swap!

System start
If you can't restart the cluster, and you haven't changed the configuration and haven't been filling up the disks with other things (i.e., check disk, RAM, network as above), then you have probably hit a bug. Ask on the Cluster Forum, or MySQL Support if you have Support, for advice.

In many cases it is recoverable by restarting one node in each node group (instead of all data nodes), and try out different combinations. When the "half" cluster has started, then you can restart the rest of the data nodes with --initial and they will sync up from the already started nodes.

Node (re)start
If you can't restart a failed data node, and you haven't changed the configuration and haven't been filling up the disks with other things (i.e., check disk, RAM, network as above), then you have probably hit a bug, but there might also have been corruption of the data files (this depends on how the computer/data node crashed).

You can try to do an initial node restart (see below).

Ask on the Cluster Forum or MySQL Support if you have Support for advice.

Initial Node (re)start
If you can't restart a failed data node with --initial, and you haven't changed the configuration and haven't been filling up the disks with other things (i.e., check disk, RAM, network as above), then you have probably hit a bug. Ask on the Cluster Forum, or MySQL Support if you have Support, for advice.

Collecting error data
The program 'ndb_error_reporter' is great for collecting log files from the data nodes and management servers; it puts them into a single bz2 file. Send this file, together with detailed steps of what you have done, to the Cluster Forum or to MySQL Support if you have Support.

Tuesday, April 28, 2009

Online configuration update and a new script

If you are using the scripts from the Configurator (and yes, the broken link is now fixed so it is actually possible to get the config package..), and want to tweak your configuration on a started Cluster, then you can do as described below.

There is also a new script in Configurator 2.2 (released 28th April 2009, PST1153AM , GMT07:53) that lets you check the values of different config parameters. Its use is described below.

Pretend we want to extend the DataMemory from 1024M to 1536M:
  1. cd mysqlcluster-70/cluster/scripts
  2. vi ../config/config.ini
    This is the master config.ini file and will be copied out to the management server(s). Don't change /etc/mysql/config.ini, as that will have no effect.
  3. Set DataMemory=1536M then save and exit
  4. sh rolling-restart.sh
  5. When it has finished, verify that the Cluster has the new settings by running query-param.sh:
    sh query-param.sh DataMemory
    Parameter=DataMemory
    Data Node 3 : 1610612736
    Data Node 4 : 1610612736
    Please note that the output is in bytes.
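The byte values reported by query-param.sh map back to the config.ini setting like this:

```shell
# 1536M in config.ini is reported as bytes by query-param.sh
bytes=1610612736
echo "$(( bytes / 1024 / 1024 ))M"    # -> 1536M
```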
The rolling-restart.sh script implements this procedure described in the reference manual.
Many other parameters can be changed using the above steps (e.g, MaxNoOfConcurrentOperations etc). Be careful though that you have memory available (the script does not check for that yet).

Parameters that change the file system one way or another (NoOfFragmentLogFiles, FragmentLogFileSize, Datadir, BackupDataDir, FileSystem*, InitFragmentLogfiles) must be changed with an initial rolling restart (all parameters are listed here in the ref manual):

Pretend we want to change the size of the Redo log (NoOfFragmentLogFiles).
From the beginning we have:
./query-param.sh NoOfFragmentLogFiles
Parameter=NoOfFragmentLogFiles
Data Node 3 : 6
Data Node 4 : 6

Then we modify our "template" config.ini and go ahead:
  1. vi ../config/config.ini
    This is the master config.ini file and will be copied out to the management server(s)
  2. Set NoOfFragmentLogFiles=7 then save and exit
  3. sh rolling-restart-initial.sh
  4. When it has finished, verify that the Cluster has the new settings by running query-param.sh:
    ./query-param.sh NoOfFragmentLogFiles
    Parameter=NoOfFragmentLogFiles
    Data Node 3 : 7
    Data Node 4 : 7
Have fun upgrading but remember - don't try to start with a smaller configuration on an already existing configuration (i.e., it is safe to increase parameters, but be very careful if you decrease).

Monday, April 27, 2009

MC 7.0 New Features, LDAP For MySQL Cluster - UC2009 sessions

The presentations from the sessions MySQL Cluster 7.0 - New Features and LDAP For MySQL Cluster - back-ndb, and all other sessions on MySQL Cluster, are available for download from www.mysqlconf.com.

Saturday, April 25, 2009

Easiest way to try out new MySQL Cluster 7.0.5

Here is how to install and start MySQL Cluster 7.0.5 in eight steps:
  1. Go to the configurator and setup the configuration you wish to have. Make sure you select "MySQL Cluster 7.0.x" in the first drop down.
  2. Copy the package you either download or receive by email to one of the computers that will run one of the management servers.
  3. tar xvfz mysqlcluster-70.tgz
  4. cd mysqlcluster-70/scripts/install
  5. Download MySQL Cluster 7.0 from MySQL website and put it in mysqlcluster-70/scripts/install
  6. sh dist-tgz.sh
  7. sh bootstrap.sh
  8. cd .. && sh start-cluster-initial.sh
Okay, step 1) requires that you setup shared ssh keys and a bit of typing and clicking, but gives you a production class configuration, and scripts to manage the cluster from a single point. Including backups and restores of MySQL Cluster!

I have mostly tested the scripts on Linux - please let me know about issues on other platforms!

Configurator - new version 2.1

The Configurator has undergone some serious fixing as 2.0 had some issues.
Here is a brief list of the enhancements:
  • error handling and progress reports - I have rewritten almost all scripts with better error handling and progress reports. E.g., if a node fails to start during a rolling restart, the rolling restart script is aborted and you get a suggestion for how to recover. If you run start-cluster-initial.sh on an already started cluster, it will refuse to run. There are many more changes like this to check the status of the nodes (both data nodes and sql nodes) during startup
  • download link to package - When you have entered the email address, then you will also get a link where to download the package, in case you experience problems with receiving the email (I have raised an issue to the hosting company about this)
  • restructure - I have also moved the install scripts to scripts/install and the monit scripts to scripts/extra. The install_initd_linux.sh uses chkconfig, which does not work on all linux distros. Going to fix that.
  • bugs - fixed a number of small bugs and annoyances.

Thursday, April 16, 2009

max_rows in cluster

If you are going to load a lot of records into Cluster, don't forget to set max_rows!
My colleague, Yves at BigDBAhead, has also blogged about this, but I also ran into the same problem recently.

I tried to populate 100M records into a 4 node cluster, and the data nodes went down with the following error message in the error logs:

"2304 Array index out of range"

So the error message is crap - my opinion is that there should be a proper error message propagated up to the mysql server. There is a bug report on this.

Simplified, what the error message means is that you have run out of "index slots" in the Hash Table storing the hashes of the Primary Keys. This is because each table is divided into a number of partitions, and each partition can only store X number of records in the hash table (about 100M).

The workaround when creating huge tables is to set max_rows=<2 x the number of records that you plan to store in the table>. The 2x is just to be safe. Internally this will create more partitions; the cost of this in terms of memory is negligible.

Example:
CREATE TABLE t1 (
a INTEGER PRIMARY KEY
) ENGINE=ndbcluster MAX_ROWS=1000000000;
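Using the ~100M-records-per-partition figure from above, a back-of-the-envelope check of how many partitions a table needs (the per-partition limit is approximate):

```shell
records=1000000000        # rows we plan to store
per_partition=100000000   # approximate hash-index limit per partition (see above)
# round up: minimum number of partitions needed to hold the rows
parts=$(( (records + per_partition - 1) / per_partition ))
echo "need at least $parts partitions"    # -> need at least 10 partitions
```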