Friday, November 27, 2009

Node failure handling - take 2

Jonas improved the node failure handling in MySQL Cluster 7.0.9 (and 6.3.29), so here comes a re-run of a previous blog post. The node failure handling time has improved a lot - see below.

I created in total 11345 tables, each with 128 columns, and then hit this bug.
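For reference, here is a minimal sketch of how such a setup can be scripted from the mysql client (my own illustration, not the script used for this test - the hypothetical create_many_tables procedure below also creates tables with only a few columns instead of 128):

DELIMITER //
-- Hypothetical helper: create n small NDB tables (t_0 .. t_<n-1>) via dynamic SQL.
CREATE PROCEDURE create_many_tables(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    SET @ddl = CONCAT('CREATE TABLE t_', i,
                      ' (id INT PRIMARY KEY, c1 INT, c2 VARCHAR(32)) ENGINE=NDBCLUSTER');
    PREPARE stmt FROM @ddl;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL create_many_tables(11345);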

When all tables were created, I stopped one node and measured how long it took for the other node to complete the node failure handling. Here is what was written in the cluster log (compare the timestamps of the first and the last line):

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Node 3 Disconnected
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Network partitioning - arbitration required
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: President restarts arbitration thread [state=7]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 closed
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Arbitration won - positive reply from node 1
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: GCP Take over started
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: Node 4 taking over as DICT master
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: GCP Take over completed
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: kk: 9670/4 0 0
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: LCP Take over started
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: LCP Take over completed (state = 5)
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 3: Node shutdown completed.
2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-11-27 13:39:22 [MgmtSrvr] INFO -- Node 4: Started arbitrator node 1 [ticket=2a74000295047d39]
2009-11-27 13:40:20 [MgmtSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 1 min. - state = 6
2009-11-27 13:41:03 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 opened

Now, the node failure handling was completed in 1 minute 42 seconds! This is quite an improvement on the 17 minutes measured before Jonas made his magic (although that was measured with 16000 tables; I will come back with an exact comparison when the bug mentioned above is fixed).

What does it mean? It means that we can start to recover a failed data node much earlier now!

Thursday, November 26, 2009

Scaling - going from 2 to X data nodes

When moving from two data nodes to a bigger Cluster, it is not necessarily true that you will get better performance. In fact, performance can get worse.

Here are some things to think about:
  • Database load (traffic to Cluster) - if you can handle query load X on a two data node cluster and move the same load X to a four data node cluster, you will likely end up handling only about 0.8X, i.e., a performance degradation. This has to do with 1) buffers not filling up fast enough, so the data nodes fall back to "timeout"-based sending, or 2) access patterns that do not scale. To correct 1) you need to increase the load on the cluster so that the internal communication buffers fill up faster.
Access pattern related "problems":

  • For primary key operations (reads, updates, deletes) you will always go to the correct data node to fetch the data, no matter how many nodes you have. So there are no problems with these types of operations.

  • For index scans on a key you will always scan all data nodes. This can be expensive if the number of records you retrieve is "small" (about <=64 records), since setting up the index scan on every data node is expensive, while sending the actual data is relatively cheap. This can be a cause of performance degradation when moving from 2 data nodes to a bigger cluster. There is not much you can do about it, unless you can redesign your schema so that the most important requests can use the "index scan on part of the PK" way of accessing data.

  • For index scans on part of the PK there are two options - index scan all data nodes, or index scan only one data node. This can also be a cause of performance degradation when moving from 2 data nodes to a bigger cluster.

Example with index scans on part of PK:
CREATE TABLE t1 (
userid integer,
serviceid integer,
data varbinary(512),
INDEX (serviceid),
PRIMARY KEY(userid, serviceid)
);
  • This table is by default partitioned on the primary key (every table in Cluster is by default partitioned on the primary key). In a four data node cluster this means that the records for one user can be spread out over 4 different partitions.
  • You may have this data
<userid, serviceid, data>
<1,1,aaa>
<1,2,aaa>
<1,3,aaa>
<1,4,aaa>

The record <1,1,aaa> may be on one data node, <1,2,aaa> may be on another, and so on. It is random.

If you now do a SELECT * FROM t1 WHERE userid=<value>, then every data node will be scanned for data, and the results are then merged in the NDB API (mysqld).
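To illustrate this from the SQL side, EXPLAIN PARTITIONS can be used (a hedged example; the partition names and the number of partitions are made up and will differ on your system). Since the table is partitioned on the whole primary key (userid, serviceid) and only userid is constrained, no partition can be pruned away:

-- Default partitioning on PRIMARY KEY(userid, serviceid):
-- only userid is constrained, so no partition can be pruned.
EXPLAIN PARTITIONS SELECT * FROM t1 WHERE userid=1;
-- partitions: p0,p1,p2,p3 (example output - all partitions are listed)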

Good or bad? It depends - if your user has >64 services (the exact break-even point is CPU and network dependent), then it is better to scan all data nodes.

If your user has <=64 services then it is better to involve only one data node.

You can then create the table as follows:
CREATE TABLE t1 (
userid integer,
serviceid integer,
data varbinary(512),
INDEX (serviceid),
PRIMARY KEY(userid, serviceid)
) PARTITION BY KEY (userid);
- You may still have this data:
<userid, serviceid, data>
<1,1,aaa>
<1,2,aaa>
<1,3,aaa>
<1,4,aaa>
All records with userid=1 are now located in one data node (one partition), no matter how many data nodes you have!

Now, the primary key is still (userid, serviceid), but the data is distributed over the partitions by looking at <userid> only! This ensures that a user will be on only one partition, irrespective of how many services the user has. If you query on <userid>, then only one partition will be involved in the search, since for a particular <userid> in the table t1, all data for that user is in only one partition.

If you now do a SELECT * FROM t1 WHERE userid=<value> then only one data node will be scanned.
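Again as an illustration (the partition name is an example and will differ on your system), EXPLAIN PARTITIONS now shows that the query is pruned to a single partition:

-- PARTITION BY KEY (userid): the WHERE clause fully determines the partition.
EXPLAIN PARTITIONS SELECT * FROM t1 WHERE userid=1;
-- partitions: p2 (example output - only one partition is listed)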

For completeness of the example: SELECT * FROM t1 WHERE serviceid=<value> will cause all data nodes to perform an index scan on the index (serviceid), since the data is not partitioned by serviceid, and serviceid is not the leftmost column of the primary key (indexed columns are used from left to right), so the scan cannot be pruned to a single data node.
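The same kind of hedged EXPLAIN PARTITIONS check (example output again) shows this query touching every partition while using the serviceid index:

-- Not partitioned by serviceid, so every partition is index scanned.
EXPLAIN PARTITIONS SELECT * FROM t1 WHERE serviceid=2;
-- partitions: p0,p1,p2,p3 key: serviceid (example output)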

The main point is that the fixed cost of setting up an ordered index scan is high, but retrieving the records is cheap. So if you return many rows, reading from all data nodes is fine; if you return only a few rows, it pays off to involve only one data node.

If you later plan to increase the number of data nodes, say from 4 to 8, then using PARTITION BY KEY where possible makes such upgrades (adding more data nodes) easier and less susceptible to performance decreases.
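As a sketch of what that scale-out can look like (this assumes the online add-node support in MySQL Cluster 7.0: the new data nodes are first added to the configuration, started, and put into a new node group via the ndb_mgm client), existing tables can then be redistributed from the SQL side:

-- After the new data nodes are running and a node group has been created for them,
-- redistribute the existing data onto the new node group, online:
ALTER ONLINE TABLE t1 REORGANIZE PARTITION;
-- Reclaim the space left behind on the old partitions:
OPTIMIZE TABLE t1;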

This is how it looks on my system (as you can see, partitioned searches (using PARTITION BY KEY) are faster up to ~64 records).

MySQL Cluster 7.0.9b in Configurator and Sandbox

Configurator and Sandbox scripts are updated to use 7.0.9b.

You are recommended to upgrade!

The upgrade can be performed using these scripts:
Binary distribution - Upgrade script from 7.0.* -> 7.0.9b is here.
Source distribution - Upgrade script from 7.0.* -> 7.0.9b is here.
  1. Copy the scripts to "install/"
  2. chmod u+x upgrade-script.sh
  3. ./upgrade-script.sh
After step 3) you need to run either:
  • download-binary.sh
  • download-and-compile.sh
and then
  • rolling-restart.sh