Tuesday, April 27, 2010

MySQL Cluster - SPJ Preview - Feedback welcome

SPJ (a preview, not production ready) is a new feature that allows some types of JOINs to be pushed down and executed inside the data nodes! In many cases this gives much faster JOIN execution.

Now we would love to get your feedback on this new feature:

  • Does what we have right now improve performance for you?
  • Are there other types of JOINs we should support in order to improve performance in your application (currently only eq_ref is supported)?
  • What application are you using?

There are some limitations currently:

  • Node failure handling of SPJ is not complete, so if a data node crashes there can be side effects.
  • only eq_ref is supported - other JOINs are executed as normal.
  • Don't put this in production.

Obtaining the MySQL Cluster SPJ preview version:
  • You must know how to build MySQL Cluster from source
  • www.severalnines.com/config includes "7.1 SPJ Preview" (the scripts will compile and distribute the binaries for you) - just select "MySQL Cluster 7.1.3-SPJ" from the Cluster Version drop down.
  • The source code (if you don't use severalnines/config) can be found here.

How to test:
  • Run a JOIN query:
    mysql> set ndb_join_pushdown=0;
    mysql> run join query
    mysql> set ndb_join_pushdown=1; -- enables SPJ
    mysql> run join query again
    Did it help?
  • Run EXPLAIN on the queries (especially if SPJ did not help)
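As a concrete sketch of the procedure above (the table and column names here are hypothetical; all that matters is that the lookup on t2 is an eq_ref join on its primary key, the only join access type SPJ currently pushes down):

```sql
-- Hypothetical tables: t2.pk is t2's primary key, so the lookup is eq_ref.
SET ndb_join_pushdown = 0;   -- SPJ off: the join is executed in the mysqld
SELECT t1.a, t2.b FROM t1 JOIN t2 ON t2.pk = t1.ref_id;

SET ndb_join_pushdown = 1;   -- SPJ on: the join is pushed to the data nodes
SELECT t1.a, t2.b FROM t1 JOIN t2 ON t2.pk = t1.ref_id;

-- Compare the plans for the two runs:
EXPLAIN SELECT t1.a, t2.b FROM t1 JOIN t2 ON t2.pk = t1.ref_id;
```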

Feedback:
  • Send us the EXPLAINs of the queries where SPJ did not help, or
  • Let us know if this feature was helpful for your application.
  In either case, please include:
  • What kind of speedup (or slowdown) was observed.
  • What cluster configuration/hardware was used.
Please send your feedback to us: spj-feedback (at) sun (dot) com. Thank you!

Read more about SPJ:

Tuesday, April 20, 2010

Tuning your Cluster with ndbinfo (7.1) part 1 of X

The new ndbinfo interface in 7.1 is really useful for tuning MySQL Cluster. Here is an example (more will follow):

I started with a test where I inserted two BLOBs (1 KB + 1 KB) per record into one table.
Driving the load from one application with 16 threads (co-located with one mysqld; the two data nodes ran on separate computers), I reached about 6960 TPS, and the utilization of the redo buffers (controlled by the RedoBuffer parameter in config.ini) looked like this:

mysql> select * from ndbinfo.logbuffers;
+---------+----------+--------+----------+----------+--------+
| node_id | log_type | log_id | log_part | total    | used   |
+---------+----------+--------+----------+----------+--------+
|       3 | REDO     |      0 |        1 | 50331648 | 196608 |
|       3 | REDO     |      0 |        2 | 50331648 | 294912 |
|       3 | REDO     |      0 |        3 | 50331648 | 131072 |
|       3 | REDO     |      0 |        4 | 50331648 | 229376 |
|       4 | REDO     |      0 |        1 | 50331648 | 229376 |
|       4 | REDO     |      0 |        2 | 50331648 | 262144 |
|       4 | REDO     |      0 |        3 | 50331648 | 163840 |
|       4 | REDO     |      0 |        4 | 50331648 | 229376 |
+---------+----------+--------+----------+----------+--------+
8 rows in set (0.01 sec)
That is basically nothing: the buffers are nearly empty.
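Rather than eyeballing the raw byte counts, you can let the server compute the utilization percentage directly (this query uses only the columns shown above):

```sql
SELECT node_id, log_part, total, used,
       ROUND(100 * used / total, 1) AS pct_used
FROM ndbinfo.logbuffers
ORDER BY node_id, log_part;
```

In the run above, every buffer would show well under 1% used.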

I then increased the load and inserted 2 x 5120 B BLOBs per record (still 16 threads and one MySQL server), and ran at an insert speed of 4320 TPS:
mysql> select * from ndbinfo.logbuffers;
+---------+----------+--------+----------+----------+----------+
| node_id | log_type | log_id | log_part | total    | used     |
+---------+----------+--------+----------+----------+----------+
|       3 | REDO     |      0 |        1 | 50331648 | 11468800 |
|       3 | REDO     |      0 |        2 | 50331648 | 31522816 |
|       3 | REDO     |      0 |        3 | 50331648 | 42008576 |
|       3 | REDO     |      0 |        4 | 50331648 | 43057152 |
|       4 | REDO     |      0 |        1 | 50331648 | 14090240 |
|       4 | REDO     |      0 |        2 | 50331648 | 17432576 |
|       4 | REDO     |      0 |        3 | 50331648 | 10321920 |
|       4 | REDO     |      0 |        4 | 50331648 | 12615680 |
+---------+----------+--------+----------+----------+----------+

Above you can see that the redo buffers are now heavily used (the load is spread across the log parts, so it is hard to catch a moment when usage is even on all buffers). At this point the application started to throw the error "Got temporary error 1221 'REDO buffers overloaded (increase RedoBuffer)' from NDBCLUSTER (1297)".

I could now follow the instruction to increase the RedoBuffer, but would it help in this case?
No, no, and no.
The disk is too slow to keep up: it cannot write out the redo log at the same rate as the application generates it.

'iostat' gives:
$ iostat -kx 1

Device:       rrqm/s    wrqm/s   r/s       w/s   rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
cciss/c0d0      0.00      0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
cciss/c0d1      0.00  27796.00  0.00   1454.00    0.00  115196.00    158.45     12.03   8.25   0.66  95.30
dm-0            0.00      0.00  0.00  29270.00    0.00  117080.00      8.00    274.79   9.33   0.03  95.20
dm-1            0.00      0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-2            0.00      0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-3            0.00      0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-4            0.00      0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-5            0.00      0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00


Here you can see that the disks are heavily utilized (about 95 %util). This means I have two options if I want to sustain the 4320 TPS insert load:
  • Increase the number of data nodes (computers) so instead of having two computers, I should have four so that I spread the load across more hardware
  • Improve my disk subsystem (add better disks, e.g., 2-4 disk spindles to spread the load on), or put the REDO log on device cciss/c0d1 and the LCP on device cciss/c0d0.
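For reference, RedoBuffer is a per-data-node parameter set in the [ndbd default] section of config.ini. The value below is a sketch chosen to match the 50331648-byte "total" reported per log part by ndbinfo.logbuffers; as argued above, raising it does not fix a disk that cannot keep up:

```ini
[ndbd default]
# 48M = 50331648 bytes, which appears to correspond to the per-log-part
# "total" column in ndbinfo.logbuffers on this cluster
RedoBuffer=48M
```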
Could the CPU also have been a bottleneck in this case? No, it was not the issue. The CMVMI thread (one of the data node threads) spent 44.4% of its time polling data from the other nodes, and it reads quite large packets, which is why it was the heaviest CPU user among the data node threads:
5453 root      20   0 6594m 4.1g 6956 R 44.4 51.9   4:05.64 ndbmtd
5471 root      20   0 6594m 4.1g 6956 S 32.5 51.9   3:39.07 ndbmtd
5474 root      20   0 6594m 4.1g 6956 R 26.6 51.9   2:25.55 ndbmtd
5475 root      20   0 6594m 4.1g 6956 S 23.7 51.9   2:25.01 ndbmtd
5476 root      20   0 6594m 4.1g 6956 R 23.7 51.9   2:20.83 ndbmtd
5473 root      20   0 6594m 4.1g 6956 R 21.7 51.9   2:26.57 ndbmtd

Wednesday, April 07, 2010

MySQL Cluster - BLOB performance and other things

At the UC 2010 I will give a session on MySQL Cluster Performance Tuning. The session will address many of the most common performance problems I see in my day-to-day work with Cluster. A small excerpt from the talk follows; many other topics will be covered in the session (JOINs, schema optimization, batching, indexes, parameter tuning, etc.). I hope to see you there!

First a bit on ndb_autoincrement_prefetch_sz and then blobs!

ndb_autoincrement_prefetch_sz

One common problem is contention on auto_increment allocation, which can really slow down INSERT performance.
By default, ndb_autoincrement_prefetch_sz=1. This means the mysqld caches one auto_increment number and then makes a round-trip to the data nodes to fetch the next one. It is better to let the mysqld cache more numbers and avoid the unnecessary round-trips.

From MySQL Cluster 7.0.13 you can set it as high as 65536 (previously the maximum was 256).

With ndb_autoincrement_prefetch_sz=1024 the mysqld will cache 1024 numbers before fetching the next range of numbers from the data nodes.
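It is a regular mysqld system variable, so it can be set at runtime or in my.cnf; for example:

```sql
-- Per-mysqld setting; each mysqld caches its own range of values.
SET GLOBAL ndb_autoincrement_prefetch_sz = 1024;
-- or for the current session only:
SET SESSION ndb_autoincrement_prefetch_sz = 1024;
```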

Some numbers for inserting batches of 16 records from 8 concurrent threads on one mysqld:
ndb_autoincrement_prefetch_sz=1:    1211.91 TPS
ndb_autoincrement_prefetch_sz=256:  3471.71 TPS
ndb_autoincrement_prefetch_sz=1024: 3659.52 TPS
Here we got a 3x improvement in INSERT performance. Nice!

This test is by no means trying to max out the cluster, far from it, just to illustrate how important it is to set the ndb_autoincrement_prefetch_sz.

BLOB/TEXT

Another problem area is BLOB/TEXT columns, which are overused in many applications (e.g., there is no reason to store an email address as TEXT, or to use a BLOB when the data is smaller than about 8000 B).

If you can change them to VARBINARY/VARCHAR (as LinuxJedi suggests), do it:

BLOBs/TEXTs are significantly slower than VARBINARY/VARCHAR, because BLOB data is stored in a separate table and must be locked with at least a shared lock when accessed. To measure the difference, I created:
CREATE TABLE `t1_blob` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `data1` blob,
  `data2` blob,
  PRIMARY KEY (`id`)
) ENGINE=ndbcluster;
and inserted 200K records, with sizeof(data1) = 1024 B and sizeof(data2) = 1024 B.
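For comparison, the VARBINARY variant of the table (the name t1_varbinary is mine; the original test does not show its definition) keeps the data in-row:

```sql
CREATE TABLE `t1_varbinary` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `data1` varbinary(1024),
  `data2` varbinary(1024),
  PRIMARY KEY (`id`)
) ENGINE=ndbcluster;
```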

SELECT data1, data2 FROM t1 WHERE id=<rand>

Executed from one application with 8 threads against one mysqld and two data nodes, the results were:

data1 and data2 represented as BLOBs:      5844 TPS
data1 and data2 represented as VARBINARYs: 19206 TPS

Again, this is not maxing out the data nodes, just to show you the idea.

Conclusion
If you can replace TEXT/BLOB with VARCHAR/VARBINARY (remember that the maximum record size in Cluster is 8052 B), do it.