Wednesday, November 03, 2010

DiskPageBufferMemory tuning and disk data statistics in MySQL Cluster 7.1.9

From MySQL Cluster 7.1.9 (not yet released) it is possible to get better statistics on disk data tables. In fact, these statistics make it possible to tune the DiskPageBufferMemory parameter (similar to the InnoDB buffer pool) in order to avoid disk seeks. It is much (understatement) faster to fetch data from the DiskPageBufferMemory than from disk.

Here is an example/tutorial on how to use this information and how to check the hit ratio of the DiskPageBufferMemory. Next time, I will explain the other counters you can get from ndbinfo.diskpagebuffer.

Finally, no more educated guesswork is needed.

Let's take an example.

I have a table t1 with 650000 records:
CREATE TABLE `t1` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`data1` varchar(512) DEFAULT NULL,
`data2` varchar(512) DEFAULT NULL,
PRIMARY KEY (`id`)
) TABLESPACE ts_1 STORAGE DISK ENGINE=ndbcluster;
data1 and data2 are non-indexed columns, so they will be stored on disk.
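
The table above assumes an existing disk data tablespace ts_1. For completeness, here is a minimal sketch of how such a tablespace could be created (the file names and sizes are illustrative, not taken from my setup):

CREATE LOGFILE GROUP lg_1
  ADD UNDOFILE 'undo_1.log'   -- illustrative file name
  INITIAL_SIZE 128M
  ENGINE NDBCLUSTER;

CREATE TABLESPACE ts_1
  ADD DATAFILE 'data_1.dat'   -- illustrative file name
  USE LOGFILE GROUP lg_1
  INITIAL_SIZE 512M
  ENGINE NDBCLUSTER;
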
Then I do random reads on this table with bencher:
 src/bencher -r 100000 -t 1 -q \
"SET @x=FLOOR(1 + (RAND() * 650000)); SELECT * FROM t1 WHERE id=@x"
DiskPageBufferMemory=64M
The initial performance, 340 seconds into the run:
Thread 0 - 274 tps (average tps measured after 335 secs)
Thread 0 - 273 tps (average tps measured after 340 secs)
and the hit ratio:
mysql> SELECT
         node_id,
         100*(sum(page_requests_direct_return)/
              (sum(page_requests_direct_return)+sum(page_requests_wait_io))) as hit_ratio_pct
       FROM ndbinfo.diskpagebuffer GROUP BY node_id;
+---------+---------------+
| node_id | hit_ratio_pct |
+---------+---------------+
|       3 |       93.3577 |
|       4 |       93.6565 |
+---------+---------------+
2 rows in set (0.01 sec)
A hit ratio of ~93% - not so great.

So let's increase the DiskPageBufferMemory (I am using the configurator scripts for this):
# change DiskPageBufferMemory to 512M in config.ini
vi ../config/config.ini
# do a rolling restart of the data nodes
./rolling-restart.sh --layer=storage
## then sit back and wait for the data nodes to be rolled..
hostx: (324s elapsed) Node 3: starting (Last completed phase 100) (mysql-5.1.51 ndb-7.1.9)
hostx: Data node 3 is STARTED
## then wait for node 4..
hosty: (98s elapsed) Node 4: starting (Last completed phase 4) (mysql-5.1.51 ndb-7.1.9)

Increasing DiskPageBufferMemory to 512M
Of course, while the rolling restart is in progress, the same traffic keeps hitting the cluster. Let's check what the hit ratio is now (while node 3 is restarting).

695 seconds into the run we have:
mysql>  "see above"
+---------+---------------+
| node_id | hit_ratio_pct |
+---------+---------------+
|       4 |       91.2639 |
+---------+---------------+
1 row in set (0.00 sec)
and perf is...
Thread 0 - 200 tps (average tps measured after 690 secs)
Thread 0 - 200 tps (average tps measured after 695 secs)
not good at all, but that is expected while node 3 is still restarting.

After increasing DiskPageBufferMemory to 512M

When the rolling restart has completed, ./rolling-restart.sh prints out:
Cluster: Cluster has performed a rolling restart  
The DiskPageBufferMemory has now been changed to 512M.

Let's check what hit ratio we get:
mysql>  "see above"
+---------+---------------+
| node_id | hit_ratio_pct |
+---------+---------------+
|       3 |       98.2487 |
|       4 |       98.6386 |
+---------+---------------+
2 rows in set (0.01 sec)
Hit ratio is picking up and performance as well:
Thread 0 - 1141 tps (average tps measured after 1310 secs)
Thread 0 - 1149 tps (average tps measured after 1315 secs)
Now, we just have to wait for a while and let the system enter a steady state.
Then we can check back on the hit ratio and perhaps increase the DiskPageBufferMemory even more.

If you have a 99.5 -- 99.9% hit ratio, I would say you are good.
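
When judging whether to add even more DiskPageBufferMemory, it can also help to look at the absolute miss counts, not just the ratio. This is a small variant of the query above (same ndbinfo.diskpagebuffer columns, just exposing the raw numbers):

SELECT node_id,
       SUM(page_requests_wait_io)       AS misses,
       SUM(page_requests_direct_return) AS hits,
       100*SUM(page_requests_direct_return)/
         (SUM(page_requests_direct_return)+SUM(page_requests_wait_io)) AS hit_ratio_pct
FROM ndbinfo.diskpagebuffer
GROUP BY node_id;

Roughly speaking, every miss means waiting for a page to be read from disk, so the misses column gives a feel for how much disk I/O is still going on.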

1780 seconds into the run we have:
+---------+---------------+
| node_id | hit_ratio_pct |
+---------+---------------+
|       3 |       99.2035 |
|       4 |       99.5251 |
+---------+---------------+
2 rows in set (0.00 sec)

and
Thread 0 - 1669 tps (average tps measured after 1775 secs)
Thread 0 - 1674 tps (average tps measured after 1780 secs)
... wow .. tuning the DiskPageBufferMemory really helped. And 3300 secs into the run, perf and hit ratio are still climbing:
Thread 0 - 2329 tps (average tps measured after 3305 secs)
Thread 0 - 2330 tps (average tps measured after 3310 secs)

Good luck!

Thursday, September 02, 2010

Cluster - spins/contentions and thread stuck in..

I get a number of questions about contentions/"stuck in..". So here is some explanation of:
  • Contention
  • Thread Stuck in
  • What you can do about it
In 99% of the cases, the contentions written out in the out file of the data nodes (ndb_X_out.log) are nothing to pay attention to.

sendbufferpool waiting for lock, contentions: 6000 spins: 489200
sendbufferpool waiting for lock, contentions: 6200 spins: 494721


Each spin is a read from the L1 cache (4 cycles on a Nehalem at 3.2GHz, so about a nanosecond):
1 spin = 1.25E-09 seconds (1.25 ns)

In the above we have:
(494721 - 489200) / (6200 - 6000) ≈ 27 spins/contention
Time spent on a contention = 27 x 1.25E-09 = 3.375E-08 seconds (0.03375 us)

So we don't have a problem..

Another example. Here is a lock guarding a job buffer (JBA = JobBuffer A; in short, it handles signals for heartbeats and some other small things, while the rest of the traffic goes over JobBuffer B):

jbalock thr: 1 waiting for lock, contentions: 145000 spins: 3280892543
jbalock thr: 1 waiting for lock, contentions: 150000 spins: 3403539479


(3403539479 - 3280892543) / (150000 - 145000) ≈ 24529 spins/contention
Time spent on a contention = 24529 x 1.25E-09 = 3.06613E-05 seconds (30.66 us)

This is a bit higher than I would have expected, and I think more analysis is needed. However, I tend not to get these contentions on a busy system.
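
Just to make the arithmetic explicit, the same calculations can be done directly in the mysql client (the numbers are taken from the log lines above, and one spin is assumed to cost 1.25e-9 seconds as discussed):

SELECT (494721 - 489200) / (6200 - 6000)                       AS sendbuffer_spins_per_contention,
       (3403539479 - 3280892543) / (150000 - 145000)           AS jbalock_spins_per_contention,
       (3403539479 - 3280892543) / (150000 - 145000) * 1.25e-9 AS jbalock_seconds_per_contention;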

Ndb kernel thread X is stuck in ..

Ndb kernel thread 4 is stuck in: Job Handling elapsed=100
Watchdog: User time: 82 System time: 667
Ndb kernel thread 4 is stuck in: Job Handling elapsed=200
Watchdog: User time: 82 System time: 668
Ndb kernel thread 4 is stuck in: Job Handling elapsed=300
Watchdog: User time: 82 System time: 669
Ndb kernel thread 4 is stuck in: Job Handling elapsed=400
Watchdog: User time: 82 System time: 670

The important thing here is to look at how User time and System time behave.
If User time is constant (as it is here, 82 ms) but System time is growing (667, 668, and so on), it indicates that the OS kernel is busy.
Slow network? Sub-optimal kernel version? NIC drivers? Swapping? Some kernel process using too much CPU?

If User time is growing it is probably because the ndb kernel is overloaded.

What can you do about this?
  • In config.ini (see the sketch after this list):
    RealtimeScheduler=1

    LockExecThreadToCPU=[cpuids]
  • check that cpuspeed is not running (yum remove cpuspeed)
  • .. and finally ask us to optimize more!
Also, pay attention to whether you get the contentions on an idle cluster or a busy cluster.
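
As a rough illustration of the config.ini settings mentioned in the list above (the section placement and CPU ids are just an example; pick cores that are not used by other processes on the host):

[ndbd default]
# run the ndb kernel threads with realtime priority
RealtimeScheduler=1
# bind the execution threads to specific CPU ids (example ids)
LockExecThreadToCPU=0,1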

Tuesday, April 20, 2010

Tuning your Cluster with ndbinfo (7.1) part 1 of X

The new ndbinfo interface in 7.1 is really useful to assist in tuning MySQL Cluster. Here is an example (more will follow):

I started with one test where I inserted two blobs (1KB + 1KB) into one table.
From 16 threads (colocated with one mysqld, two data nodes, separate computers) and one application driving the load, I reached about 6960 TPS, and the utilization of the redo log buffers (controlled by the RedoBuffer parameter in config.ini) looked like:

mysql> select * from ndbinfo.logbuffers;
+---------+----------+--------+----------+----------+--------+
| node_id | log_type | log_id | log_part | total    | used   |
+---------+----------+--------+----------+----------+--------+
|       3 | REDO     |      0 |        1 | 50331648 | 196608 |
|       3 | REDO     |      0 |        2 | 50331648 | 294912 |
|       3 | REDO     |      0 |        3 | 50331648 | 131072 |
|       3 | REDO     |      0 |        4 | 50331648 | 229376 |
|       4 | REDO     |      0 |        1 | 50331648 | 229376 |
|       4 | REDO     |      0 |        2 | 50331648 | 262144 |
|       4 | REDO     |      0 |        3 | 50331648 | 163840 |
|       4 | REDO     |      0 |        4 | 50331648 | 229376 |
+---------+----------+--------+----------+----------+--------+
8 rows in set (0.01 sec)
Which is basically nothing.

I then increased the load and inserted 2 x 5120B BLOBs (from 16 threads and one MySQL server), and ran with an insert speed of 4320 TPS:
mysql> select * from ndbinfo.logbuffers;
+---------+----------+--------+----------+----------+----------+
| node_id | log_type | log_id | log_part | total    | used     |
+---------+----------+--------+----------+----------+----------+
|       3 | REDO     |      0 |        1 | 50331648 | 11468800 |
|       3 | REDO     |      0 |        2 | 50331648 | 31522816 |
|       3 | REDO     |      0 |        3 | 50331648 | 42008576 |
|       3 | REDO     |      0 |        4 | 50331648 | 43057152 |
|       4 | REDO     |      0 |        1 | 50331648 | 14090240 |
|       4 | REDO     |      0 |        2 | 50331648 | 17432576 |
|       4 | REDO     |      0 |        3 | 50331648 | 10321920 |
|       4 | REDO     |      0 |        4 | 50331648 | 12615680 |
+---------+----------+--------+----------+----------+----------+

Above you can see that the redo log buffers are heavily used (the load is spread around, and it is hard to catch a moment where the load is even across all buffers), and at this point the application started to throw the error "Got temporary error 1221 'REDO buffers overloaded (increase RedoBuffer)' from NDBCLUSTER (1297)".
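
A convenient way to watch how close the buffers are to overload is to express used as a percentage of total, using the same ndbinfo.logbuffers columns as above:

SELECT node_id, log_part, total, used,
       100*used/total AS used_pct
FROM ndbinfo.logbuffers
WHERE log_type = 'REDO'
ORDER BY used_pct DESC;

Values approaching 100% on any log part mean that error 1221 is not far away.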

I can now follow the instruction and increase the RedoBuffer, but would it help in this case?
No, no and no.
The disk is too slow to keep up; it cannot write out the redo log at the same rate as the application writes data.

'iostat' gives:
$ iostat -kx 1

Device:      rrqm/s    wrqm/s   r/s       w/s   rkB/s       wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
cciss/c0d0     0.00      0.00  0.00      0.00    0.00        0.00      0.00      0.00   0.00   0.00   0.00
cciss/c0d1     0.00  27796.00  0.00   1454.00    0.00   115196.00    158.45     12.03   8.25   0.66  95.30
dm-0           0.00      0.00  0.00  29270.00    0.00   117080.00      8.00    274.79   9.33   0.03  95.20
dm-1           0.00      0.00  0.00      0.00    0.00        0.00      0.00      0.00   0.00   0.00   0.00
dm-2           0.00      0.00  0.00      0.00    0.00        0.00      0.00      0.00   0.00   0.00   0.00
dm-3           0.00      0.00  0.00      0.00    0.00        0.00      0.00      0.00   0.00   0.00   0.00
dm-4           0.00      0.00  0.00      0.00    0.00        0.00      0.00      0.00   0.00   0.00   0.00
dm-5           0.00      0.00  0.00      0.00    0.00        0.00      0.00      0.00   0.00   0.00   0.00


And here you can see that the disks are quite utilized. This means that I have two options now if I want to be able to sustain the 4320 TPS insert load:
  • Increase the number of data nodes (computers): instead of having two computers, I should have four, so that the load is spread across more hardware.
  • Improve my disk subsystem (add better disks, e.g. 2-4 disk spindles to spread the load on), or put the REDO log on device cciss/c0d1 and the LCP on device cciss/c0d0.
The CPU, could that also have been a bottleneck in this case? No, it was not the issue. The CMVMI thread (one of the data node threads) was spending 44.4% of its time polling data from the other nodes, and it reads in quite large packets, which is why it was the heaviest CPU user among the data node threads.
 5453 root      20   0 6594m 4.1g 6956 R 44.4 51.9   4:05.64 ndbmtd
 5471 root      20   0 6594m 4.1g 6956 S 32.5 51.9   3:39.07 ndbmtd
 5474 root      20   0 6594m 4.1g 6956 R 26.6 51.9   2:25.55 ndbmtd
 5475 root      20   0 6594m 4.1g 6956 S 23.7 51.9   2:25.01 ndbmtd
 5476 root      20   0 6594m 4.1g 6956 R 23.7 51.9   2:20.83 ndbmtd
 5473 root      20   0 6594m 4.1g 6956 R 21.7 51.9   2:26.57 ndbmtd
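
For reference, a per-thread view like the one above can be obtained with something along these lines (assuming a single ndbmtd process on the host):

# show per-thread CPU usage for the ndbmtd data node process
top -H -p $(pidof ndbmtd)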