Tuesday, March 31, 2009

MySQL Cluster @ UC

The UC is closing in fast and there are a number of (potentially) interesting talks around MySQL Cluster.
Personally, I will give two talks (sessions):
  • MySQL Cluster 6.4 (7.0) - New Features
    I will talk about ONLINE ADD NODE, the multi-threaded data node, and the rest of the things that are new in 7.0. Please note that MySQL Cluster 6.4 was renamed 7.0, but the session title wasn't updated.
    When and Where: Ballroom A at 05:15pm, Tuesday, 04/21/2009
  • LDAP for MySQL Cluster - back-ndb
    Together with Howard Chu, CTO of Symas, I will present OpenLDAP for MySQL Cluster. Check out this blog post for some basic information about OpenLDAP for MySQL Cluster. We will talk about scalability, performance, limitations, and so on.
    When and Where: Ballroom B at 02:00pm, Thursday, 04/23/2009
A list of talks about MySQL Cluster can be found here!

LDAP and MySQL Cluster

There are two LDAP Directory Servers available that support MySQL Cluster - OpenLDAP (supported and maintained by Symas Corp.) and OpenDS (Sun Microsystems). Both have implemented a back-end called back-ndb that talks directly to the data nodes. This means that they use the NDBAPI directly to access data in the cluster, bypassing the MySQL Server.

Using MySQL Cluster as the back-end makes it possible to scale out the LDAP layer easily, without using replication between LDAP servers. If you need more capacity in the LDAP layer, add another LDAP server (online, no service interruption); if you need more storage capacity, add data nodes (also online, no service interruption). This offers incredible scalability, and no single point of failure.

But since the data is stored in MySQL Cluster, you can also use SQL to access the data, simultaneously with the LDAP accesses. See the picture below.
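
For example, once some entries have been loaded through LDAP, something like the following works from any mysqld connected to the cluster. The table and column names are only assumptions about how back-ndb maps object classes and attributes - they are not the actual back-ndb schema:

-- Purely hypothetical: read LDAP-loaded data with plain SQL while the LDAP servers
-- keep serving the same data. Table and column names are assumed, not the real mapping.
SELECT uid, cn, mail
FROM   inetOrgPerson
WHERE  uid = 'jsmith';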



OpenDS actually uses ndb-bindings, a Java layer (JNI wrapper) on top of the C++ NDBAPI, while OpenLDAP uses the C++ NDBAPI directly. The use of these direct APIs makes a good foundation for high performance.

Besides that, there are the usual differences in functionality you can expect from two different Directory Servers (I am not going into details about that here), but both implementations have a similar layout when it comes to the data model used inside MySQL Cluster to support LDAP accesses. This similarity comes from the fact that I have had my fingers in both implementations :) . However, I must say that I know the OpenLDAP implementation better.

If you want more details you should come to the MySQL User Conference 2009, where I will have a session together with Howard Chu on OpenLDAP for MySQL Cluster. There will also be a BoF for OpenDS (exactly when is TBD, since the BoFs had not yet been scheduled when this post was written, on the 31st of March 2009), so keep your eyes open!

If you are interested in getting started with OpenLDAP for MySQL Cluster, then you can follow this quickstart guide.

I have also got lualdap to work with OpenLDAP for MySQL Cluster. Quite cool, since the LDAP server can then be accessed from MySQL Proxy (thanks to Stephane Varoqui, Star Consultant at MySQL/Sun based in Paris, for this idea). This opens up some interesting possibilities when it comes to authentication.

Thursday, March 19, 2009

online add node - preliminary numbers (2)

Scenario: Grow from two data nodes to four using ONLINE ADD NODE and REORGANIZE PARTITION of a 2GB table, while serving UPDATE traffic

Setup for each data node:
* Two quad-core CPUs running at 2.33GHz.
* 8GB of RAM
* DataMemory=4096MB
* MaxNoOfExecutionThreads=8
* Configuration generated by Configurator
* Bound execution threads to cores (LockExec/Maint....Thread..)
* Completely distributed setup.

Created the following table (different from last time, to simplify further tests) and filled it with 2M rows, which gives a table of 2GB in size:

CREATE TABLE `t1` (
`a` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`b` int(10) unsigned DEFAULT NULL,
`c` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`d` varbinary(768) DEFAULT NULL,
`e` varbinary(256) DEFAULT NULL,
PRIMARY KEY (`a`),
KEY `c` (`c`)
) ENGINE=ndbcluster
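
The post does not say how the rows were generated. Purely as an illustration, one way to reach roughly 2M rows of this shape is to seed a single row and then repeatedly double the table with INSERT ... SELECT:

-- Seed one row; the payload columns are filled to their declared widths.
INSERT INTO t1 (b, d, e)
VALUES (1,
        RPAD(UNHEX(MD5(RAND())), 768, UNHEX(MD5(RAND()))),   -- 768 bytes for d
        RPAD(UNHEX(MD5(RAND())), 256, UNHEX(MD5(RAND()))));  -- 256 bytes for e

-- Double the row count; repeat about 21 times to reach ~2M rows.
INSERT INTO t1 (b, d, e) SELECT b, d, e FROM t1;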


Baseline (two data nodes)
Random updates of 256B:
UPDATE t1 SET e=<data> WHERE a=<random_value>
where <data> is 256 random bytes and <random_value> is a random value between 1 and 2M
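
For reference, here is a runnable approximation of one such statement expressed purely in SQL; in the actual run, bencher generated the random key and the 256 random bytes on the client side:

-- Illustration only: pick a random key in 1..2,000,000 and overwrite e with
-- 256 pseudo-random bytes (bencher did both of these client-side).
SET @k := FLOOR(1 + RAND() * 2000000);
UPDATE t1
   SET e = RPAD(UNHEX(MD5(RAND())), 256, UNHEX(MD5(RAND())))
 WHERE a = @k;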

Running with bencher-0.12:
src/bencher -t 20 -l 100000
20 threads
Total throughput = 18315.06 qps
Average response time=1089us for each update

CPU utilization details are at the end of this post (please note that I was not trying to max out the data nodes with this test).

With reorg on + no load
mysql> alter table t1 reorganize partition;
Query OK, 0 rows affected (2 min 44.58 sec)
Records: 0 Duplicates: 0 Warnings: 0



With UPDATEs and ongoing reorg
ongoing reorg means: ALTER TABLE t1 REORGANIZE PARTITION;
20 threads each doing updates as in baseline
Total throughput = 10907.70 qps
Average response time=1827us for each update
And the reorg took:

mysql> alter table t1 reorganize partition;
Query OK, 0 rows affected (3 min 39.45 sec)
Records: 0 Duplicates: 0 Warnings: 0

Cost in performance during reorg
Throughput decreases by about 40% during the reorg (from 18315.06 to 10907.70 qps).
Latency increases by about 68% during the reorg (from 1089 us to 1827 us per update).


CPU DETAILS (baseline)



CPU DETAILS (with UPDATE + reorg)

online add node - preliminary numbers

I want to measure the time it takes to do various ONLINE ADD NODE operations in MySQL Cluster 7.0, and especially how long it takes to repartition the data onto the newly added nodes. So here comes the first post on the subject.

First off: go from two data nodes to four using ONLINE ADD NODE and REORGANIZE PARTITION of a 2GB table, with no traffic on the system.

Setup for each data node:
  • Two quad-core CPUs running at 2.33GHz.
  • 8GB of RAM
  • DataMemory=4096MB
  • MaxNoOfExecutionThreads=8
  • Configuration generated by Configurator
  • Bound execution threads to cores (LockExec/Maint....Thread..)
  • Completely distributed setup.
Created the following table and filled it with 3M rows, which gives a table of 2GB in size:

CREATE TABLE `t1` (
`a` bigint(20) unsigned NOT NULL,
`b` int(10) unsigned DEFAULT NULL,
`c` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`d` varbinary(1024) DEFAULT NULL,
PRIMARY KEY (`a`),
KEY `c` (`c`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1

Running add_node.sh to add two more data nodes (5 and 6):

ndb_mgm> all status
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.1.32 ndb-7.0.4)
Node 4: started (mysql-5.1.32 ndb-7.0.4)
Node 5: not connected
Node 6: not connected

Running the add_node.sh script (no traffic to the database) took 14m48.535s. The bulk of the time was spent in the rolling restart of data nodes 3 and 4 - there is another way of doing this part without a rolling restart, but it requires initial planning; I will revisit that soon.

After the add_node.sh script finished:
ndb_mgm> all status
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.1.32 ndb-7.0.4)
Node 4: started (mysql-5.1.32 ndb-7.0.4)
Node 5: started (mysql-5.1.32 ndb-7.0.4)
Node 6: started (mysql-5.1.32 ndb-7.0.4)


Nice! But nodes 5 and 6 do not yet have any data. We need to run alter table t1 reorganize partition; to redistribute the data across all four data nodes.

mysql> alter table t1 reorganize partition;
Query OK, 0 rows affected (4 min 7.97 sec)
Records: 0 Duplicates: 0 Warnings: 0

Moreover, alter table ... reorganize partition; is quite resource intensive (see the picture at the end), so adding nodes online is only recommended at off-peak times.

Next I will look into how much reads/writes are affected by the reorg.


Verification
To verify that the data has been reorganized correctly, we can use ndb_desc.
Before adding nodes and ALTER TABLE ... REORGANIZE PARTITION;

[root@computer1 scripts]# ndb_desc -p -d test t1
-- t1 --
...
Version: 2
...
TableStatus: Retrieved
-- Attributes --
a Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR
b Unsigned NULL AT=FIXED ST=MEMORY
c Timestamp NOT NULL AT=FIXED ST=MEMORY
d Longvarbinary(1024) NULL AT=MEDIUM_VAR ST=MEMORY

-- Indexes --
PRIMARY KEY(a) - UniqueHashIndex
c(c) - OrderedIndex
PRIMARY(a) - OrderedIndex

-- Per partition info --
Partition Row count Commit count Frag fixed memory Frag varsized memory
0 374952 374952 16613376 261521408
2 375146 375146 16613376 261685248
4 375351 375351 16613376 261816320
6 374979 374979 16613376 261357568
3 374368 374368 16580608 261685248
5 374641 374641 16580608 261816320
1 375378 375378 16613376 261816320
7 375185 375185 16613376 261685248

NDBT_ProgramExit: 0 - OK

Each partition has about 375000 rows.

After ALTER TABLE ... REORGANIZE PARTITION;
Verifying that the data is distributed as expected (check column 2, "row count"):

[root@computer1 scripts]# ndb_desc -p -d test t1
-- t1 --
...
Version: 16777217
...
-- Attributes --
a Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY
b Unsigned NULL AT=FIXED ST=MEMORY
c Timestamp NOT NULL AT=FIXED ST=MEMORY
d Longvarbinary(1024) NULL AT=MEDIUM_VAR ST=MEMORY

-- Indexes --
PRIMARY KEY(a) - UniqueHashIndex
c(c) - OrderedIndex
PRIMARY(a) - OrderedIndex

-- Per partition info --
Partition Row count Commit count Frag fixed memory Frag varsized memory
0 187296 750264 16613376 261521408
2 187426 750586 16613376 261685248
4 188058 749937 16613376 261816320
6 187238 750461 16613376 261357568
1 187847 750440 16613376 261816320
3 187602 747900 16580608 261685248
5 187392 749139 16580608 261816320
7 187610 750335 16613376 261652480
8 187656 187656 8323072 131006464
12 187293 187293 8290304 130613248
10 187720 187720 8323072 130842624
14 187741 187741 8323072 130973696
9 187531 187531 8323072 130809856
13 187249 187249 8290304 130875392
11 186766 186766 8290304 130678784
15 187575 187575 8323072 130875392

Each partition has about 187000 rows.
As you can see above, the newly added partitions (8-15) and the previous partitions (0-7) now have more or less the same row count.

CPU UTIL

Below is a picture showing the CPU utilization during the reorg:


And for each thread:

Wednesday, March 18, 2009

MySQL Cluster 7.0.4 - online add node scripts

MySQL Cluster 7.0.4 (beta, formerly known as 6.4) was released yesterday in source format. The new version fixes a number of bugs in, for example, the areas of online add node and partition pruning for some index scans.

Moreover, the Configurator has been updated (and all the scripts have got improved error handling and usability) to support adding nodes with a procedure according to the docs.

To add a new node group using the Configurator, do the following:
  1. Make sure you have a MySQL Cluster 7.0 running.
  2. cd mysqlcluster-70/cluster/scripts
  3. Add the two new data nodes to the config.ini
    • vi ../config/config.ini
    • There are already two placeholder (commented-out) data node entries with Id=5 and Id=6 (if you have defined a two-data-node cluster in the Configurator).
    • Remove the # at the start of the line.
    • Fill in the hostname in Hostname=
      e.g, you have added the two hostnames "A" and "B"
  4. sh ./add_nodes.sh A,B
    • The script will restart all the nodes (management servers, mysql servers, and data nodes), copy out binaries to the added nodes, start them, and finally create a new node group.
    • For a two-node cluster, this step takes about 4 minutes and 50 seconds.
  5. Now you have to run ALTER ONLINE TABLE table_name REORGANIZE PARTITION on all your NDB tables (see the sketch after this list).
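
One possible convenience for step 5 (just a sketch; the Configurator does not do this for you): generate the ALTER statements from information_schema and then run them one by one:

-- Generate one ALTER ONLINE TABLE ... REORGANIZE PARTITION statement per NDB table.
SELECT CONCAT('ALTER ONLINE TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
              '` REORGANIZE PARTITION;') AS stmt
FROM information_schema.TABLES
WHERE ENGINE = 'ndbcluster';

-- For example, for the table used in the earlier posts:
ALTER ONLINE TABLE test.t1 REORGANIZE PARTITION;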

Monday, March 16, 2009

Memory (de)allocation,(de)fragmentation, and chkfrag!

(this blog post applies to MySQL Cluster 6.3 and MySQL Cluster 6.4/7.0)

A while ago I discussed Memory Allocation in MySQL Cluster. Here it comes again, but this time we also discuss fragmentation and a tool that helps analyze whether there is fragmentation.

The main problem I want to address here is: "I filled up 100% of my DataMemory, then deleted 40%, and I cannot create a new table!!" Now, this is not such a common problem, but it can happen in systems with a lot of inserts/deletes (especially on VAR* columns), and it can be an issue if you need to create a new table. If the allocation patterns always hit the same set of tables, then this is less of an issue.

Allocation
  1. If the table is empty and there is an insert - one page will be allocated from the page pool (DataMemory) and associated with that table.
  2. Subsequent inserts will fill up the data page until it is full.
  3. NEW: Attributes in a table are divided into a fixed-size part and a variable-size part (if the table has VAR* or DYNAMIC attributes). The fixed-size part is stored together with the record header and holds a reference to the VAR*/DYNAMIC part. The VAR*/DYNAMIC part and the FIXED part are stored on different pages and are never mixed! (See the sketch after this list.)
  4. The allocator allocates one page at a time!
  5. If there is a hole on an already allocated page, and this hole fits the data that should be inserted, then this hole will be used.
  6. Point 5 means that space within a page that is associated with a table will be reused.
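
To illustrate point 3, here is a made-up table: the BIGINT and INT columns belong to the fixed-size part (stored on FIXED pages together with the record header), while the VARBINARY column is stored on separate VAR pages.

-- Hypothetical table, only to illustrate the FIXED vs. VAR* page split:
CREATE TABLE mem_demo (
  a BIGINT UNSIGNED NOT NULL PRIMARY KEY,  -- fixed-size part
  b INT UNSIGNED,                          -- fixed-size part
  d VARBINARY(1024)                        -- variable-size part, on separate VAR pages
) ENGINE=ndbcluster;
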
Deallocation
  • From MySQL Cluster 6.3 there is per-page de-allocation, so a page that becomes empty after DELETEs will be released back to the DataMemory page pool.
  • TRUNCATE/DROP TABLE will release the pages to the DataMemory page pool.
Fragmentation/Defragmentation
  • Pages become fragmented as you insert/delete records.
  • OPTIMIZE TABLE can help to defragment pages within a table, and free (empty) pages will be put on the page pool (see Deallocation above).
  • In MySQL Cluster, a rolling restart of the data nodes will not defragment the FIXED-size pages, but the VAR*/DYNAMIC pages will be defragmented.
But how do you know if you have fragmentation or not? A new tool, chkfrag (beta) can help you.

Back to the problem: "I filled up 100% of my DataMemory, then deleted 40% and I cannot create a new table!!". Why is that?

Pretend that we have a MySQL Cluster up and running. We have only four pages and all of them are in use (figure 1):


We cannot create a new table at this stage, because we don't have any free pages that can be used to store records.

Now we delete 40% of the records. This frees up space on the pages (shown in figure 2). We still cannot create a new table, but we can of course insert new records into t1, provided they fit. OPTIMIZE TABLE can help to reduce fragmentation.
In its current form (and for the foreseeable future), OPTIMIZE TABLE will only compact the VAR* pages; fixed-size pages will not be compacted. And frankly, it is not great at defragmentation either. A number of ifs and buts about record sizes etc. make it rather inefficient in most cases. It would be good to have the option to run it more aggressively. Figure 3 shows what it could look like after OPTIMIZE TABLE.
Currently, the most efficient defragmentation method is simply to perform a rolling restart (figure 4) - and chkfrag lets you know if you need to!
Example
Below follows an example of fragmentation and how chkfrag can help:

CREATE TABLE `t5` (
`id` varchar(32) CHARACTER SET utf8 NOT NULL,
`name` varchar(32) NOT NULL DEFAULT '',
`unit` int(11) DEFAULT NULL,
PRIMARY KEY (`id`,`name`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1

Then we use hugoLoad (a test program that isn't built by default in the source distribution of MySQL Cluster) to load it with 1M records.

./hugoLoad -r 1000000 -d test t5

Checking fragmentation:
johan@homecomputer:~/dev/chkfrag$ src/chkfrag
Gathering information how much DataMemory is being used
Used DataMemory calculated: 338MB 10816 pages (32KB)
Used DataMemory reported: 324MB 10368 pages (32KB)
Total DataMemory reported: 512MB 16384 pages (32KB)
Percentage of VAR* in database: 42.01 percent
There is no fragmentation!

In this case there is a discrepancy between the calculated DataMemory and the reported DataMemory.
That is OK: when the difference is multiplied by the percentage of VAR* in the database, the estimated fragmentation is too small to matter, i.e., less than 10MB. By the way, that 10MB threshold is taken out of the blue, so there is no science behind it.

Next we delete 200000 random records.
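
How the random deletes were issued is not shown here; purely as an illustration, they could be done in SQL like this:

-- Illustration only: delete 200000 rows chosen at random from t5.
DELETE FROM t5 ORDER BY RAND() LIMIT 200000;

Checking the fragmentation again: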

Gathering information how much DataMemory is being used
Used DataMemory calculated: 270MB 8640 pages (32KB)
Used DataMemory reported: 324MB 10368 pages (32KB)
Total DataMemory reported: 512MB 16384 pages (32KB)
Percentage of VAR* in database: 41.85 percent
Fragmentation of VAR* in database: 16.67 percent
Possibly lost memory (locked in by tables): 22.60MB

Recommendation:
1. Run OPTIMIZE TABLE on tables with a lot of VARCHAR/VARBINARY.
If you are not happy with the result of this then do step 2)!
2. Perform a rolling restart.

Running the OPTIMIZE TABLE:

mysql> optimize table t5;
+---------+----------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+---------+----------+----------+----------+
| test.t5 | optimize | status | OK |
+---------+----------+----------+----------+
1 row in set (3 min 42.44 sec)

Checking the fragmentation:

Gathering information how much DataMemory is being used
Used DataMemory calculated: 270MB 8640 pages (32KB)
Used DataMemory reported: 324MB 10368 pages (32KB)
Total DataMemory reported: 512MB 16384 pages (32KB)
Percentage of VAR* in database: 41.85 percent
Fragmentation of VAR* in database: 16.67 percent
Possibly lost memory (locked in by tables): 22.60MB

Recommendation:
1. Run OPTIMIZE TABLE on tables with a lot of VARCHAR/VARBINARY.
If you are not happy with the result of this then do step 2)!
2. Perform a rolling restart.


Hmm... OPTIMIZE TABLE does not work so great here (I have, however, seen it reclaim some pages sometimes).
And after a rolling restart:

Gathering information how much DataMemory is being used
Used DataMemory calculated: 270MB 8640 pages (32KB)
Used DataMemory reported: 272MB 8704 pages (32KB)
Total DataMemory reported: 512MB 16384 pages (32KB)
Percentage of VAR* in database: 41.48 percent
Fragmentation of VAR* in database: 0.74 percent
Possibly lost memory (locked in by tables): 0.83MB

We have reclaimed 324MB-272MB=52MB!

Conclusion:
A DBA should never have to perform a rolling restart in order to defragment (and arguably not for any other reason either). OPTIMIZE TABLE should be improved, and perhaps get a "FULL" flag indicating how thorough it should be.

Sunday, March 01, 2009

bencher - a benchmarking utility for MySQL Cluster

bencher is a test program that allows you to benchmark requests on MySQL Cluster. I have used this utility a lot with customers, because it lets me:
  • specify a simple query that I want to benchmark on the command line
  • implement more complex use cases.
  • implement NDBAPI requests
and I don't have to reinvent the wheel every time. It is all there: connectivity, multi-threading support, timers, and some basic statistics, and it compiles on most platforms. I just have to focus on the queries I want to optimize or benchmark.

The simple use case is to specify the SQL query you want to benchmark, the number of threads, and how many iterations to run. You can also customize it very easily to benchmark more elaborate SQL requests, as well as NDBAPI requests.

bencher outputs per thread statistics and total throughput:

./src/bencher -s /tmp/mysql.sock.3306 -t 2 -l 10000 -q "select * from t1 limit 1"

------- Starting Benchmark ----------

Thread 1 - 638 qps (average qps measured after 5 secs)
Thread 0 - 631 qps (average qps measured after 5 secs)
Thread 1 - 680 qps (average qps measured after 10 secs)
Thread 0 - 679 qps (average qps measured after 10 secs)

------- Benchmark Finished ----------

Thread 0 - max: 83091 us, min 668 us, less than 5 ms: 9761 of 10000, avg: 1485 us, total time: 14949 ms, qps: 668.91
Thread 1 - max: 43743 us, min 578 us, less than 5 ms: 9770 of 10000, avg: 1475 us, total time: 14767 ms, qps: 677.16

Total throughput = 1346.08 qps
Average exec time per thread = 14.86 secs


You can also specify a "querytime-threshold" to see how many transactions executed under a certain time (the default is 5 ms). From the above you can see that, for this particular query, roughly 9760 of 10000 requests per thread finished within 5 ms.