Monday, February 11, 2008

MySQL Cluster Features - what they are and what they do

There are currently four derivates of MySQL Cluster 5.1:
  • MySQL Cluster 5.1 - discontinued, don't use. It was an early version
  • MySQL Cluster 6.2
  • MySQL Cluster 6.3
  • MySQL Cluster 7.0
In this post I will explain what features exists where, what they do and what they mean and how they can be used.

The way to read it is as follows: Whatever is included in MySQL Cluster 5.1, is also in MC 6.2 and 6.3. Whatever is in 6.2 is also in 6.3 and so on.

What version you should use depends on what features you need. I recommend to use MySQL Cluster 7.0 (and the latest release of it).

Features in MySQL Cluster 5.1 - discontinued

Feature

Category

Included in version and later

Disk data support Kernel/MySQL 5.1
Asynchronous replication (geo redundancy) Kernel/MySQL 5.1
Variable size attributes Kernel/MySQL 5.1
Optimized Node Recovery Kernel 5.1
Online create index Kernel/MySQL 5.1
Batching of Insert/Select MySQL 5.1
Distribution awareness (direct APIs) Direct APIs 5.1

Features in MySQL Cluster 6.2

Feature

Category

Included in version and later

255 nodes (increased Cluster size) Kernel MC 6.2
Page allocation Kernel MC 6.2
Online Alter Table Add Column Kernel/MySQL MC 6.2
Micro GCP Replication MC 6.2
Slave batching Replication MC 6.2
Multiconnection optimization MySQL MC 6.2

Features in MySQL Cluster 6.3

Feature

Category

Included in version and later

Realtime extensions Kernel MC 6.3
Compressed Backup/LCP Kernel MC 6.3
Two Local Checkpoints Kernel MC 6.3
Parallel Node Recovery Kernel MC 6.3
Page de-allocation Kernel MC 6.3
Epoll Kernel MC 6.3
Optimize table Kernel/MySQL MC 6.3
Conflict detection/resolution Replication MC 6.3
Batching of Update/Delete MySQL MC 6.3
Distribution awareness (MySQL) MySQL MC 6.3

Features in MySQL Cluster 7.0

Feature

Category

Included in version and later

Multithreaded datanodesKernelMC 7.0
Online add node KernelMC 7.0
MS Windows Support (alpha)Kernel MC 7.0

Realtime extensions

The Realtime extensions make it possible to set realtime priority to the data node process and control on which CPU cores it should operate.
In environments with strong requirements on response times it is usually a good idea to enable the realtime extensions and tune the SchedulerExecutionTimer and SchedulerSpinTimer. Note - SchedulerExecutionTimer and SchedulerSpinTimer - does not have any effect if using MC70 and the multithreaded data nodes (ndbmtd).
Preliminary benchmarks (to be published here soon) indicates significantly lower response times (40% lower) and much more predictable response times.

Below is a baseline that can be set in config.ini (please not that I have left out the cpuids here. See below):

RealtimeScheduler=1
LockExecuteThreadToCPU=[cpuid]
LockMaintThreadsToCPU=[cpuid]
SchedulerExecutionTimer=80 (ms)
SchedulerSpinTimer=400 (ms)

Which CPU cores to bind to is very dependent on the enviroment (kernel, network adapters etc). Use cat /proc/interrupts to figure out which CPU ID that does not generate bulk of the interrupts on the network adapters.
In 7.0 and later, the LockExecuteThreadToCPU can be a list of cpus: E.g

LockExecuteThreadToCPU=0,2,4-7

or

LockExecuteThreadToCPU=0-3,5-7

or

LockExecuteThreadToCPU=0-7


Set LockExecuteThreadToCPU to lock the data node's main execution thread to that CPU. Lock the maintenance threads to some other CPU.

Compressed Backup/LCP

The space requirements of the backups and LCPs can be lowered by compressing the Backup and LCP.
However, compression costs CPU cycles and should only be used if the maintenance threads (file system threads) have been bound to a dedicated core, see (Realtime extensions). Enabling compression requires about 60% cpu time on that core.

The space saving is about 42% on completely random data. Thus in many cases it is probably more. Saving space is usually important in ATCA platforms.

To enable it, set the following in config.ini: (0=disable (default), 1=enable):

CompressedBACKUP=0|1
CompressedLCP=0|1

Parallel Node Recovery

Parallel node recovery enables more than one node at a time to perform node recovery. This means that all data nodes will restart faster in the case of multi-node failure.

Two LocalCheckpoints

Traditionally the data nodes are writing three LCPs before recycling the REDO-log. This features makes it necessary to write two LCPs before the REDO-log is recycled.

Disk Data Support

Too much to write here. Please see the reference manual.

Epoll

epoll is a new and improved way of handling many file descriptors. It is a cheaper way than using select/poll to find out if a fd has data to be read. Instead of doing a select and scanning through the fds (done by the OS), epoll can do this in constant. epoll is specific to Linux and has other names on e.g Solaris and FreeBSD. Currently, MySQL Cluster only supports this kind of functionality on Linux.

Optimize table

Current version of OPTIMIZE TABLE can defragment records in table which has DYNAMIC attributes (VARCHAR, VARBINARY , but also columns created with COLUMN_FORMAT DYNAMIC such as INT, BIGINT) because DYNAMIC attributes are stored on a separate page compared to FIXED attributes. Thus tables which has DYNAMIC attributes and are subject to INSERT/DELETE will be defragmented using OPTIMIZE TABLE, and freed pages will be returned to the page pool (DataMemory) and can then be reused by another table.

Page de-allocation

The memory allocator allocates one page at the time from the Page pool (DataMemory) to the table that needs it. If a page associated to a table becomes empty (e.g by DELETE) it will be returned to the page pool and can be reused by another table or the same. In earlier versions (5.0,5.1, 5.1 MC 6.2), DELETE does not release the pages associated with a table. DROP TABLE returns all pages to the page pool.
See Memory allocation and deallocation in MySQL Cluster for more details.


Asynchronous Replication (Geo Redundancy)

Use asynchronous replication between Cluster that are geographically separated in order to increase redundancy even more. In MC 6.3 it is also possible to have master-master replication with conflict detectiona and resolution. Please see the reference manual.

Variable size attributes

VARCHAR and VARBINARY are now variable sized. This means that e.g a VARCHAR(255) will only occupy the amount of space that are inserted into it.
In 5.0 a VARCHAR(255) occupies 256B of storage no matter if you store only one byte in it.

Optimized Node Recovery

Optimized Node Recovery means that a node that is restarted as ndbd will copy only changed that from the other started node in its node group (partition).
Thus, the faster the node is restarted after a node failure, the faster it will recover since it will be less changes to copy.
Restarting the node as ndbd --initial will result in that the data node's file system will be deleted and all data will be copied from the other node in the node group.

Online create index

CREATE INDEX 'indexname' ON t1(col1,..,coln);Online create index means that this index will be created online, i.e. no ALTER TABLE is needed and transactions can still use the table while the index is created.

255 nodes (increased Cluster size)

Maximum number of nodes has been increased from 63 to 255. However, max number of data nodes are still 48.

Page allocation

New memory allocator for ndb kernel which allocates memory to a table page by page (32 K bytes per page). This removes the high memory overhead with the old memory allocator.
See Memory allocation and deallocation in MySQL Cluster for more details.

Online alter table add column

It is possible to add columns to the end of the table. It is not possible to add BLOBs and disk data columns online. If the ALTER TABLE ONLINE cannot be performed online a warning message will be printed and it will not perform the ALTER TABLE.
Example usage:

alter online table t1 add column g int;

Micro GCP

Used in conjuction to asynchronous replication. Micro GCP controls how often epochs should be generated. An epoch is a consistent group of committed transactions.
If the MySQL server is generating a binary log, this group of committed transactions written in the binary log and sent as a unit to the slave mysql server.
Normally the epoch is generated every TimebetweenGlobalCheckpoints when a Global CheckPoint is generated (GCP), which is by default every 2 seconds.
This means that the epochs sent between the master and slave can be big, and also latency (replication lag) is effected since data between the master and slave will be sent effectively every TimebetweenGlobalCheckpoints.

Micro GCP alleviates that by chunking up the Global checkpoint into smaller parts. The parameter that controls how big the chunks are is TimeBetweenEpochs. Setting TimeBetweenEpochs=100 [ms] in config.ini usually works very well and this means that epochs are smaller and sent to the slave mysql server every 100ms. Thus, setting this lowers the replication lag significantly.
Use this also with Slave Batching.

Slave Batching

Used in conjuction to asynchronous replication. Instead of the slave applying one record at a time, Slave Batching allows an entire epoch (see Micro GCP) consisting one or more transactions. It is highly recommended to enable setting when using asynchronous replication.
[mysqld]
...
#1 -- means enable, 0 -- means disable
slave-allow-batching=1
...

Conflict detection/resolution

To much to write here. Please see the reference manual.

Multi-connection optimization

Multi-connection optimization means that one MySQL server can make several connections(ndb cluster connections) to the data nodes.
This means that the MySQL server can load balance requests on several ndb cluster connections and get around some mutex contention problems.
Depending on queries this gives from 70% and more improvement on throughput from the MySQL server. This feature can be activated by setting the following in my.cnf:

[mysqld]
...
ndb-cluster-connection-pool=10
...

The above will create 10 ndb cluster connections to the data nodes from this MySQL server and will require 10 free [mysqld] slots in the config.ini.
A good value is obtained by benchmarking your system and depends on the number of connections made to the MySQL server from client APIs.

Batching of Update/Delete

update t1 set y=y+1 where pk_key_part in (1,2,...)
delete from t1 where pk_key_part in (1,2,...)

Batching of Insert/Select

insert into t1 values(1,1), (2,2), ... (n,n);
select * from t1 where pk_key_part in (1,2,...);

Distribution awareness (MySQL)

Distribution Awarness is a mechanism for the MySQL server to select the correct data node to query for information. Consider following example:

CREATE TABLE t1 ( userid INTEGER,
data VARCHAR(255),
PRIMARY KEY(userid)) ENGINE=NDB PARTITION BY KEY (userid);


CREATE TABLE t2 ( userid INTEGER,
serviceid INTEGER AUTO_INCREMENT,
data VARCHAR(255),
PRIMARY KEY(userid, serviceid)) ENGINE=NDB PARTITION BY KEY(userid);

  • Table t1 will be distributed according to its primary key, userid, whichis the default way of partitioning data.

  • Table t2 will be distributed according to userid, even though the full primary key is (userid,serviceid).
    This means that a records with the same userid will be stored in the same partition (node group). Moreover, when e.g reading data the MySQL server will start transactions on the data node having the data instead of using a round-robin mechanism to select which data node to start the transaction on. However, this only has an impact on performance when having >= 4 data nodes. With two data nodes, each data node will have the same data.
    Benchmarks indicates that using Distribution Awareness yields about 45% performance increase on four data nodes, but a lot more on bigger clusters.

Distribution awareness (directAPIs)

See above how to create tables. See Direct API documentation for details.

Multi-threaded data node

The data node can utilize 8 cores. The parameter MaxNoOfExecutionThreads determines how many cores the data node should use (1-8). If you have 8 cores, set MaxNoOfExecutionThreads=8. If you set MaxNoOfExecutionThreads to a value higher than the actual cores you have (e.g, MaxNoOfExecutionThreads=8 but you only have 4 cores) , then it can cause deadlocks and contentions and it is not recommended at all!

Online add node

You can add nodes online. This means you can grow your cluster from e.g 2 data nodes to 4 data nodes while you have traffic on the system (online). See here for more information.

Windows support (alpha)

MySQL Cluster runs on Windows!

12 comments:

Jesus said...

Johan,
I appreciate your information, I want to ask you about this issue. I need to setup and enviroment on one main server and four remote nodes. The main server would be only, i repeat only for receiving nodes updates. Node will be capturing information ( no informations conflicts would exist in spite the will be working on diferent information).

Now, does cluster setup could help me? I saw mysql replication, but I see that it is backwards of what I need .. it is from main to nodes..
I appreciate your help.

Jesus Granados

Johan Andersson said...

Hello Jesus,

i think you want to look at this:
http://johanandersson.blogspot.com/2009/04/multi-source-replication-with-mysql.html

This does exactly what you want to have.

Good luck,
j

Theodore Hope said...

Johan, I'm curious why in your example you set "ndb-cluster-connection-pool=10" (so high).

The MySQL doc says the following:
---
"This option is useful only when running mysqld on host machines having multiple CPUs, multiple cores, or both. For best results, the value should be smaller than the total number of cores available on the host machine. Setting it to a value greater than this is likely to degrade performance severely."
---

In other words, setting "ndb-cluster-connection-pool=10" would only make sense for an NDB node with more than 5 cores (e.g., 8 cores), is that right?

Johan Andersson said...

Hi Theodore,

10 was just an example, so people understand you must have 10 [MYSQLD] slots in config.ini.

But setting it to 2x usually works fine. Already setting it to a value of two gives a huge increase compared to the default 1.

Best regards,
johan

mrManuz said...

Hi Johan,
I thought the maximum number of nodes limit had to be improved (from 255 to x....) in mysql ndb 7.x

I saw your explanation about multithreaded datanodes (ndbmt) and I think it's another story (not related to break max num nodes limit).

Am I right ?
7.0.15b (for example) has the limit,yet ? 7.1.4 ?

thanks a lot .
Emanuele

the brinkman said...

Hi Johan, would it be possiblke to extend a table size using the ALTER TABLE ... PARTITION statement?

If so, would that be preferrable to creating a new data file using ADD TABLESPACE ... statement?

Any more notes would be very helpful seeing a the maximum table size for each partion (hence node?) is about 46million.

Thanks

the brinkman said...

here's the resource for max rows per partition:

http://dev.mysql.com/doc/mysql-cluster-excerpt/5.1/en/mysql-cluster-limitations-database-objects.html

Kavin said...

Hi Johan,

Thank you so much for this information. Very useful, but I am facing a problem with the MYSQL NDB Cluster setup.
I am using NBD CLuster 7.1.10 in Fedora 14.
My RAM size is 32 GB and using hexCore Processor.

I am inserting thousands of records per second to the DB. At some point, I am getting an error message saying "Table is Full".

When I tried changing the datamemory in config.ini, it is not allowing me to increase it more than 10 GB, if I increase, the node is not starting. I have a 2 node cluster and one cluster manger.

Could you please help me solving this issue.

Thank you so much in advance.

Johan Andersson said...

Hi,

you must restart the cluster when you have changed the DataMemory.

Restart the mangement server with --reload and then restart one data node at a time.

You can also you the cluster configurator to generate a production class MySQL Cluster config:

www.severalnines.com/config

Kavin said...

Thanks so much Johan. In fact I tried that even before searching around. What exactly happens in my case is, when the data crosses 10 GB, the Table is Full error message is thrown and the Memory is 100%. I am not surr why the memory is not released at all.

Any idea? Thanks again!

BTW, can U give me a link where I could download a working MYSQL Cluster version for Fedora 14 64 bit hex Core Server?

Andy C. said...

Hi Johan,

I need to ALTER TABLE on a large table (10m+ rows).

This is to change a TEXT filed to a long VARCHAR, as I know that TEXT is not very efficient in NDB.

Can you suggest the best way to do this without downtime or with very minimal downtime please?

Would ALTER TABLE work? Or only in Single User Mode? Even then, is it the quickest option, or would a dump and LOAD DATA INFILE be better?

Any other methods available?

Kind regards,
Andy

Johan Andersson said...

Hi,

You will need as much RAM as you orignial table.

I would recommend the following if it is a big table

1) use mysqldump and dump out the table
2) change the table def in the dump
3) drop old table
4) reload the dump


Please note that ALTER TABLE to change column size is not an online operation (this means that if someone writes data to the table while you alter, it may be there it may not be..).

Thus LOCK TABLES on all mysql servers and then do the ALTER if you have enough free DataMemory/IndexMemory to do the ALTER, but if you can afford to take the table offline, then i would do so and use mysqldump.