Monday, February 11, 2008

MySQL Cluster Features - what they are and what they do

There are currently four derivatives of MySQL Cluster 5.1:
  • MySQL Cluster 5.1 - discontinued, don't use; it was an early version
  • MySQL Cluster 6.2
  • MySQL Cluster 6.3
  • MySQL Cluster 7.0
In this post I will explain which features exist where, what they do, and how they can be used.

The way to read it is as follows: whatever is included in MySQL Cluster 5.1 is also in MC 6.2, 6.3, and 7.0; whatever is in 6.2 is also in 6.3 and 7.0, and so on.

Which version you should use depends on what features you need. I recommend using MySQL Cluster 7.0 (and the latest release of it).

Features in MySQL Cluster 5.1 - discontinued

Feature                                       Category        Included in version and later
Disk data support                             Kernel/MySQL    5.1
Asynchronous replication (geo redundancy)     Kernel/MySQL    5.1
Variable size attributes                      Kernel/MySQL    5.1
Optimized Node Recovery                       Kernel          5.1
Online create index                           Kernel/MySQL    5.1
Batching of Insert/Select                     MySQL           5.1
Distribution awareness (direct APIs)          Direct APIs     5.1

Features in MySQL Cluster 6.2

Feature                                       Category        Included in version and later
255 nodes (increased Cluster size)            Kernel          MC 6.2
Page allocation                               Kernel          MC 6.2
Online Alter Table Add Column                 Kernel/MySQL    MC 6.2
Micro GCP                                     Replication     MC 6.2
Slave batching                                Replication     MC 6.2
Multi-connection optimization                 MySQL           MC 6.2

Features in MySQL Cluster 6.3

Feature                                       Category        Included in version and later
Realtime extensions                           Kernel          MC 6.3
Compressed Backup/LCP                         Kernel          MC 6.3
Two Local Checkpoints                         Kernel          MC 6.3
Parallel Node Recovery                        Kernel          MC 6.3
Page de-allocation                            Kernel          MC 6.3
Epoll                                         Kernel          MC 6.3
Optimize table                                Kernel/MySQL    MC 6.3
Conflict detection/resolution                 Replication     MC 6.3
Batching of Update/Delete                     MySQL           MC 6.3
Distribution awareness (MySQL)                MySQL           MC 6.3

Features in MySQL Cluster 7.0

Feature                                       Category        Included in version and later
Multi-threaded data nodes                     Kernel          MC 7.0
Online add node                               Kernel          MC 7.0
MS Windows Support (alpha)                    Kernel          MC 7.0

Realtime extensions

The Realtime extensions make it possible to give the data node process realtime priority and to control which CPU cores it should operate on.
In environments with strict response-time requirements it is usually a good idea to enable the realtime extensions and tune SchedulerExecutionTimer and SchedulerSpinTimer. Note: SchedulerExecutionTimer and SchedulerSpinTimer have no effect when using MC 7.0 and the multithreaded data node (ndbmtd).
Preliminary benchmarks (to be published here soon) indicate significantly lower response times (40% lower) and much more predictable response times.

Below is a baseline that can be set in config.ini (please note that I have left out the CPU IDs here; see below):

RealtimeScheduler=1
LockExecuteThreadToCPU=[cpuid]
LockMaintThreadsToCPU=[cpuid]
SchedulerExecutionTimer=80 (microseconds)
SchedulerSpinTimer=400 (microseconds)

Which CPU cores to bind to depends very much on the environment (kernel, network adapters, etc.). Use cat /proc/interrupts to figure out which CPUs do not handle the bulk of the interrupts from the network adapters.
In 7.0 and later, LockExecuteThreadToCPU can be a list of CPUs, e.g.:

LockExecuteThreadToCPU=0,2,4-7

or

LockExecuteThreadToCPU=0-3,5-7

or

LockExecuteThreadToCPU=0-7


Set LockExecuteThreadToCPU to lock the data node's main execution thread to that CPU. Lock the maintenance threads to some other CPU.
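
Putting it together, here is a hedged baseline (the CPU IDs 1 and 2 are assumptions for a four-core machine where core 0 handles the network interrupts; the parameters go in the data node section of config.ini):

[ndbd default]
RealtimeScheduler=1
LockExecuteThreadToCPU=1
LockMaintThreadsToCPU=2
SchedulerExecutionTimer=80
SchedulerSpinTimer=400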

Compressed Backup/LCP

The space requirements of backups and LCPs can be lowered by compressing them.
However, compression costs CPU cycles and should only be used if the maintenance threads (file system threads) have been bound to a dedicated core (see Realtime extensions). Enabling compression consumes about 60% CPU time on that core.

The space saving is about 42% even on completely random data; since real data usually compresses better, the saving is often larger in practice. Saving space is usually important on ATCA platforms.

To enable it, set the following in config.ini (0 = disable (default), 1 = enable):

CompressedBackup=0|1
CompressedLCP=0|1
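
For example, to enable both on all data nodes (a hedged sketch; the parameters go in the [ndbd default] section of config.ini):

[ndbd default]
CompressedBackup=1
CompressedLCP=1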

Parallel Node Recovery

Parallel node recovery enables more than one node at a time to perform node recovery. This means that all data nodes will restart faster in the case of a multi-node failure.

Two Local Checkpoints

Traditionally the data nodes write three LCPs before recycling the REDO log. With this feature only two LCPs have to be written before the REDO log is recycled, so the REDO log can be recycled sooner.

Disk Data Support

Too much to write here. Please see the reference manual.

Epoll

epoll is a new and improved way of handling many file descriptors. It is cheaper than select/poll for finding out whether a file descriptor has data to be read: instead of scanning through all the file descriptors on every call, epoll can answer this in constant time. epoll is specific to Linux; similar mechanisms exist under other names elsewhere, e.g., kqueue on FreeBSD and event ports on Solaris. Currently, MySQL Cluster only supports this kind of functionality on Linux.

Optimize table

The current version of OPTIMIZE TABLE can defragment records in tables that have DYNAMIC attributes (VARCHAR and VARBINARY, but also columns created with COLUMN_FORMAT DYNAMIC, such as INT and BIGINT), because DYNAMIC attributes are stored on separate pages from FIXED attributes. Thus, tables that have DYNAMIC attributes and are subject to INSERT/DELETE can be defragmented using OPTIMIZE TABLE, and the freed pages are returned to the page pool (DataMemory), where they can be reused by another table.
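
A minimal SQL sketch (the table and column names are made up for illustration):

CREATE TABLE t1 (
  id   INT PRIMARY KEY,
  data VARCHAR(255)     -- VARCHAR is a DYNAMIC attribute
) ENGINE=NDB;
-- ... after many INSERTs and DELETEs, reclaim the fragmented pages:
OPTIMIZE TABLE t1;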

Page de-allocation

The memory allocator allocates one page at a time from the page pool (DataMemory) to the table that needs it. If a page associated with a table becomes empty (e.g., by DELETE), it is returned to the page pool and can be reused by another table, or by the same one. In earlier versions (5.0, 5.1, MC 6.2), DELETE does not release the pages associated with a table; only DROP TABLE returns all pages to the page pool.
See Memory allocation and deallocation in MySQL Cluster for more details.


Asynchronous Replication (Geo Redundancy)

Use asynchronous replication between Clusters that are geographically separated in order to increase redundancy even more. In MC 6.3 it is also possible to have master-master replication with conflict detection and resolution. Please see the reference manual.

Variable size attributes

VARCHAR and VARBINARY are now variable sized. This means that, e.g., a VARCHAR(255) will only occupy the amount of space actually stored in it.
In 5.0, a VARCHAR(255) occupies 256B of storage even if you store only one byte in it.
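
A small illustration (the table name is made up): each row of the name column below now occupies roughly the length of the stored value plus a small length overhead, instead of a fixed 256B:

CREATE TABLE customers (
  id   INT PRIMARY KEY,
  name VARCHAR(255)     -- storing 'Bob' costs a few bytes, not 256B
) ENGINE=NDB;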

Optimized Node Recovery

Optimized Node Recovery means that a node restarted as ndbd will copy only changed data from the other started node in its node group (partition).
Thus, the sooner the node is restarted after a node failure, the faster it will recover, since there will be fewer changes to copy.
Restarting the node as ndbd --initial deletes the data node's file system, so all data is copied from the other node in the node group.

Online create index

Online create index means that an index can be created online, i.e., no ALTER TABLE is needed and transactions can still use the table while the index is created:

CREATE INDEX indexname ON t1(col1, ..., coln);

255 nodes (increased Cluster size)

The maximum number of nodes has been increased from 63 to 255. However, the maximum number of data nodes is still 48.

Page allocation

A new memory allocator for the ndb kernel that allocates memory to a table page by page (32KB per page). This removes the high memory overhead of the old memory allocator.
See Memory allocation and deallocation in MySQL Cluster for more details.

Online alter table add column

It is possible to add columns to the end of a table online. It is not possible to add BLOB or disk data columns online. If the ALTER TABLE cannot be performed online, a warning message is printed and the ALTER TABLE is not performed.
Example usage:

alter online table t1 add column g int;

Micro GCP

Used in conjunction with asynchronous replication. Micro GCP controls how often epochs are generated. An epoch is a consistent group of committed transactions.
If the MySQL server is generating a binary log, this group of committed transactions is written to the binary log and sent as a unit to the slave MySQL server.
Normally an epoch is generated every TimeBetweenGlobalCheckpoints, when a Global Checkpoint (GCP) is made, which is by default every 2 seconds.
This means that the epochs sent between the master and the slave can be big, and latency (replication lag) is also affected, since data is effectively sent from master to slave only every TimeBetweenGlobalCheckpoints.

Micro GCP alleviates this by chunking up the global checkpoint into smaller parts. The parameter that controls how big the chunks are is TimeBetweenEpochs. Setting TimeBetweenEpochs=100 (ms) in config.ini usually works very well; epochs are then smaller and are sent to the slave MySQL server every 100ms. Thus, setting this lowers the replication lag significantly, as in the sketch below.
Use this together with Slave Batching.
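
For example, a hedged config.ini sketch (2000ms is the default global checkpoint interval mentioned above, and 100ms is the suggested epoch time):

[ndbd default]
TimeBetweenGlobalCheckpoints=2000
TimeBetweenEpochs=100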

Slave Batching

Used in conjunction with asynchronous replication. Instead of the slave applying one record at a time, Slave Batching allows the slave to apply an entire epoch (see Micro GCP), consisting of one or more transactions, in one batch. It is highly recommended to enable this setting when using asynchronous replication:

[mysqld]
...
# 1 means enable, 0 means disable
slave-allow-batching=1
...

Conflict detection/resolution

Too much to write here. Please see the reference manual.

Multi-connection optimization

Multi-connection optimization means that one MySQL server can make several connections (ndb cluster connections) to the data nodes.
This means that the MySQL server can load balance requests over several ndb cluster connections and get around some mutex contention problems.
Depending on the queries, this gives a 70% or greater improvement in throughput from the MySQL server. The feature can be activated by setting the following in my.cnf:

[mysqld]
...
ndb-cluster-connection-pool=10
...

The above will create 10 ndb cluster connections to the data nodes from this MySQL server and will require 10 free [mysqld] slots in config.ini, as sketched below.
A good value is obtained by benchmarking your system and depends on the number of connections made to the MySQL server from client APIs.
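
For completeness, a hedged sketch of the matching slots in config.ini (the host name is an assumption; one [mysqld] section is needed per connection in the pool):

[mysqld]
HostName=192.168.0.10
[mysqld]
HostName=192.168.0.10
# ... eight more identical [mysqld] sections, for 10 in total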

Batching of Update/Delete

Statements like the following, which touch many primary key values at once, are now batched toward the data nodes instead of being executed one key at a time:

update t1 set y=y+1 where pk_key_part in (1,2,...)
delete from t1 where pk_key_part in (1,2,...)

Batching of Insert/Select

Multi-record inserts and primary key IN-list selects like the following are sent to the data nodes in batches rather than row by row:

insert into t1 values(1,1), (2,2), ... (n,n);
select * from t1 where pk_key_part in (1,2,...);

Distribution awareness (MySQL)

Distribution Awareness is a mechanism for the MySQL server to select the correct data node to query for information. Consider the following example:

CREATE TABLE t1 ( userid INTEGER,
data VARCHAR(255),
PRIMARY KEY(userid)) ENGINE=NDB PARTITION BY KEY (userid);


CREATE TABLE t2 ( userid INTEGER,
serviceid INTEGER AUTO_INCREMENT,
data VARCHAR(255),
PRIMARY KEY(userid, serviceid)) ENGINE=NDB PARTITION BY KEY(userid);

  • Table t1 will be distributed according to its primary key, userid, which is the default way of partitioning data.

  • Table t2 will be distributed according to userid, even though the full primary key is (userid, serviceid).
    This means that records with the same userid will be stored in the same partition (node group). Moreover, when e.g., reading data, the MySQL server will start the transaction on the data node that has the data instead of using a round-robin mechanism to select which data node to start the transaction on; see the query example below. However, this only has an impact on performance with four or more data nodes; with two data nodes, each data node holds all the data anyway.
    Benchmarks indicate that using Distribution Awareness yields about a 45% performance increase on four data nodes, and a lot more on bigger clusters.
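
For example, with the tables above, a query that includes userid in the condition lets the MySQL server start the transaction on the node group that actually stores that user's rows (the value 1 is illustrative):

SELECT data FROM t2 WHERE userid = 1;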

Distribution awareness (direct APIs)

See above for how to create the tables. See the direct API documentation for details.

Multi-threaded data node

The data node can utilize up to 8 cores. The parameter MaxNoOfExecutionThreads determines how many cores the data node should use (1-8). If you have 8 cores, set MaxNoOfExecutionThreads=8. Setting MaxNoOfExecutionThreads to a value higher than the number of cores you actually have (e.g., MaxNoOfExecutionThreads=8 on a 4-core machine) can cause deadlocks and contention, and is not recommended at all!
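
A hedged config.ini sketch for an 8-core machine (remember that this applies to the multithreaded binary, ndbmtd, not the single-threaded ndbd):

[ndbd default]
MaxNoOfExecutionThreads=8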

Online add node

You can add nodes online. This means you can grow your cluster from, e.g., 2 data nodes to 4 data nodes while you have traffic on the system (online). See here for more information.

Windows support (alpha)

MySQL Cluster runs on Windows!