Monday, October 29, 2007

Data distribution and distribution awareness

How come performance degrades with more data nodes?
Some of you might have noticed that performance can actually drop slightly when you move from a two-node cluster to a four-node cluster. To understand why, we need to understand how data and tables are organized in MySQL Cluster. I will also mention a bit about distribution awareness, a feature that can be used to minimize this performance drop in bigger configurations.

In MySQL Cluster a table is divided into fragments that are spread over the data nodes.
You have as many fragments (Fx) as you have data nodes (Nx).
If you have four data nodes, then a table will have four fragments:

Table T1= {F1, F2, F3, F4}

These fragments will be laid out on the data nodes as follows:

N1=F1P
N2=F2P
N3=F3P
N4=F4P

I have written FxP, where P indicates the primary fragment; potential locking conflicts are resolved on the data node that holds the primary fragment for the particular data.
But this does not give you any redundancy. For redundancy you need two replicas (copies) of the fragments (NoOfReplicas=2 in config.ini):

N1=F1P, F2S
N2=F2P, F1S
N3=F3P, F4S
N4=F4P, F3S

Some explanations:
  • Nx = data node x
  • S = secondary fragment (copy).
  • P = primary fragment, where lock conflicts are handled.
  • Each data node has an active transaction coordinator (TC).
  • Nodes sharing the same information are called a node group. NG0={N1, N2} shares the same information (same fragments), and NG1={N3, N4} shares the same information.
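
For reference, the layout above (four data nodes, two replicas) corresponds to a config.ini along these lines. The hostnames are made up, and the management server and mysqld sections are left out:

    [ndbd default]
    # Two copies (replicas) of every fragment
    NoOfReplicas=2

    # The first two [ndbd] sections form node group NG0 (N1, N2),
    # the next two form node group NG1 (N3, N4).
    [ndbd]
    HostName=ndb1
    [ndbd]
    HostName=ndb2
    [ndbd]
    HostName=ndb3
    [ndbd]
    HostName=ndb4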

Transactions are started in a round-robin fashion on the cluster from the application, unless you use distribution awareness (startTransaction with a hint). Since transactions are started round-robin, a transaction might be started on a node that does not have the particular data! When the transaction coordinator (TC), which is picked round-robin by the application (the mysql server or the ndbapi program), receives a primary key request (read/write), it calculates a 128-bit hash on the primary key. The first 64 bits tell where in the primary key hash table the entry should be stored, and the second 64 bits identify which fragment the data should be located on. It could be a fragment residing on another node than the TC. If so, the TC has to forward the request to that fragment (if it is a read, it is forwarded and the node having the data replies to the application directly), or initiate the 2PC protocol to update both fragments that should have the data. If the update of a row identified by PK=1 hashes to F1, then both F1P and F1S have to be updated synchronously. The 2PC protocol takes care of that; a little bit more about that later.
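
To make the mapping a bit more concrete, here is a purely illustrative C++ sketch of the idea described above. This is not the actual NDB code (internally an MD5-based hash is used; std::hash is only a stand-in), it just shows how a primary key hash can be mapped to a fragment, and the fragment to a node group:

    #include <cstdint>
    #include <functional>
    #include <string>

    struct FragmentLocation {
        uint32_t fragment;    // which fragment (F1..Fn), 0-based here
        uint32_t node_group;  // which node group stores that fragment
    };

    // Hash the primary key, pick a fragment from the hash, and derive the
    // node group that owns that fragment.
    FragmentLocation locate_row(const std::string &primary_key,
                                uint32_t num_fragments,
                                uint32_t num_replicas) {
        uint64_t h = std::hash<std::string>{}(primary_key);
        uint32_t fragment = static_cast<uint32_t>(h % num_fragments);
        // With the layout above and two replicas: {F1, F2} -> NG0, {F3, F4} -> NG1.
        uint32_t node_group = fragment / num_replicas;
        return {fragment, node_group};
    }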

Pretend that you want to read data that is located in fragment F1:
  • If you do a read without a lock (CommittedRead), then the read can be served by either the primary or the secondary fragment, hence you have a 50% chance of hitting one of the correct nodes (N1 and N2) holding your data.
  • If you read with a lock, then the read can only be served by F1P, hence you have a 25% chance of hitting the right node (N1)! (A minimal NDB API sketch of both kinds of reads follows after this list.)
And if you want to write (delete, update, insert):
  • If you write to F1, then the nodes N1 and N2 are involved, since both copies have to be updated using the 2PC protocol.
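
Here is a minimal NDB API sketch of the two kinds of reads. It assumes a hypothetical NDB table t1 with an integer primary key id and a column data, an already-connected Ndb_cluster_connection, and leaves out all error handling:

    #include <NdbApi.hpp>

    void read_example(Ndb_cluster_connection &conn) {
        Ndb ndb(&conn, "test");   // "test" is the database name
        ndb.init();

        const NdbDictionary::Dictionary *dict = ndb.getDictionary();
        const NdbDictionary::Table *tab = dict->getTable("t1");

        // No hint given: the transaction coordinator is picked round-robin,
        // so it may or may not be one of the nodes holding the row.
        NdbTransaction *trans = ndb.startTransaction();
        NdbOperation *op = trans->getNdbOperation(tab);

        // CommittedRead: no lock, can be served by either the primary or the
        // secondary fragment.
        op->readTuple(NdbOperation::LM_CommittedRead);
        // A locking read must go to the primary fragment instead:
        //   op->readTuple(NdbOperation::LM_Read);       // shared lock
        //   op->readTuple(NdbOperation::LM_Exclusive);  // exclusive lock

        op->equal("id", 1);                        // primary key lookup
        NdbRecAttr *data = op->getValue("data", 0);

        trans->execute(NdbTransaction::Commit);
        // ... use data->... here ...
        ndb.closeTransaction(trans);
    }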

The more nodes you have, the more node groups you will have, and the higher the probability that a read transaction ends up on a node that does not have the information.
With eight nodes and two replicas you have four node groups and only a 25% chance that a read without a lock will go to one of the correct nodes.

Now, what you have seen is that performance degrades when you go from two data nodes to four nodes.
If you have two nodes, then a table is divided into two fragments:
Table T1= {F1, F2}

And the information is laid out as (with NoOfReplicas=2):
N1=F1P, F2S
N2=F2P, F1S

Pretend that you want to read data that is located in fragment F1.
  • If you do a read without a lock (CommittedRead), you have a 100% chance of starting the transaction on a node having the data (N1 and N2 both have the data, in F1P or F1S)!
  • If you read with a lock, then the read can only be served by F1P, hence you have a 50% chance of hitting the right node (N1)!
And if you write:
  • If you write to F1, then the nodes N1 and N2 are involved, since both copies have to be updated using the 2PC protocol. I will not explain the 2PC protocol much more than saying that the transaction coordinator (TC), which can be on any of the nodes, sends: Prepare phase: TC --> F1P --> F1S --> TC, Commit phase: TC --> F1S --> F1P --> TC.

In the four-node cluster up to three different nodes can be involved (the TC on one node, plus the two nodes in the node group holding the data), whereas in the two-node cluster at most two nodes can be involved, thus reducing the number of messages sent between computers.

To summarize up to this point:
  • Reads become more expensive in a bigger cluster because the chance of getting to the wrong node increases, and thus an extra (and maximally one) network hop is required.
  • Writes become a little bit more expensive since there may be more network hops involved (the mysqld or the ndbapi/ndbj application starts the transaction on data node N3, but the data you are updating is on data nodes N1 and N2).

Distribution awareness

By using distribution awareness you can get around these problems and minimize the number of network hops. Distribution awareness ensures that the transaction is started on a node that has the data; it can also be used to write data to a particular set of nodes. A minimal NDB API sketch follows the list below.
  • Reads (with or without locks) will now have a 100% chance of getting to the right node.
  • Writes will now always be handled by the data nodes in one particular node group.
  • Distribution awareness is supported in NDBAPI and NDBJ, and from the MySQL Server starting with MySQL 5.1.22 / 6.3.4 Carrier Grade Edition.
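
As a small teaser, here is a minimal NDB API sketch of starting a transaction with a hint, again assuming the hypothetical table t1 with an integer primary key (and partition key) id. The hint makes the transaction coordinator end up on a node in the node group that owns the row:

    #include <NdbApi.hpp>

    void start_with_hint(Ndb_cluster_connection &conn, int id) {
        Ndb ndb(&conn, "test");
        ndb.init();

        const NdbDictionary::Table *tab = ndb.getDictionary()->getTable("t1");

        // The partition key (here just the primary key column 'id') is passed
        // as a NULL-terminated list of key parts.
        Ndb::Key_part_ptr key_parts[2];
        key_parts[0].ptr = &id;
        key_parts[0].len = sizeof(id);
        key_parts[1].ptr = 0;   // terminator

        // startTransaction() with a hint: the TC is placed on a node holding
        // the fragment that the key hashes to, avoiding the extra hop.
        NdbTransaction *trans = ndb.startTransaction(tab, key_parts);

        // ... define and execute read/write operations as usual ...

        ndb.closeTransaction(trans);
    }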
In my next post I plan to illustrate how to use these features from the respective access methods (SQL or NDBAPI).

Tuesday, October 16, 2007

Memory allocation and deallocation in MySQL Cluster

In this post I will try to explain how memory management works inside MySQL Cluster.
Memory management differs between different versions of MySQL Cluster.

The general idea is as follows:
  • Data (and ordered indexes) is stored in DataMemory. DataMemory is organized in pages, where each page is 32 KB.
  • You can view DataMemory as a pool consisting of X 32 KB pages (see the example below).
  • A table is associated with one or more pages from the DataMemory (pool of pages) depending on how much data is stored in the table.
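
For example, with a hypothetical DataMemory setting like the one below, each data node gets a pool of 512 MB / 32 KB = 16384 pages:

    [ndbd default]
    NoOfReplicas=2
    # 512 MB of DataMemory = 16384 pages of 32 KB each, per data node
    DataMemory=512M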
That is the general stuff, now let's see how this is used.


MySQL Cluster 6.2/6.3/7.0
Apart from a number of optimizations and new features, the memory allocator has also been revamped and now allocates pages on a per-page basis.

Allocation
  1. If the table is empty and there is an insert - one page will be allocated from the page pool (DataMemory) and associated with that table.
  2. Subsequent inserts will fill up the data page until it is full.
  3. The allocator allocates one page at a time!
Deallocation
  • In MySQL Cluster 6.2, a delete resulting in a completely free page will not return the free page to the DataMemory page pool! It will still be associated with the table it was allocated to. Thus, a "DELETE FROM table1" will not free up any pages; the pages are still bound to the table. From MySQL Cluster 6.3 there is per-page de-allocation, so empty pages can be given back to the page pool.
  • truncate/drop table will release the pages to the DataMemory page pool.
  • Rolling restarts of the data nodes will not defragment empty pages, except for pages that store VAR* attributes (VAR* attributes are stored separately from fixed-size attributes).
MySQL 5.0 (don't use)

Allocation
  1. If the table is empty and there is an insert - one page will be allocated from the page pool (DataMemory) and associated with that table.
  2. Subsequent inserts will fill up the data page until it is full.
  3. When the page is full, the memory allocator will grow the number of pages associated with the table by 18.75%! This means that if a table is using 100 pages and all those pages are full, then the next insert will force the allocator to allocate another 18.75 (19 rounded up) pages from the DataMemory page pool to the table, so that the table now has 119 pages associated with it.
Moreover, you can have free DataMemory (check with ndb_mgm -e "all dump 1000" and then look in the cluster log for what it prints out), but still fail to do a single insert and get the error message "Table full", because the allocator cannot grow the table by 18.75% when there are not enough free pages left in the DataMemory page pool!

E.g., the table currently uses 1000 pages; the next allocation will force the table to grow by roughly 188 additional pages (18.75% of 1000), but if there are fewer than 188 free pages left in the DataMemory page pool you will get the "Table full" error message. However, another table may of course still be able to do its allocation.

Best practice is to never fill up your database to more than 80% of the DataMemory, because of this and because during a System Restart the data nodes may recover the data in another order than you inserted it (so the allocator may have succeeded in inserting the data into the database, but during recovery the allocation pattern may, and is likely to, be different), which means it may be impossible to recover your data in case of a cluster failure.

Deallocation
  • A delete resulting in a free page will not return the free page to the DataMemory page pool! It will still be associated with the table it was allocated to. Thus, a "DELETE FROM table1" will not free up any pages. The pages are still bound to table1, and subsequent inserts will be made on the pages already associated with the table.
  • truncate/drop table will release the pages to the DataMemory page pool.
  • In 5.0 - rolling restarts of the data nodes will reclaim and defragment empty pages.
  • In 5.1 - rolling restarts of the data nodes will not defragment empty pages. This has to do with a new faster node recovery protocol, called Optimized Node Recovery.
Please note that you can use ndb_mgm -e "all report mem" to view the memory utilization.

Thanks Jonas and Tomas at MySQL for your input on this.