Monday, October 29, 2007

Data distribution and distribution awareness

How come performance degrades with more data nodes?
Some of you might have noticed that performance can actually drop slightly when you move from a two data node cluster to a four data node cluster. To understand why, we need to look at how tables are organized and data is distributed in MySQL Cluster. I will also mention a bit about distribution awareness, a feature that can be used to minimize this performance drop in bigger configurations.

In MySQL Cluster a table is divided into fragments that are spread over the data nodes.
You have as many fragments (Fx) as you have data nodes (Nx).
If you have four data nodes, then a table will have four fragments:

Table T1= {F1, F2, F3, F4}

These fragments will be laid out on the data nodes as follows:

N1=F1P
N2=F2P
N3=F3P
N4=F4P

I have written FxP, where P indicates the primary fragment; potential locking conflicts are resolved on the data node holding the primary fragment for the particular data.
But this does not give you any redundancy. For redundancy you need two replicas (copies) of each fragment (NoOfReplicas=2 in config.ini, see the sketch below):

N1=F1P, F2S
N2=F2P, F1S
N3=F3P, F4S
N4=F4P, F3S

Some explanations:
  • Nx = data node x
  • S = secondary fragment (copy).
  • P = primary fragment, where lock conflicts are resolved.
  • Each data node has an active transaction coordinator (TC).
  • Nodes sharing the same information are called a node group. NG0={N1, N2} shares the same information (same fragments), and NG1={N3, N4} shares the same information.
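For reference, a minimal config.ini sketch that gives this four data node, two replica layout could look roughly like the following (the host names are just placeholders). With NoOfReplicas=2 the first two [ndbd] sections form NG0={N1, N2} and the next two form NG1={N3, N4}:

[ndbd default]
NoOfReplicas=2        # two copies (replicas) of every fragment

[ndb_mgmd]
HostName=mgmhost      # placeholder management server

[ndbd]
HostName=host1        # N1
[ndbd]
HostName=host2        # N2
[ndbd]
HostName=host3        # N3
[ndbd]
HostName=host4        # N4

[mysqld]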

Transactions are started in a round-robin fashion on the cluster by the application, unless you use distribution awareness (startTransaction with a hint). Since transactions are started round-robin, the transaction might be started on a node that does not have the particular data! When the transaction coordinator (TC) picked by the application (mysql server or ndbapi program) receives a primary key request (read/write), it calculates a 128-bit hash on the primary key. The first 64 bits tell where in the primary key hash table the entry should be stored, and the second 64 bits identify which fragment the data should be located on.

That fragment could reside on another node than the TC. If so, the TC has to forward the request to that fragment (if it is a read, it is forwarded and the node having the data replies to the application directly), or initiate the 2PC protocol to update both fragments that should have the data. If an update of a row identified by PK=1 hashes to F1, then both F1P and F1S have to be updated synchronously. The 2PC protocol takes care of that, and there is a little bit more about it later.
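To make the routing a bit more concrete, here is a small C++ sketch of the idea. The hash below is only a stand-in (std::hash) for the MD5-based hash NDB actually uses, and the fragment-to-node-group mapping simply follows the four node layout above:

#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Simplified stand-in for the 128-bit hash the TC computes on the primary key:
// one half locates the entry in the primary key hash table, the other half
// decides which fragment owns the row.
struct PkHash {
    uint64_t bucketPart;    // where in the primary key hash table the entry goes
    uint64_t fragmentPart;  // which fragment the row belongs to
};

PkHash hashPrimaryKey(const std::string& pk) {
    // std::hash is only for illustration; NDB uses an MD5-based hash.
    uint64_t h1 = std::hash<std::string>{}(pk);
    uint64_t h2 = std::hash<std::string>{}(pk + "#frag");
    return {h1, h2};
}

int main() {
    const int numFragments = 4;                    // four data nodes -> four fragments
    PkHash h = hashPrimaryKey("1");                // row with PK=1

    int fragment  = (int)(h.fragmentPart % numFragments);  // 0..3, i.e. F1..F4
    int nodeGroup = fragment / 2;                           // NG0={N1,N2}, NG1={N3,N4}

    // If the TC runs on a node outside this node group, the request has to be
    // forwarded: that is the extra network hop discussed above.
    std::printf("PK=1 -> fragment F%d in node group NG%d\n", fragment + 1, nodeGroup);
    return 0;
}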

Pretend that you want to read data that is located in fragment F1:
  • If you do a read without a lock (CommittedRead), then it can be served from either the primary or the secondary fragment, hence you have a 50% chance of hitting the correct nodes (N1 and N2) that have your data.
  • If you read with a lock, then we can only read from F1P, hence we have a 25% chance of hitting the right node (N1)!
And if you want to write (delete, update, insert):
  • If you write to F1, then the nodes N1 and N2 are involved, since both copies have to be updated using the 2PC protocol.

The more nodes you have, the more node groups you will have, and the probability that a read transaction will end up on a node that does not have the information increases.
With eight data nodes and two replicas you have four node groups and only a 25% chance that a read without a lock will go to the correct nodes.
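In general, with NoOfReplicas=2 and N data nodes (N/2 node groups), the chance that a round-robin transaction starts on a node that has the data is 2/N for a read without a lock and 1/N for a read with a lock. A trivial C++ sketch just to make the arithmetic explicit:

#include <cstdio>

int main() {
    const int replicas = 2;                        // NoOfReplicas=2
    for (int nodes = 2; nodes <= 8; nodes *= 2) {  // 2, 4 and 8 data node clusters
        double committedRead = 100.0 * replicas / nodes;  // read without a lock
        double lockingRead   = 100.0 / nodes;             // read with a lock (primary only)
        std::printf("%d data nodes: %.1f%% (CommittedRead), %.1f%% (locking read)\n",
                    nodes, committedRead, lockingRead);
    }
    return 0;
}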

Now, what you have seen is that performance degrades when you go from two data nodes to four nodes.
If you have two nodes, then a table is divided into two fragments:
Table T1= {F1, F2}

And the information is laid out as (with NoOfReplicas=2):
N1=F1P, F2S
N2=F2P, F1S

Pretend that you want to read data that is located in fragment F1.
  • If you do a read without a lock (CommittedRead), you have a 100% chance of starting the transaction on a node having the data (N1 and N2 both have the data, in F1P or F1S)!
  • If you read with a lock, then we can only read from F1P, hence we have a 50% chance of hitting the right node (N1)!
And if you write:
  • If you write to F1, then the nodes N1 and N2 are involved, since both copies have to be updated using the 2PC protocol. I will not explain the 2PC protocol much more than saying that the transaction coordinator (TC), which can be on any of the nodes, sends: prepare phase: TC --> F1P --> F1S --> TC; commit phase: TC --> F1S --> F1P --> TC.

In this case up to three different nodes can be involved in a four node cluster (the TC plus the two nodes in the node group holding the data), but only two nodes can ever be involved in the two node cluster, which reduces the number of messages sent between computers.

To summarize up to this point:
  • Reads become more expensive with a bigger cluster, because the chance of ending up on the wrong nodes increases, so an extra (at most one) network hop is required.
  • Writes become a little bit more expensive since there can be more network hops involved (the mysqld or the ndbapi/ndbj application starts the transaction on data node N3, but the data you are updating is on data nodes N1 and N2).

Distribution awareness

By using distribution awareness you can get around these problems and minimize the number of network hops. Distribution awareness ensures that the transaction is started on a node that has the data, and it can also be used to write data to a particular set of nodes.
  • Reads (with or without locks) will now have a 100% chance of getting to the right node.
  • Writes will now always be handled by the data nodes in one particular node group.
  • Distribution awareness is supported in NDBAPI and NDBJ, and from the MySQL Server starting with MySQL 5.1.22 / 6.3.4 Carrier Grade Edition.
In my next post I plan to illustrate how to use these features from the respective access methods (SQL and NDBAPI).
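To give a feeling for it already now, here is a rough NDB API (C++) sketch of a hinted transaction start. The table t1(pk INT, val INT), the column names and the key handling are made-up placeholders, and most error handling is left out; it is only meant to show where the hint goes:

#include <NdbApi.hpp>
#include <cstdio>

// Rough sketch: read one row from a hypothetical table t1(pk INT, val INT) using
// a hinted startTransaction(), so the TC is placed on a node that stores the row.
int readWithHint(Ndb* ndb, int pk)
{
    const NdbDictionary::Dictionary* dict = ndb->getDictionary();
    const NdbDictionary::Table* tab = dict->getTable("t1");   // hypothetical table
    if (tab == 0) return -1;

    // The hint: pass the table and the (partition) key value, so the transaction
    // is started on a node in the node group that holds this row.
    NdbTransaction* trans = ndb->startTransaction(tab, (const char*)&pk, sizeof(pk));
    if (trans == 0) return -1;

    NdbOperation* op = trans->getNdbOperation(tab);
    op->readTuple(NdbOperation::LM_CommittedRead);   // read without a lock
    op->equal("pk", pk);
    NdbRecAttr* val = op->getValue("val");

    int ret = trans->execute(NdbTransaction::Commit);
    if (ret == 0)
        std::printf("pk=%d val=%d\n", pk, val->int32_value());

    ndb->closeTransaction(trans);
    return ret;
}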

3 comments:

Jim Dowling said...

I think it's confusing to reintroduce the old terminology of "fragments" (I never liked that term).

They're called partitions from 5.1 onwards, whether they're user-defined or generated by the MD5 hashing fn.

Viswa Yogi said...

Can you give a code example for achieving the distribution awareness when using/coding an application?

Dorie said...

Johan, This is completely off subject of your blog...sorry. I am doing some family research. My great great grandfather was named Johan Andersson. I am guessing that the name is common in Sweden, but wondered if we might be related. Do you know some of your background? I have a lot of info and could probably figure it out. If you are interested you can send me a note. I have a blog too. http://dorie-sliceoflife.blogspot.com/
We might possibly be distant cousins or something.

Dorie