Thursday, October 30, 2008

Optimizing Queries on Cluster

On the cluster mailing list (cluster@lists.mysql.com , thread "slow selects") there was recently a very good example of how to optimize queries on Cluster. Thanks to Nick Keefen for raising this problem and Jeff Sturm for the answer how to solve it!

In short the problem is that the Optimizer in the MySQL server does not get adequate statistics from the data nodes about table sizes, indexes etc. This makes the Optimizer clueless in some cases how to order tables in joins, and also in some cases which is the best indexes to use.

So here is the problem that Nick highlighted:

When the tables are stored in MyISAM:
mysql> SELECT v.vid_id, v.vid_title, u.user FROM phpfox_user AS u JOIN phpfox_videos AS v ON(v.vid_userid = u.id) limit 3;
3 rows in set (0.32 sec)

phpfox_user is about 100000 rows and phpfox_videos is 170000 rows large

Trying to run the same query on my cluster machine, i get
3 rows in set (20.47 sec)

Why is this? Let's look at the EXPLAIN of the queries:


MyISAM explain:

mysql> explain SELECT v.vid_id, v.vid_title, u.user FROM phpfox_user AS u JOIN phpfox_videos AS v ON(v.vid_userid = u.id) ORDER BY v.vid_time DESC LIMIT 3;
+----+-------------+-------+--------+---------------+---------
+---------+---------------------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------
+---------+---------------------+--------+----------------+
| 1 | SIMPLE | v | ALL | vid_userid | NULL | NULL | NULL | 135025 | Using filesort |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | nktest.v.vid_userid | 1 | |
+----+-------------+-------+--------+---------------+---------
+---------+---------------------+--------+----------------+


NDB explain:

mysql> explain SELECT v.vid_id, v.vid_title, u.user FROM phpfox_user AS u JOIN phpfox_videos AS v ON(v.vid_userid = u.id) ORDER BY v.vid_time DESC LIMIT 3;
+----+-------------+-------+------+---------------+-----------
-+---------+-----------------+-------+------------------------
---------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+-----------
-+---------+-----------------+-------+------------------------
---------+
| 1 | SIMPLE | u | ALL | PRIMARY | NULL | NULL | NULL | 82124 | Using temporary; Using filesort |
| 1 | SIMPLE | v | ref | vid_userid | vid_userid | 4 | bb2_phpfox.u.id | 1 | |
+----+-------------+-------+------+---------------+-----------
-+---------+-----------------+-------+------------------------
---------+


Note the additional 'Using temporary' clause for NDB. Nearly all of the 20 seconds taken by the query fall at 'copying to temporary table'.


The problem is that the Optimizer has changed the order (in case of NDB) how tables should be joined!

Jeff Sturm replied to Nick with a remedy:


Looks like the query optimizer is putting the tables in the wrong order.
(This isn't uncommon in my experience, as NDB seems to have less
information than other engines to optimize queries at runtime.)

Try modifying your query with STRAIGHT_JOIN syntax, i.e.:

SELECT v.vid_id, v.vid_title, u.user FROM phpfox_videos AS v
STRAIGHT_JOIN phpfox_user AS u ON v.vid_userid = u.id ORDER BY
v.vid_time DESC LIMIT 3;


and the problem was solved. The query now runs in split second.
Moreover, in some cases you also want to use USE INDEX or FORCE INDEX.
There are situations where the Optimizer wants to use e.g index a_idx to resolve the query, but really index b_idx is the one it should use!
Thus you can use:
SELECT * FROM t1 FORCE INDEX(b_idx) , t2 WHERE ...

Good luck! And yes, one day the Optimizer will get the good statistics!

2 comments:

Dev MySql said...

Thanxs for the soultion.

I have a question you said

Note the additional 'Using temporary' clause for NDB. Nearly all of the 20 seconds taken by the query fall at 'copying to temporary table'

Can't we avoid this temporary usage

Johan Andersson said...

Hello,
Generally it is possible to avoid the Temporary table.
Do you have the possibility to send the query (+ tables (perhaps anonymous names)) and i can help you out. Perhaps we can also make it into another blog post since I believe it could benefit more people.

If that sounds ok, please email me at johan@severalnines.com