"Performance will go down when number of data nodes increases on MySQL Cluster setup"
Have you ever heard such rumor? This is a little bit true, but mostly false. Such performance disadvantages are caused by badly designed schema in most cases. Schema design is really important in order to bring out MySQL Cluster performance.
MySQL Cluster stores and distribute rows amongst node groups according to PRIMARY KEY per row. If there is no bias on its PRIMARY KEY values, tables are distributed evenly. MySQL Cluster stores data like sharding.
Lookups based on PRIMARY KEY is lightning fast on MySQL Cluster, because the SQL node knows which data node have the target row. On the other hand, if a table is accessed without PK, the SQL node doesn't know where the target rows are stored and should access all the data nodes, thus performance will go down. Look at two figures below:
When looking-up by PK, the SQL node send a single request to the data node which has the row, and the data node responds. However, looking-up without PK specified, the data node which receives a request transfers the request to other data nodes. This increases the network traffic between the nodes significantly. (These two operations are handled completely differently within NDBAPI.)
When a SQL nodes starts a transaction, it select one data node as a transaction coordinator, TC. TC is responsible for the transaction until the transaction finishes, and all communications between SQL node and data node go through TC. So, even if lookups are based on PK, additional communications between data nodes may be required when accessing second or later tables within the transaction. See the figure below.
These problems will be exposed when the number of data nodes is increased. Then, the network will become the bottle-neck. The former case causes many network traffics, and the latter case could be impacted when more than one tables are updated within a transaction.
MySQL Cluster, as of 6.3, implements a feature which can mitigate these problems - so called Distribution Awareness. With Distribution Awareness, developers can design the schema so that they are distributed by any keys included in PK. We can expect that tables are accessed more efficiently with Distribution Awareness.
We can see the effect of Distribution Awareness with benchmark tests such as DBT-2. When I tested its effect on my PC, I got the following result:
This benchmark is measured using a single machine, and data nodes are created on the same OS instance. While this benchmark doesn't reflect the real world's performance because the cluster is configured with a single machine, it would help understanding how Distribution Awareness impact the performance.
Benchmark environment overview:
- OS:Ubuntu 8.10
- MySQL Cluster 6.3.23
- CPU:AMD/2 cores/2.3GHz
- Physical Memory:8GB
- HDD:SATA 7200rpm
- Data Nodes:10 (5 node groups)
- SQL Node:1
- Benchmark Software: DBT-2/# warehouses=5
Each result stands for the following.
This is just for comparison. MySQL Server is configured with innodb_buffer_pool_size=4G so that all data is cached in the memory.
NDB w/o DA
Distribution Awareness does not take effect. Data is loaded as MyISAM firstly, then they are simply converted to NDB using ALTER TABLE tbl_name ENGINE NDB statement.
Distribution Awareness takes effect. Most of tables on DBT-2 test have a column refers PK on `warehouses` table, and such a column is included in PK. So, I altered the table using the following statements so that they are distributed by warehouse id.
mysql> alter table customer engine ndb partition by key (c_w_id);
mysql> alter table district engine ndb partition by key (d_w_id);
mysql> alter table history add h_rowid int unsigned not null auto_increment, add primary key (h_rowid, h_w_id);
mysql> alter table history engine ndb partition by key (h_w_id);
mysql> alter table item engine ndb;
mysql> alter table new_order engine ndb partition by key (no_w_id);
mysql> alter table order_line engine ndb partition by key (ol_w_id);
mysql> alter table orders engine ndb partition by key (o_w_id);
You can distribute rows amongst data nodes based on a given key using PARTITION BY KEY clause. As the key should be included in the explicitly defined PK in order to use PARTITION BY KE clause, I added a new AUTO-INCREMENT column to `history` table.
NDB w/full DA
While `item` table is not correlated to `warehouses` table at all, Distribution Awareness will not take effect when `item` table and other tables are accessed within a single transaction. In order to make Distribution Awareness always effective, I converted `item` table and `warehouse` table to MyISAM.
mysql> alter table item engine myisam;
mysql> alter table warehouse engine myisam;
NDB single node
This is just for comparison, with the following configuration.
You can see a big difference with or without Distribution Awareness. The benchmark was best when all transactions are fully distribution aware, which was achieved by converting an uncorrelated table to MyISAM. While `item` is a kind of read-only table, it can be deployed as a MyISAM table practically in production.
MySQL Cluster's shared nothing architecture is very similar to sharding. So, when tables are defined like sharding and accessed like sharding, MySQL Cluster demonstrates the best performance. For example, sharded application accesses its underlying database using user-id by determining which database node has required data. In that case, user-id is always required when accessing a database. Similarly, in order to make SQL query distribution aware on MySQL Cluster, condition like "AND userid = 12345" should be always specified.
If Distribution Awareness does not take effect, the network will become a bottle-neck when number of data nodes increases. This causes performance degradation when data nodes increase. On the other hand, if Distribution Awareness takes effect properly, the network may rarely be the bottle-neck in this way. So, the performance is likely to scale as data node increases. So, whether tables and queries can be Distribution Aware or not is a very important point to determine if you will use MySQL Cluster or not in your application.
You can see if Distribution Awareness is effective or not using EXPLAIN PARTITIONS statement. If only a single partition is listed, the query is distribution aware!
mysql> EXPLAIN PARTITIONS SELECT no_o_id FROM new_order WHERE no_w_id = 2 AND no_d_id = 7\G
*************************** 1. row ***************************
partitions: p5 <--- OK if only a single partition is listed like this
1 row in set (0.00 sec)
The benchmark is done on a single machine. So, communications between nodes are faster than the real case because they do not go through a network. If the cluster is configured using multiple machines, difference between with or without Distribution Awareness will be more significant.
You may think like "MySQL Cluster is messy because I should take care of its distribution... Sharding is easier than it for me, isn't it?" However, you have to implement your own logic to separate load between databases when sharding, on the other hand, on MySQL Cluster, you can achieve the same thing by tuning your tables and queries a little bit. So the trade off is "Implementing own sharding logic" vs "Tuning tables and queries". Which is suitable for you? ;)
As the number of data nodes increases...
- Performance goes down without Distribution Awareness.
- Performance scales with Distribution Awareness.
In order to make use of Distribution Awareness, you have to tune both tables and queries:
- CREATE TABLE ... ENGINE NDB PARTITION BY KEY (userid);
- SELECT ... WHERE ... and userid = 12345;
You can determine if Distribution Awareness is take effect using...:
- EXPLAIN PARTITIONS SELECT ...\G
You application may scale if it is distribution aware. MySQL Cluster's Distribution Awareness comes in handy because you don't have to implement additional logics on your application. So, please consider using MySQL Cluster if you are going to shard your application.