A Japanese Geek's MySQL Blog.

2010-03-30

Not Only NoSQL!! Uber Scaling-Out with SPIDER storage engine

The history tells that a single RDBMS node cannot handle tons of traffics on web system which come from all over the world, no matter how the database is tuned. MySQL has implemented a master/slave style replication built-in for long time, and it has enabled web applications to handle traffics using a scale-out strategy. Having many slaves has been suitable for web sites where most of traffics are reads. Thus, MySQL's master/slave replication has been used on many web sites, and is being used still.

However, when a site grow large, amount of traffic may exceed the replication's capacity. In such a case, people may use memcached. It's an in-memory, very fast and well-known KVS, key value store, and its read throughput is far better than MySQL. It's been used as a cache for web applications to store 'hot' data with MySQL as a back-end storage, as it can reduce read requests to MySQL dramatically.

While 1:N replication can scale read workload and memcached can reduce read requests, it cannot ease write load well. So, write traffic gets higher and higher when a web site becomes huge. On such web sites, a technique called "Sharding" has been used; it's a technique that the application choose an appropriate MySQL server from several servers.

In that way, MySQL+memcached has been a de-fact standard data store on huge web sites for long time.

Since web applications are getting larger still, especially on social media sites, write load is getting higher and higher as people communicate in real-time. In such area, yet another technique is required to handle the write load. Then, some people have chosen NoSQL solutions instead of MySQL+memcached. NoSQL is a kind of buzz word, IMHO, which represents non-relational databases which doesn't require SQL access. Despite lack of SQL access, some NoSQL softwares are suitable for huge scale web applications, like Cassandra. Although people cannot JOIN records on NoSQL system, it is not possible on RDBMS over the shards as well. So, MySQL isn't used as a RDBMS, is used as a data store without joins in other words, on such a web application in the first place.

For further information of this kind of thoughts, I recommend you to read Mark Calleghan's post: http://mysqlha.blogspot.com/2010/03/plays-well-with-others.html
and this post: http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king

Technically, it is possible to handle huge amount of traffics using MySQL, but a running cost gets expensive, Twitter says. As these techniques are separate ones, so those people have to spent their time to learn all of three who implement the application over them and manage them. On the other hand, Cassandra can handle more traffics as a single database management system, so people only have to learn it instead of three. Sounds great? But, is it a really good choice?

No! They're not aware of yet another solution, say SPIDER storage engine!

SPIDER for MySQL
http://spiderformysql.com/

SPIDER is a storage engine developed by a Japanese MySQL hacker, Mr. Kentoku Shiba, it makes use of MySQL's partitioning functionality and store partitioned data onto remote servers. I may say it's a Sharding storage engine. While flexibility of MySQL's storage engine API enables such an engine, but I value Kentoku's design a lot.

The following picture depicts how SPIDER storage engine works. (This is a snippet from the site above.)
In this entry, I do not explain how to use SPIDER storage engine, but I tell you how great its ability is. If you want to try it out, please refer to Giuseppe Maxia's post.

Please look at the following graph, which represents an INSERT performance comparing a single MySQL server (InnoDB), 2 SPIDER node + 2 backend MySQL server and 4 SPIDER node + 4 backend MySQL Server. You can see how good it scales.
The next graph is a SELECT performance. Read scales pretty good as well.
Red circles indicate where working set sizes exceed memory sizes. While performance drops when a working set size exceeds the available memory size, SPIDER is able to expand the memory so that a working set fits in it. SPIDER can make use of memory on all remote servers, as if there is a huge buffer pool in total.

For more information about SPIDER's performance test, please refer to Kentoku's slide. It's surprising.

The most significant problem for twitter is to scale out read/write load with less running cost. Unfortunately, they had chosen NoSQL solution due to the fact that "MySQL replication + memcached + sharding" cannot handle write intensive workload well. However, such a problem can be resolved using SPIDER storage engine with MySQL!

Generally, KVS cannot solve certain problems like below:
  • JOIN
  • Sort (ORDER BY)
  • Aggregation (GROUP BY)
When using KVS, these problems can be handled using MapReduce, however, we can process the same task using a very simple SQL in general. Thus, SQL allows us to develop a complex logic very efficiently. When I ask Kentoku permission to write an article about his storage engine, he told me his philosophy like below:
I think that the most significant benefit to use RDB is its usefulness and flexibility. It is a very important characteristic for developers in order to keep the application competitive, especially for those developers who have to add new features/functionalities day by day, like web services. I develop SPIDER storage engine in order to provide developers such useful and flexible RDB's characteristics, even on the environment where the traffic and data is huge thus Sharding is required.
I 100% agree with his opinion. If you are facing the problem caused by high traffic and huge data just like twitter, please consider to use SPIDER storage engine before migrating to NoSQL solutions.

8 comments:

  1. Are there QPS stats for how Spider stacks up against other engines?

    Jonas' stats for NDB have set the bar pretty high:
    http://jonasoreland.blogspot.com/2010/03/fast-io-and-ndb.html

    ReplyDelete
  2. Hi Gary,

    No. While only InnoDB is compared to SPIDER+InnoDB in the slides, SPIDER can utilize any storage engines. Even NDB can be an underlying storage engine combined with SPIDER. So, comparison will not make sense.

    Kind regards,
    --
    Mikiya

    ReplyDelete
  3. Wah mantap tuh bro.. Pedang Katana memiliki banyak Jenis, memilih katana yang baik dengan melihat bahan pembuatannya dan siapa pembuatnya. Banyak juga pedang yang palsu yang beredar di Indonesia,yang pasti Pedang ini memiliki banyak sekali keunikan dan Manfaat. Mau lihat koleksi saya yang UNIK-UNIK seperti pedang katana asli, pedang damascus, badik, keris dan barang antik lainnya silahkan mampir di web saya di www.keris.biz

    ReplyDelete
  4. Thanks for sharing.Did you know that Spider is the first step when accessing a remote database and a storage engine?

    ReplyDelete
  5. what about this thing called data integrity? if you're using sharding, there is no longer referential or transactional integrity, the results could be disastrous in an environment that's anything more than some blog site

    ReplyDelete
  6. I never heard about Spider storage engine, looks pretty good, do you think it can handle a traffic of, lets say, 1000 people per day?

    ReplyDelete
  7. Packers Movers Online is the free Packers Movers Directory to get the information about Packers And Movers, Movers And Packers in india, packing moving services providers in India.

    ReplyDelete
  8. These facts are really interesting. Few of them were well known for me but many of them were brand new for me too!
    I will print this one out and show to my friends because they will be definitely interested in that. Thanks!
    MySQL

    ReplyDelete