Monday, October 31, 2016

MyRocks: migrating a large MySQL dataset from InnoDB to RocksDB to reduce footprint

I have been following Facebook's MyRocks project (and Mark Callaghan's blog) for a long time. An LSM-based storage engine for MySQL is a genuinely great idea.

We all know that InnoDB sucks at INSERTs. B-trees in general suck when it comes to insertion speed, and the more rows you insert, the more they suck at it. There are many blog posts on the web that show how InnoDB's insert speed degrades as the number of rows in the table grows. Things get much worse, much faster, if your primary key is a random key, for example a UUID.

We hit this problem with our caching servers (yes, we do caching with MySQL!), and in order to be able to scale these servers up, a couple of years ago we moved to the TokuDB engine with great success. TokuDB is based on fractal tree technology and delivers the same insert speed no matter how many rows are already in the table; furthermore, it has no problems with random INSERTs (inserting rows with random PK values) and it also achieves great compression compared to InnoDB.

Enter RocksDB and LSM (Log Structured Merge) technology: this is a whole new planet as far as data storage goes. The idea of an "append only" approach to writing database rows was something I had never heard of before. Databases are usually the most random writers out there! And although SSDs do not suffer from write randomness, there are other factors to take into consideration. I won't go into more detail about how LSM or RocksDB work, or why LSM is good for your SSD and your workload, as there is plenty of information about it on Mark's blog (see above) or on the official pages for RocksDB and for MyRocks (the RocksDB engine for MySQL) on GitHub; just check the Wiki there.

For quite some time I had been dreaming of giving MyRocks a try with our giant (at least by our standards) write-mostly tracking database, currently occupying about 2.1T of disk space and growing by about 200G per month (and yes, we DO have data retention in place). This cluster has a very write-intensive workload and the written data is very seldom read, usually to extract statistical reports or for application or incident post-mortem analysis. At the time of this post, it contains approximately 830 tables.
So I was really excited to hear Yoshinori Matsunobu announce at the recent Percona Live AMS that MyRocks was ready for production use. I explicitly asked him whether partitioning was supported by MyRocks (as we use range partitioning to prune old data) and got a positive response. Yay!!

Even though it is production ready, MyRocks is still in its early stages, so I needed to clone the git repository and compile the server from source code. I also had to compile the GCC compiler first, as it is a good idea to use one of the GCC versions tested by the Facebook team, so that any issues I report later can be reproduced. Peter Z announced that MyRocks will be included in Percona Server soon, but if you want to give it a try now, this is the only way.
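
For the record, the build boils down to something along these lines; treat it as a sketch, since the exact cmake flags and the recommended GCC version are documented in the facebook/mysql-5.6 wiki and may differ from what I show here:

# Build MyRocks (facebook/mysql-5.6) from source -- a sketch, not the
# authoritative procedure; check the project wiki for the tested toolchain.
git clone https://github.com/facebook/mysql-5.6.git
cd mysql-5.6
git submodule update --init   # pulls in bundled dependencies such as RocksDB, if vendored as submodules
cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SSL=system -DWITH_ZLIB=bundled
make -j"$(nproc)"
sudo make install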

The next step was to obtain a consistent dump of the cluster, along with the master position, in order to add the MyRocks slave to the cluster after the data import. I knew this dump was going to take a long time, so instead of using mysqldump I decided to use mydumper, the parallel dump-and-load tool that I have used many times in the past, especially to migrate out of Amazon RDS (anyone?).
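
The exact command line isn't important here, but the kind of mydumper invocation I mean looks roughly like this (host, credentials, thread count and output directory are placeholders):

# Parallel, consistent dump of the whole instance into bigdump/.
# mydumper also writes a "metadata" file with the binlog coordinates,
# which we will need later to set up replication.
mydumper \
  --host=source-slave --user=backup --password=secret \
  --threads=8 \
  --outputdir=bigdump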

Also, I knew that some editing of the dump was required in order to accomplish the following:

- remove references to InnoDB and/or replace them with ROCKSDB
- introduce binary collation required by MyRocks  (insert  "COLLATE latin1_bin" or "COLLATE utf8_bin" in the table definitions)
- remove InnoDB compression where enabled (many of the larger tables are compressed)

So I downloaded, compiled and installed mydumper, and after a whopping 44 hours my dump was complete (occupying about 2T of disk space, uncompressed).

mydumper creates two files for each table: one contains the table creation command and the other contains the actual dump of the table's data. I crafted a simple script that goes through all the table creation files and applies the changes above to them:


#!/bin/bash
#
# Rewrite the table definitions produced by mydumper so they can be loaded
# into MyRocks: swap the engine, force binary collations and drop InnoDB
# compression. The original files are kept in tmp/ as a backup.
mkdir -p tmp
for f in bigdump/*schema*sql
do
  fn=$(basename $f)
  sed -e "s/InnoDB/ROCKSDB/g" \
      -e "s/ DEFAULT CHARSET=latin1/ DEFAULT CHARSET=latin1 COLLATE=latin1_bin/g" \
      -e "s/ DEFAULT CHARSET=utf8/ DEFAULT CHARSET=utf8 COLLATE=utf8_bin/g" \
      -e "s/ ROW_FORMAT=COMPRESSED/ /g" < bigdump/$fn > bigdump/$fn.mangled
  mv bigdump/$fn tmp/$fn.orig
  mv bigdump/$fn.mangled bigdump/$fn
done


So I was ready to import this 2T of stuff into MyRocks. Of course, this is also a very time-consuming operation when you have more than a few gigabytes to import, so it is really important to use all the available optimizations to speed up the bulk load.

It turns out that MyRocks contains quite a few tweaks to help with this; the most notable ones are rocksdb_skip_unique_check, which disables checking of unique constraints, and rocksdb_commit_in_the_middle, which commits only every 1,000 rows or so. More info about options suitable for bulk loading data can be found on the Wiki page here. Beware that these settings cannot be used in regular production mode, as they can corrupt your data! It is therefore best to create a dedicated config for bulk data loading only, and switch back to the standard "production" config after the import is completed.

Side note: before you attempt to load large datasets into MyRocks, make sure you set your open file limit high enough! I learned the hard way that MyRocks keeps one file descriptor open for each SST file in your database, and the number of SST files can grow very high when you are importing lots of data. In my case, I topped 34,000 open file descriptors when the load completed, and they are kept open even during normal use. I originally reported this issue, and following that a better default was recommended to me for the variable target_file_size_base in rocksdb_default_cf_options: changing it from the originally recommended 32m to 64m made a big difference in the number of file descriptors required, dropping the above number to 12,500.
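
As a rough sketch of what I mean (the exact values, and whether you raise the limit via my.cnf, limits.conf or your init system, depend on your environment; the numbers below are placeholders and a single mysqld process is assumed):

# Check the limit the running mysqld actually got.
grep "open files" /proc/$(pidof mysqld)/limits

# Raise it at the MySQL level in my.cnf (mysqld asks the OS for this many):
#   [mysqld]
#   open_files_limit = 65535
#
# ...and/or at the OS level for the mysql user in /etc/security/limits.conf:
#   mysql  soft  nofile  65535
#   mysql  hard  nofile  65535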

After a couple of failed attempts at importing the data (mostly due to me being too lazy to properly read the available documentation), I finally managed to complete the import with satisfying results, using the configuration below. You will notice from the config that MyRocks cannot, at this time, coexist with InnoDB. I have been inquiring with Yoshinori about this, and the reason is that, at the moment, the crash recovery mechanisms of the two engines would conflict. I can't wait for the day when we will be able to use RocksDB and InnoDB in the same server!


core-file
rocksdb
skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin

# config for bulk load

rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_total_wal_size=4G
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6
rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=256;level0_stop_writes_trigger=256;max_write_buffer_number=16;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true
rocksdb_override_cf_options=system={memtable=skip_list:16}
rocksdb_skip_unique_check=1
rocksdb_commit_in_the_middle=1
rocksdb_write_disable_wal=1
rocksdb_max_background_flushes=40
rocksdb_max_background_compactions=40
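
One way to drive the load itself is myloader, mydumper's companion tool; the post doesn't show the exact command, so treat this as a sketch with placeholder host, credentials and thread count:

# Load the mangled dump in parallel into the server running with the
# bulk-load my.cnf above (sketch; adjust paths and credentials).
myloader \
  --host=127.0.0.1 --user=root --password=secret \
  --directory=bigdump \
  --threads=8 \
  --overwrite-tables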


During the import I kept an eye on the disk space (footprint) occupied by MyRocks' data directory. MyRocks creates a .rocksdb subfolder in your datadir and stores all its files there, so I ran a script to monitor both the number of file descriptors in use (with lsof) and the footprint (with du) every 60 seconds:

3888
223G /storage/rockstat/data/.rocksdb
3959
229G /storage/rockstat/data/.rocksdb
3901
219G /storage/rockstat/data/.rocksdb
3924
220G /storage/rockstat/data/.rocksdb
3916
226G /storage/rockstat/data/.rocksdb
3815
211G /storage/rockstat/data/.rocksdb
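
The post doesn't include the monitoring script itself; a minimal sketch of such a loop, assuming the datadir path shown above and a single mysqld process, could look like this:

#!/bin/bash
# Every 60 seconds, print the approximate number of file descriptors held
# by mysqld (via lsof) and the on-disk footprint of the .rocksdb subfolder
# (via du). Path and process lookup are assumptions of this sketch.
DATADIR=/storage/rockstat/data
PID=$(pidof mysqld)
while true
do
  lsof -p "$PID" | wc -l
  du -sh "$DATADIR/.rocksdb"
  sleep 60
done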

You can clearly see compaction taking place while the data loads. Compaction pushes data to lower levels, packing and compressing it in the process, which is why the footprint periodically decreases even as the import progresses. Cool!!

But what was really cool was finding out that MyRocks managed to pack 2.1T worth of data into a mere 611G, shrinking an already compressed dataset by more than 3 times! This is even more amazing if you consider that about a dozen of the larger tables in this particular dataset contain BLOBs holding already-compressed data. In the steps above I had disabled compression at the engine level where it was configured in the source dataset, but of course compressed data inside BLOBs stays compressed, and MyRocks still did a very good job.

Importing the dump took about 17 hours. During the import I ran into some stalls, which I reported here. Once the import completed, I shut the server down and swapped the my.cnf used for bulk loading with the "production" one below. Please note that I had to keep rocksdb_block_cache_size low because other things were using memory on the test server at the same time; you should raise it according to the amount of memory available, keeping in mind that MyRocks, like TokuDB, also makes extensive use of the filesystem cache. I didn't find a recommendation on the MyRocks wiki, so I set this parameter to 50% of the available server memory, which is the recommended value for TokuDB.


core-file
rocksdb
skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin
rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_background_compactions=8
rocksdb_max_total_wal_size=4G
rocksdb_max_background_flushes=4
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6

rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;max_write_buffer_number=4;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true


Using the information saved by mydumper in the "metadata" file, I set up replication from the existing InnoDB-based slave to the MyRocks slave and let it catch up. I used MTS (Multi-Threaded Slave) to speed up the catch-up (parallel replication happens to work very well for this cluster's workload) and the slave caught up pretty quickly. After that, I stopped the slave and disabled parallel replication, as it is not compatible with relay_log_recovery in case of a slave crash, and the latter option is needed by MyRocks to guarantee a crash-safe slave.
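
For reference, the "metadata" file contains the binlog file and position captured at dump time; pointing the new slave at the cluster then looks roughly like this (host, credentials and coordinates below are placeholders for this sketch):

# Coordinates come from bigdump/metadata, written by mydumper at dump time.
mysql -e "
  CHANGE MASTER TO
    MASTER_HOST='master-host',
    MASTER_USER='repl',
    MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000123',
    MASTER_LOG_POS=456;
  START SLAVE;
"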


I quickly found out that having a slave that uses a different engine than the master also requires some other tweaks in the configuration. Remember that I had to disable InnoDB to make MyRocks work properly. So what happens when some DDL that explicitly references ENGINE=InnoDB is executed on the master?

Here's the answer:

               Last_SQL_Errno: 1286
               Last_SQL_Error: Worker 7 failed executing transaction '' at master log dbstat02.034383, end_log_pos 99530607; Error 'Unknown storage engine 'innodb'' on query. Default database: 'rick_rocks'. Query: 'create table ttable2 (id int not null auto_increment primary key) engine=innodb'


D'oh! Of course, this is not what I wanted. Then I remembered an SQL_MODE option that I had always wondered about: NO_ENGINE_SUBSTITUTION.

From the manual page:

With NO_ENGINE_SUBSTITUTION disabled, for CREATE TABLE the default engine is used and a warning occurs if the desired engine is unavailable. For ALTER TABLE, a warning occurs and the table is not altered.


Hey, that sounds like exactly what I want!! We never use ALTER TABLE directly anyway, since we do everything via pt-online-schema-change, so this option should be all we need on the slave.

This option (NO_ENGINE_SUBSTITUTION) has been enabled by default since MySQL 5.6.6, so I needed to revert sql_mode to a blank string by setting the following in my.cnf:


sql_mode=''


To my dismay, after restarting the MyRocks slave the behaviour was unchanged. A bit of digging in the MySQL source code revealed that the slave uses the master's SQL_MODE, not its own! The mode is written to the binary log for each and every transaction and is used by the slave when applying events.
So I went to the master and changed the SQL_MODE there. After restarting replication, the behaviour is now correct and a table created with ENGINE=InnoDB on the master ends up with engine ROCKSDB on the slave.
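
A sketch of the change on the master (applied at runtime so that new binlog events pick it up; note that sessions opened before the change keep their old session value, and the setting should also go into the master's my.cnf to survive a restart):

# On the master: clear sql_mode so that replicated events no longer carry
# NO_ENGINE_SUBSTITUTION (also add sql_mode='' under [mysqld] in my.cnf).
mysql -e "SET GLOBAL sql_mode = '';"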


Finally, here's a comparison of the disk usage between a classic "InnoDB" slave and our new "RocksDB" one over the last 48 hours:




[Graph: disk space used (%), last 48 hours - MyRocks slave]

[Graph: disk space used (%), last 48 hours - InnoDB slave]




Can you spot, without looking at the captions, which of the two is our MyRocks slave? The compaction algorithm can clearly be seen at work in the graph. The two big drops around 6am are when the older partitions are dropped every night to apply retention.

The graphs above show that the MyRocks slave grew from 47.75% to 48% disk used, while the InnoDB slave grew from 77% to 78%. What they don't show is that the MyRocks volume is a mere 1.4T, whereas the InnoDB one is double that, at 2.8T!

So, the bottom line is: MyRocks grew by 0.25% of 1.4T = 3.5G, while InnoDB grew by 1% of 2.8T = 28G. That is an 8x improvement!

We will be monitoring this new slave for some time, comparing CPU metrics and observing general behaviour, such as lag and overall reliability.


But for now, I would say that MyRocks really ... rocks!!!  Well done FB Team!

18 comments:

  1. Wow. Sorry for not resolving your open issues. The QA team is small today (me).

    My theory is to use between 20% and 30% of RAM for the RocksDB block cache when sizeof(database) >> sizeof(RAM). But I have yet to confirm that with benchmark results -- http://smalldatum.blogspot.com/2016/09/tuning-rocksdb-block-cache.html

    1. For this particular workload, with the above config, it looks like it's not using more than 3.5G anyway, regardless of what I have set, so the unused memory still goes into the OS cache.

  2. Mark, you have been really helpful and got me going in the right direction, so a big thank you for your great posts on RocksDB and MyRocks (and for the hard work), and I am really pleased by your visit here to my newborn little space on the Internet. I'm looking forward to the day when RocksDB will coexist with InnoDB, and I'm sure I'll find many many use cases for it.

    Regarding the memory for the block cache, although I have given it 50% so far it is only using about 10%; my guess is that it only uses what it needs depending on the actual dataset. I confess I need to learn a bit more about the internals, but I was so eager to give it a try, I just couldn't wait :-)

    Thanks again!
    Rick

  3. It looks like --allow-multiple-engines option will allow both engines at the same time (https://github.com/facebook/mysql-5.6/commit/c380d14ee50059dcefc35da13d8713bbcad72061).

    1. Here's my try at it... http://mysqlnoob.blogspot.com/2016/11/can-myrocks-and-innodb-play-along-in-mysql.html

  4. MyRocks + InnoDB has been supported for some time. Our communication about that could have been better.
    https://github.com/facebook/mysql-5.6/issues/106

    1. I tried at first to avoid disabling InnoDB, but after a mysqld crash I was unable to complete crash recovery properly. Since I wanted a MyRocks-only slave, I didn't dig into the issue any deeper. But I will certainly experiment with the combo in the coming weeks!

    2. Gave it a try but already found issues..... https://github.com/facebook/mysql-5.6/issues/386

  5. I have executed a table checksum against the MyRocks slave and am pleased to tell that the slave is 100% consistent with the InnoDB master. Yay!!

  6. Hi,
    I'm new and don't know how to limit the amount of memory that MyRocks uses.
    My my.cnf:
    [mysqld]
    rocksdb
    default-storage-engine=rocksdb
    skip-innodb
    default-tmp-storage-engine=MyISAM
    collation-server=utf8_bin
    character-set-server = utf8
    transaction-isolation=READ-COMMITTED

    log-bin
    binlog-format=ROW
    user=zdeploy

    datadir=/home/zdeploy/myrock/data
    port=3333

    rocksdb_max_open_files=-1
    rocksdb_base_background_compactions=1
    rocksdb_max_background_compactions=8
    rocksdb_max_total_wal_size=2G
    rocksdb_max_background_flushes=4
    rocksdb_block_size=16384
    rocksdb_block_cache_size=2G
    rocksdb_table_cache_numshardbits=6

    # rate limiter
    rocksdb_bytes_per_sync=4194304
    rocksdb_wal_bytes_per_sync=4194304
    rocksdb_rate_limiter_bytes_per_sec=104857600 #100MB/s. Increase if you're running on higher spec machines

    # read free replication
    #rocksdb_rpl_lookup_rows=0

    rocksdb_default_cf_options=write_buffer_size=1024m;target_file_size_base=256m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;max_write_buffer_number=4;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio

    My expectation is that MyRocks doesn't use more than 10GB of RAM, but right now it is using >20G. I read the documentation but can't find anything that helps.

    Thank you!

    1. A better place for questions is https://groups.google.com/forum/#!forum/myrocks-dev

  7. From where do you get MyRocks -- MariaDB, Percona, FB MySQL?

    Are you using column families?

    When mysqld uses more than 20gb of RAM is that from RSS or VSZ?

    1. I get it from FB Mysql : https://github.com/facebook/mysql-5.6

      What do you mean by 'using column families'?
      All my config is in my comment above.
      And the memory used by mysqld is resident size (RSS).

    2. Column family usage is enabled by a special syntax in per-index comments in a create table statement. If you aren't doing that then you aren't using column families.

      From the my.cnf details above I don't know why RSS is 20gb. Which memory allocator do you use -- glibc, tcmalloc or jemalloc? I use jemalloc most of the time and RSS with glibc malloc has been too high in the past -- http://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html

  8. I am unhappy with Rocks for one reason: the binary collation does not allow case-insensitive searches, so it breaks any application that was ever working with InnoDB or TokuDB. Please correct me if I am wrong.

    1. True, binary collation means case-sensitive matches. Not sure if the MyRocks team will ever change this requirement (of having binary collation); I think it is tied to how the LSM engine works, but I may be wrong; I haven't looked at the latest MyRocks developments...

    2. I understand your unhappiness. But the team is small and we can't fix everything right away. Of course, we are happy to take contributions.

  9. Very helpful article about migrating a DB.
