Wednesday, December 14, 2016

In place transparent compression of MySQL binary logs

Any DBA who has administered a busy master knows how fast the disk space occupied by binary logs can grow. DBAs have no control over this: the growth depends on the workload, and the workload depends on many factors, e.g.:

- application changes (the applications start writing more due to code changes)
- traffic changes (the peak season arrives, your workload doubles in size)
- infrastructure changes (the devops add more servers)
- business changes (new business flows add to the existing workload)

So either you have been thoughtful and planned in advance for a storage space large enough to handle the growth in the number of binary logs, or, sooner or later, you will face the usual dilemma - how many retention days do you dare give up to accommodate the binlog growth?

In my case, I was very thoughtful, but the boss didn't listen and gave me servers with very limited binlog storage space and, more importantly, with internal disks, so there was no way to add more disk space on demand. Pity, as I had previously found a way to move binlogs to a different filesystem without downtime. But unfortunately, there is no "different filesystem" this time. I'm stuck with what I have.

So the obvious answer at first was to reduce the retention of binary logs via the expire_logs_days variable. Not the ideal option: in my situation, I had to reduce it to a mere 2 days, not even enough to survive a weekend. At the same time, I didn't like the idea of keeping more days and being woken up at 3am by a filesystem-full alert.
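For the record, changing the retention is just a matter of:

mysql> SET GLOBAL expire_logs_days = 2;

plus the matching expire_logs_days = 2 line in my.cnf to make the setting persistent across restarts.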

So the idea of compressing the binary logs came to my mind, but I wanted to do it in a transparent way, preserving the existing binlog retention mechanisms (the expire_logs_days variable and the PURGE BINARY LOGS command should continue to work as before) and without breaking the MySQL server and/or replication. Ideally, the timestamps of the files themselves should also be preserved.

I decided to try and compress the binary logs in place. In place means that when you look at your binlogs folder, you don't notice anything unusual. Here's (part of) the listing of one of our servers with compression active:

-rw-rw---- 1 mysql mysql  13631245 Dec 14 14:17 dbmd01.079826
-rw-rw---- 1 mysql mysql  13909321 Dec 14 14:23 dbmd01.079827
-rw-rw---- 1 mysql mysql  13656948 Dec 14 14:30 dbmd01.079828
-rw-rw---- 1 mysql mysql  13915222 Dec 14 14:37 dbmd01.079829
-rw-rw---- 1 mysql mysql 104883640 Dec 14 14:44 dbmd01.079830
-rw-rw---- 1 mysql mysql 104898032 Dec 14 14:50 dbmd01.079831
-rw-rw---- 1 mysql mysql 104861122 Dec 14 14:57 dbmd01.079832

You can spot which files are compressed by looking at the file size, but everything else is preserved. Looking at them with the file utility, here's what's in each of them:

dbmd01.079826: gzip compressed data
dbmd01.079827: gzip compressed data
dbmd01.079828: gzip compressed data
dbmd01.079829: gzip compressed data
dbmd01.079830: MySQL replication log
dbmd01.079831: MySQL replication log
dbmd01.079832: MySQL replication log

Basically, after compression, the compressed file retains the original file name. This is to keep MySQL happy - you know, these files are tracked in the index file and read when mysqld starts.
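For the curious: the index file is just a plain text file with one binlog path per line; with the files from the listing above, it would look something like this:

./dbmd01.079826
./dbmd01.079827
./dbmd01.079828
...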
Speaking of which, here's what happens when you start mysqld  with in-place compression:

2016-10-26 03:06:13 0 [Note] /usr/sbin/mysqld (mysqld 5.6.34-79.1-log) starting as process 1407 ...
[ ... ]
2016-10-26 03:06:13 1407 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 2722915
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL

Although these entries are flagged as ERROR, mysqld simply ignores the compressed files and, more importantly, does not touch them.

This way, compressed binlog files can be managed exactly like uncompressed, standard ones. Of course, I need to remember that if I want to peruse their content, I have to uncompress them before feeding them to the mysqlbinlog utility.
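For example, using one of the compressed files from the listing above:

$ cp dbmd01.079826 /tmp/dbmd01.079826.gz
$ gunzip /tmp/dbmd01.079826.gz
$ mysqlbinlog /tmp/dbmd01.079826 | less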

Last but not least, in order to ensure that we do not inadvertently compress binlogs that haven't been downloaded by all slaves yet,  the following extra safety rules are enforced:

- do not compress any binlog which is not at least 1 hour old 
- only compress binlogs that have a sequence number which is less than the current sequence number  of the most lagging slave

The script runs every 10 minutes: it extracts the list of binary logs that aren't compressed yet, applies the above rules, and then compresses them.
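The core of the idea boils down to something like the sketch below - a simplified, hypothetical version, not the actual compress_binlog.sh mentioned at the end of this post; the paths, the base name and the way the safe sequence number is obtained are assumptions for illustration only:

#!/bin/bash
#
# Simplified sketch of in-place binlog compression - NOT the real compress_binlog.sh.
# BINLOG_DIR, BASENAME and SAFE_SEQ are placeholders.

BINLOG_DIR=/var/lib/mysql
BASENAME=dbmd01
SAFE_SEQ=$1    # lowest binlog sequence number still needed by the most lagging slave
[ -z "$SAFE_SEQ" ] && exit 1

for f in "$BINLOG_DIR/$BASENAME".[0-9]*
do
    # skip files that are already compressed
    file "$f" | grep -q "gzip compressed data" && continue
    # rule 1: do not compress any binlog which is not at least 1 hour old
    [ -n "$(find "$f" -mmin -60)" ] && continue
    # rule 2: only compress binlogs below the most lagging slave's sequence number
    seq=${f##*.}
    [ $((10#$seq)) -ge "$SAFE_SEQ" ] && continue
    # compress to a temporary file, preserve ownership and timestamps,
    # then replace the original file in place (same name, same mtime)
    gzip -c "$f" > "$f.tmp" &&
        chown mysql:mysql "$f.tmp" &&
        touch -r "$f" "$f.tmp" &&
        mv "$f.tmp" "$f"
done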

I have set expire_logs_days back to 7 days and have plenty of disk space for extra growth now....



You can find compress_binlog.sh, the script I crafted for the above, on my GitHub page.

One caveat: the script hasn't been tested with binlog sequence numbers greater than 999,999. I am actually curious to see what happens to the binlog file name when we reach one million - I'll need to either be very patient, or have a look at the source code... :-)

Enjoy!

Rick

Thursday, November 3, 2016

Can RocksDB and InnoDB play along in MySQL?

My recent post about importing a big dataset from InnoDB into MyRocks has attracted quite a lot of attention (thank you Mark!), and it has also been pointed out that what I wrote about the coexistence of MyRocks and InnoDB was incorrect.

In fact, I had asked about it at Percona Live last month but got a negative answer... plus I had tried it at first and got a mysqld crash during crash recovery once, so, since it was not important for my use case, I went ahead and disabled InnoDB entirely.

But of course, as I have written previously, using both engines in the same server is something that I would really, really love to have, so I thought I'd give it a try with a very simple use case.

So I created the following two tables, each one with a different engine:

Create Table: CREATE TABLE `some_innodb_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `descr` char(10) COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin

Create Table: CREATE TABLE `some_rocksdb_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `descr` char(10) COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=ROCKSDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin


The idea was to fill the two tables with some data, then kill the mysqld process and verify that the crash recovery was able to recover both engines fully.

Speaking with my friend, the great René Cannaò, he suggested a very simple way to set up a quick test for the above. Basically, it would be enough to prepare a file with the needed insert statements (one file per engine) and then run many mysql clients in parallel, feeding each file of statements into them.

So I quickly created the two files :

$ for i in $(seq 1 1000000); do echo "insert into some_rocksdb_table values (NULL, '$i');"; done > rocksdb_testfile.sql
$ for i in $(seq 1 1000000); do echo "insert into some_innodb_table values (NULL, '$i');"; done > innodb_testfile.sql

The generated files would look like this:

insert into some_innodb_table values (NULL, '1');
insert into some_innodb_table values (NULL, '2');
insert into some_innodb_table values (NULL, '3');
insert into some_innodb_table values (NULL, '4');
insert into some_innodb_table values (NULL, '5');
.....

Now, I needed to instrument the parallel launch of several mysql clients. The idea is that when the mysql server gets killed, each client will spit out the line number of the statement it was executing when the server died; so I could just take that number for each client, subtract one, and easily get the number of committed transactions that I should find on each engine. This works because we are running in auto-commit mode, so each statement we send is a transaction of its own.

The trouble here was capturing the information spit out by each mysql client in an ordered manner, as all the clients would be launched in the background from the shell simultaneously. So I crafted a small script that captures each client's output and saves it to a dedicated log file (I had to be able to differentiate the sums for the two engines).

#!/bin/bash
#
echo -n "Password: "
stty -echo
read pass
stty echo
echo
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_1.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_1.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_2.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_2.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_3.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_3.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_4.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_4.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_5.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_5.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_6.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_6.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_7.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_7.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_8.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_8.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_9.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_9.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_10.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_10.log &
wait

The script will launch 20 mysql clients in parallel (by running them in the background), 10 for each engine, and capture their output for later perusal. Each client will be fed one million inserts, so if you let this run to completion, it will insert 20 million rows, 10 million per table.

Before launching the above script I restarted the server with the proper settings for a crash-safe master, following the recommendations from the Wiki. The config used is the one below; the notable options here are sync_binlog=1 and innodb_flush_log_at_trx_commit=1 (for InnoDB) and rocksdb_disable_2pc=OFF (for MyRocks).

Please note that the coexistence of any two engines in MySQL at this time has problems around XA support - I got repeatable crashes and reported these here, even though this is very likely an upstream bug. I was advised to enable binary logging to work around this problem, and it actually worked fine when I did.

Another issue I stumbled into while trying to set up this small test is a bug regarding the auto_increment_increment setting, which I had accidentally set to 2 (we use this setup in clusters). You can check this particular issue here. Hey, turns out I had been quite lucky with my previous use case :-)


# INNO
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 10G

innodb_file_format=barracuda
innodb_log_buffer_size = 64M


# ROCKS

core-file
rocksdb
#skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin
binlog_format                  = ROW
transaction-isolation = READ-COMMITTED
log_bin                        = /storage/rockstat/binlog/rocks
expire_logs_days               = 1
sync_binlog=1
rocksdb_disable_2pc=OFF


rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_background_compactions=8
rocksdb_max_total_wal_size=4G
rocksdb_max_background_flushes=4
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6
rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;max_write_buffer_number=4;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true



Well, I launched my script and waited a little while. While waiting I was curious to compare the insert speed between InnoDB and RocksDB:


mysql> select count(*) as innodb from some_innodb_table; select count(*) as rocksdb from some_rocksdb_table;
+---------+
| innodb  |
+---------+
| 4544377 |
+---------+
1 row in set (1.28 sec)

+---------+
| rocksdb |
+---------+
| 4361642 |
+---------+
1 row in set (1.85 sec)


I was very happy to verify that RocksDB was able to keep up with InnoDB in this little insert speed test :-) Actually, InnoDB was about 10% faster, but remember that we are running with MyRocks' two-phase commit enabled here (whereas it is normally OFF). Also, InnoDB insert performance will weaken over time, whereas MyRocks should hold up just fine.

Anyways, I went and killed the mysqld process with kill -9.  My script exited and left 20 log files for my perusal:

bash-4.1$ ls -l *log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_1.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_10.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_2.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_3.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_4.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_5.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_6.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_7.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_8.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_9.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_1.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_10.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_2.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_3.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_4.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_5.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_6.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_7.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_8.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_9.log

Each file contained the line number of the statement the client was executing when the server was killed, for example:

-bash-4.1$ cat rocksdb_1.log 
Warning: Using a password on the command line interface can be insecure.
ERROR 2013 (HY000) at line 476838: Lost connection to MySQL server during query

In this example, the client had inserted 476837 rows successfully and died on the 476838th.

Another small script helped me gather the final information that I needed to compare with a count(*) later:

#!/bin/bash
#
# sum up the line numbers reported by the failed clients, then subtract one per
# client: the statement each client died on was never committed
n=$(fgrep ERROR innodb*log | wc -l)
c=$(echo $(fgrep ERROR innodb*log | cut -d " " -f 6 | cut -d":" -f 1 | tr "\n" "+" | sed -e "s/+$//") | bc)
echo "InnoDB count: $((c-n))"
n=$(fgrep ERROR rocksdb*log | wc -l)
c=$(echo $(fgrep ERROR rocksdb*log | cut -d " " -f 6 | cut -d":" -f 1 | tr "\n" "+" | sed -e "s/+$//") | bc)
echo "RocksDB count: $((c-n))"


Here's what I got after the kill:

InnoDB count: 4985188
RocksDB count: 4769893

Cool!
Now, it was time to start MySQL again and let it do crash recovery, then compare the # of rows in each table with the above numbers, hoping for a match.

Unfortunately the fun stopped here,  with the following assertion:

mysqld: /home/rpizzi/mysql-fb/rocksdb/db/version_builder.cc:142: void rocksdb::VersionBuilder::Rep::CheckConsistency(rocksdb::VersionStorageInfo*): Assertion `f1->largest_seqno > f2->largest_seqno || (f1->largest_seqno == 0 && f2->largest_seqno == 0)' failed.
12:40:42 UTC - mysqld got signal 6 ;
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x50000
/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0x9baa7c]
/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x3a7)[0x6ff7f7]
/lib64/libpthread.so.0(+0xf7e0)[0x7fa8e07d07e0]
/lib64/libc.so.6(gsignal+0x35)[0x7fa8de77b5e5]
/lib64/libc.so.6(abort+0x175)[0x7fa8de77cdc5]
/lib64/libc.so.6(+0x2b70e)[0x7fa8de77470e]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x7fa8de7747d0]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb14VersionBuilder6SaveToEPNS_18VersionStorageInfoE+0x143)[0xdb6493]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb10VersionSet11LogAndApplyEPNS_16ColumnFamilyDataERKNS_16MutableCFOptionsERKNS_10autovectorIPNS_11VersionEditELm8EEEPNS_17InstrumentedMutexEPNS_9DirectoryEbPKNS_19ColumnFamilyOptionsE+0x3fe)[0xc838fe]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb10VersionSet11LogAndApplyEPNS_16ColumnFamilyDataERKNS_16MutableCFOptionsEPNS_11VersionEditEPNS_17InstrumentedMutexEPNS_9DirectoryEbPKNS_19ColumnFamilyOptionsE+0x44)[0xc0dc54]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb6DBImpl15RecoverLogFilesERKSt6vectorImSaImEEPmb+0x1b8b)[0xc0545b]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb6DBImpl7RecoverERKSt6vectorINS_22ColumnFamilyDescriptorESaIS2_EEbbb+0x8dc)[0xc062bc]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb2DB4OpenERKNS_9DBOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_22ColumnFamilyDescriptorESaISD_EEPSC_IPNS_18ColumnFamilyHandleESaISJ_EEPPS0_+0x784)[0xc06ee4]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb13TransactionDB4OpenERKNS_9DBOptionsERKNS_20TransactionDBOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_22ColumnFamilyDescriptorESaISG_EEPSF_IPNS_18ColumnFamilyHandleESaISM_EEPPS0_+0x819)[0xd2af19]
/usr/local/mysql/bin/mysqld[0xb95a80]
/usr/local/mysql/bin/mysqld(_Z24ha_initialize_handlertonP13st_plugin_int+0x48)[0x619908]
/usr/local/mysql/bin/mysqld[0x79e76b]
/usr/local/mysql/bin/mysqld(_Z11plugin_initPiPPci+0x5e0)[0x79f3c0]
/usr/local/mysql/bin/mysqld[0x60a5e7]
/usr/local/mysql/bin/mysqld(_Z11mysqld_mainiPPc+0x4d9)[0x60d8d9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa8de767d1d]
/usr/local/mysql/bin/mysqld[0x5fd329]


Basically, while InnoDB recovered just fine, crash recovery for MyRocks failed with an assertion. I have been unable to recover from this situation (even using the recovery modes suggested on the MyRocks wiki) and had to delete all data :-(
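For reference, the recovery modes the wiki talks about should be the ones controlled by the rocksdb_wal_recovery_mode variable, which maps to RocksDB's WAL recovery modes. A sketch of what I mean, to be double-checked against the wiki before relying on it:

# double-check the exact values on the MyRocks wiki - this is just a sketch
# 0 = kTolerateCorruptedTailRecords
# 1 = kAbsoluteConsistency
# 2 = kPointInTimeRecovery
# 3 = kSkipAnyCorruptedRecords
rocksdb_wal_recovery_mode=2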

I have raised an issue and hope to be able to repeat this test soon....

Monday, October 31, 2016

MyRocks: migrating a large MySQL dataset from InnoDB to RocksDB to reduce footprint

I have been following Facebook's MyRocks project (and Mark Callaghan's blog) for a long time. An LSM-based engine for MySQL is actually a great idea.

We all know that InnoDB sucks at INSERTs. B-trees in general suck at insertion speed, and the more rows you insert, the worse it gets. There are many blog posts on the web that show the insert speed degradation in InnoDB as the number of rows in the table grows. Things get much worse, much faster, if your primary key is a random key, for example a UUID.

We hit this problem with our caching servers (yes, we do caching with MySQL!), and in order to scale these servers up, a couple of years ago we moved to the TokuDB engine with great success. TokuDB is based on fractal tree technology and guarantees the same insert speed no matter how many rows you have in the table; furthermore, it has no problems with random INSERTs (inserting rows with random PK values) and it also achieves great compression compared to InnoDB.

Enter RocksDB and LSM (Log-Structured Merge) technology: we are entering a new planet as far as data storage goes. The idea of an "append only" approach to writing database rows is something I had never heard of before. Databases are usually the most random writers ever! And although SSDs do not suffer from write randomness, there are other factors that need to be taken into consideration. I won't go into more detail about how LSM or RocksDB work, or why LSM is good for your SSD and your workload, as there is plenty of information about it on Mark's blog (see above) or on the official pages for RocksDB and for MyRocks (the RocksDB engine for MySQL) on GitHub; just check the Wiki there.

For quite some time I had dreamed of giving MyRocks a try with our giant (at least by our standards), write-mostly tracking database, currently occupying about 2.1T of disk space and growing by about 200G per month (and yes, we DO have data retention in place). This cluster has a very write-intensive workload, and the written data is very seldom read, usually to extract statistical reports or for application or incident post-mortem analysis. At the time of this post, it contains approximately 830 tables.
So I was really excited to hear Yoshinori Matsunobu announce at the recent Percona Live AMS that MyRocks was ready for production use. I explicitly asked him whether partitioning was supported by MyRocks (as we use range partitioning to prune old data) and got a positive response. Yay!!

Even though it is production ready, MyRocks is still in its early stages, so I needed to clone the git repository and compile the server from source code. I also had to compile the GCC compiler first, as it is a good idea to use one of the GCC versions tested by the Facebook team, so that I can report any issues that may happen later. Peter Z announced that MyRocks will be included in Percona Server soon, but if you want to give it a try now, this is the only way.

The next step was to obtain a consistent dump of the cluster, along with the master position, in order to add the MyRocks slave to the cluster after the data import. I knew this dump was going to take a long time, so instead of using mysqldump I decided to use mydumper, the parallel dump-and-load tool that I have used many times in the past, especially to migrate out of Amazon RDS (anyone?).

Also, I knew that some editing of the dump was required in order to accomplish the following:

- remove references to InnoDB and/or replace them with ROCKSDB
- introduce binary collation required by MyRocks  (insert  "COLLATE latin1_bin" or "COLLATE utf8_bin" in the table definitions)
- remove InnoDB compression where enabled (many of the larger tables are compressed)

So I downloaded, compiled and installed mydumper and after a whopping 44 hours, my dump was complete (and occupying about 2T of disk space uncompressed).

mydumper creates two files for each table: one contains the table creation command and the other contains the actual dump of the table's data. I crafted a simple script that goes through all the table creation files and applies the above changes to them:


#!/bin/bash
#
# for every table definition produced by mydumper: switch the engine to ROCKSDB,
# force a binary collation and strip InnoDB compression
for f in bigdump/*schema*sql
do
  fn=$(basename $f)
  sed -e "s/InnoDB/ROCKSDB/g" \
      -e "s/ DEFAULT CHARSET=latin1/ DEFAULT CHARSET=latin1 COLLATE=latin1_bin/g" \
      -e "s/ DEFAULT CHARSET=utf8/ DEFAULT CHARSET=utf8 COLLATE=utf8_bin/g" \
      -e "s/ ROW_FORMAT=COMPRESSED/ /g" < bigdump/$fn > bigdump/$fn.mangled
  mv bigdump/$fn tmp/$fn.orig
  mv bigdump/$fn.mangled bigdump/$fn
done


So I was ready to import this 2T of stuff into MyRocks. Of course, this is also a very time-consuming operation when you have more than a few G's to import, so it is really important to use all the available optimizations in order to speed up the bulk load.

Turns out that MyRocks contains quite a few tweaks to help with this; the most notable ones being rocksdb_skip_unique_check, which disables checking on unique constraints, and rocksdb_commit_in_the_middle, which only commits every 1,000 rows or so. More info about options suitable for bulk loading data can be found on the Wiki page here. Beware that these settings cannot be used in regular production mode, as they can cause corruption of your data! Therefore it is best to create a dedicated config for bulk data loading purposes only, and use the standard "production" config after the import is completed.
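For the load itself, mydumper's companion tool myloader can replay the dump in parallel; an invocation would look something along these lines (paths, credentials and thread count are illustrative, not necessarily what I used):

# just an example - adjust paths, credentials and thread count
$ myloader --directory=bigdump \
           --socket=/storage/rockstat/data/mysql.sock \
           --user=rpizzi --password="$PASS" \
           --threads=8 \
           --overwrite-tables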

Side note: before you attempt to load large datasets into MyRocks, make sure you set your open file limit high enough! I learned the hard way that MyRocks keeps one file descriptor open for each SST file you have in your database, and the number of SST files can grow very high when you are importing lots of data. In my case, I topped 34,000 open file descriptors when the load was completed, and they are kept open even during normal use. I had originally reported this issue, and following that a better default was recommended to me for the variable target_file_size_base in rocksdb_default_cf_options. Changing the recommended value from 32m to 64m made a big difference in the number of file descriptors required, dropping the above number to 12,500.

After a couple of failed attempts at importing the data (mostly due to me being too lazy to properly read the available documentation), I finally managed to complete the import with satisfying results, using the configuration below. You will notice from the config that MyRocks cannot coexist with InnoDB at this time. I have been inquiring with Yoshinori about this, and the reason is that, at this time, the crash recovery mechanisms of the two engines would conflict. I can't wait for the day when we will be able to use RocksDB and InnoDB in the same server!


core-file
rocksdb
skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin

# config for bulk load

rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_total_wal_size=4G
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6
rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=256;level0_stop_writes_trigger=256;max_write_buffer_number=16;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true
rocksdb_override_cf_options=system={memtable=skip_list:16}
rocksdb_skip_unique_check=1
rocksdb_commit_in_the_middle=1
rocksdb_write_disable_wal=1
rocksdb_max_background_flushes=40
rocksdb_max_background_compactions=40


During the import I kept an eye on the disk space (footprint) occupied by MyRocks' data directory. MyRocks creates a .rocksdb subfolder in your datadir and stores all its files in there, so I ran a script that monitored both the number of file descriptors in use (with lsof) and the footprint (with du) every 60 seconds:

3888
223G /storage/rockstat/data/.rocksdb
3959
229G /storage/rockstat/data/.rocksdb
3901
219G /storage/rockstat/data/.rocksdb
3924
220G /storage/rockstat/data/.rocksdb
3916
226G /storage/rockstat/data/.rocksdb
3815
211G /storage/rockstat/data/.rocksdb
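The loop itself was nothing fancy; roughly something like this (an approximation, not the exact script - only the path is taken from this test server):

#!/bin/bash
# rough sketch of the monitoring loop
while true
do
    # file descriptors currently held open by mysqld
    lsof -p "$(pidof mysqld)" | wc -l
    # on-disk footprint of the RocksDB data
    du -sh /storage/rockstat/data/.rocksdb
    sleep 60
done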

You can clearly see the compaction of data taking place while the data loads. Compaction pushes data to lower levels, packing and compressing it in the process, and therefore the footprint decreases while the import progresses. Cool!!

But what was really cool is finding out that MyRocks was able to pack 2.1T worth of data into a mere 611G  - reducing an already compressed dataset by more than 3 times! This is even more amazing if you consider that about a dozen of the larger tables in this particular dataset contain BLOBs which have  compressed data in them. In the steps above I had disabled compression at the engine level where it was configured in the source dataset, but of course compressed data in BLOBs will remain compressed and MyRocks did a very good job at compressing regardless.

Importing the dump took about 17 hours. During the import I ran into some stalls that I reported here. Once the import completed, I shut the server down and replaced the my.cnf used for the bulk load with the "production" one below. Please note that I had to keep rocksdb_block_cache_size low because other things were using memory on the test server at the same time; you should raise it according to the amount of memory available, keeping in mind that MyRocks, similarly to TokuDB, also uses the filesystem cache extensively. I didn't find a recommendation on the MyRocks wiki, so I decided to set this parameter to 50% of the available server memory, which is the recommended value for TokuDB.


core-file
rocksdb
skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin
rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_background_compactions=8
rocksdb_max_total_wal_size=4G
rocksdb_max_background_flushes=4
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6

rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;max_write_buffer_number=4;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true


Using the information saved by mydumper in the "metadata" file, I set up replication from the actual InnoDB-based slave to the MyRocks slave and let it catch up. I used MTS (Multi-Threaded Slave) to speed up the catch-up (parallel replication happens to work very well for this cluster's workload) and the slave caught up pretty quickly. After the catch-up, I stopped the slave and disabled parallel replication, as it is not compatible with relay_log_recovery in case of a slave crash, and the latter option is needed by MyRocks to guarantee a crash-safe slave.
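Conceptually the setup was as simple as the sketch below; the host, credentials and coordinates are placeholders, as in reality they come from mydumper's "metadata" file:

-- placeholders below; the real coordinates come from mydumper's "metadata" file
CHANGE MASTER TO
  MASTER_HOST = 'innodb-slave.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '...',
  MASTER_LOG_FILE = 'dbstat02.000001',
  MASTER_LOG_POS = 4;
SET GLOBAL slave_parallel_workers = 8;   -- enable MTS for the catch-up phase
START SLAVE;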


I quickly found out that having a slave that uses a different engine than the master also needs some other tweaks in the configuration. Remember that I had to disable InnoDB in order to make MyRocks work properly. So what happens when some DDL that explicitly references ENGINE=InnoDB is executed on the master?

Here's the answer:

               Last_SQL_Errno: 1286
               Last_SQL_Error: Worker 7 failed executing transaction '' at master log dbstat02.034383, end_log_pos 99530607; Error 'Unknown storage engine 'innodb'' on query. Default database: 'rick_rocks'. Query: 'create table ttable2 (id int not null auto_increment primary key) engine=innodb'


D'oh! Of course, this is not what I wanted. I quickly remembered an SQL_MODE option that I had always wondered what it could be used for: NO_ENGINE_SUBSTITUTION.

From the manual page:

With NO_ENGINE_SUBSTITUTION disabled, for CREATE TABLE the default engine is used and a warning occurs if the desired engine is unavailable. For ALTER TABLE, a warning occurs and the table is not altered.


Hey, sounds like exactly what I want!! We never use ALTER TABLE anyways as we do everything  via pt-online-schema-change, so this option should be all we need on the slave.

This option (NO_ENGINE_SUBSTITUTION) has been enabled by default since MySQL 5.6.6, so I needed to revert sql_mode to a blank string by setting the following in my.cnf:


sql_mode=''


To my dismay, after restarting the MyRocks slave, the behaviour was unchanged.  A bit of digging in the MySQL source code and I discovered that the slave uses the master's SQL_MODE, and not its own! The mode gets written to the binary log for each and every transaction, and is used by the slave when applying events.
So, I went to the master and changed the SQL_MODE there. After restarting replication, now the behaviour is correct and a table created with Engine=INNODB is created with engine ROCKSDB on the slave.
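In practice, that boils down to the following on the master (plus the same sql_mode='' line in its my.cnf so it survives a restart); note that only connections opened after the change pick up the new global value:

mysql> SET GLOBAL sql_mode = '';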


Finally, here's a comparison of the disk usage between a classic "InnoDB" slave and our new "RocksDB" one, over the last 48 hours:

[graph: MyRocks slave disk usage, last 48 hours]

[graph: InnoDB slave disk usage, last 48 hours]
Can you spot without using the caption which one of the two is our MyRocks slave? The compaction algorithm is clearly visible in the graph. The two big drops around 6am are when the older partitions are dropped every night to apply retention.

The graphs above show that the MyRocks slave grew from 47.75% to 48%, while the InnoDB slave grew from 77% to 78%. What they don't show is that the MyRocks storage is a mere 1.4T, whereas the InnoDB one is double that, at 2.8T!

So, bottom line is, MyRocks grew 0.25% of 1.4T = 3.5G while InnoDB grew 1% of 2.8T = 28G. This is an 8x improvement!

We will be monitoring this new slave for some time trying to compare CPU metrics and to observe general behaviour, like lag and general reliability.


But for now, I would say that MyRocks really ... rocks!!!  Well done FB Team!