Thursday, November 3, 2016

Can RocksDB and InnoDB play along in MySQL?

My recent post about importing a big dataset from InnoDB into MyRocks has attracted quite a lot of attention (thank you Mark!) and also it has been pointed out that what I wrote about coexistence of MyRocks and InnoDB was incorrect.

In fact, I had been asking about it at Percona Live last month, but got a negative answer.... plus I had tried it at first but got a mysqld crash during crash recovery once, so since it was not important for my use case I went ahead and disabled InnoDB entirely.

But of course, as I have written previously, using both engines in the same server is something that I would really, really love to have, so I thought to give it a try with a very simple use case.

So I created the following two tables, each one with a different engine:

Create Table: CREATE TABLE `some_innodb_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `descr` char(10) COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin

Create Table: CREATE TABLE `some_rocksdb_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `descr` char(10) COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=ROCKSDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin


The idea was to fill the two tables with some data, then kill the mysqld process and verify that the crash recovery was able to recover both engines fully.

Speaking with my friend, the great René Cannaò, he suggested a very simple way to set up a quick test for the above. Basically, it would be enough to prepare a file with the needed insert statement in them  (one file per engine) and then execute many parallel mysql clients, feeding the file with statements into them.

So I quickly created the two files :

$ for i in $(seq 1 1000000); do echo "insert into some_rocksdb_table values (NULL, '$i');"; done > rocksdb_testfile.sql
$ for i in $(seq 1 1000000); do echo "insert into some_innodb_table values (NULL, '$i');"; done > innodb_testfile.sql

The generated files would look like this:

insert into some_innodb_table values (NULL, '1');
insert into some_innodb_table values (NULL, '2');
insert into some_innodb_table values (NULL, '3');
insert into some_innodb_table values (NULL, '4');
insert into some_innodb_table values (NULL, '5');
.....

Now, I needed to instrument the parallel launch of several mysql clients. The idea is that when the mysql server gets killed, each of the clients will spit out the row number for the statement that it was executing when the server died; so I could just get that number for each of the clients, subtract one, and easily get the number of committed transactions that I should find on each engine. This should work because we are running in auto-commit mode, so basically each statement that we send is within its own transaction.

The trouble here was to capture the information spit out by the mysql client  in an ordered manner, as all the clients would be launched in background from the shell simultaneously. So I crafted a small script that captures each output and saves it to a dedicated log (I had to be able to differentiate the sums of the two engines).

#!/bin/bash
#
echo -n "Password: "
stty -echo
read pass
stty echo
echo
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_1.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_1.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_2.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_2.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_3.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_3.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_4.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_4.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_5.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_5.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_6.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_6.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_7.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_7.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_8.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_8.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_9.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_9.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_10.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_10.log &
wait

The script will launch 20 mysql clients in parallel (by running them in background), 10 for each engine, and capture their output for later perusal. Each client will be fed one million inserts, so if you let this run to completion, it will insert 20 million rows, 10 million per table.

Before launching the above script I restarted the server with the proper setting for a crash safe master, following the recommendations from the Wiki. The config used is the one below; notable options here are sync_binlog=1, innodb_flush_log_at_trx_commit=1 for  (InnoDB) and rocksdb_disable_2pc=OFF (for MyRocks).

Please note that the coexistence of any two engines in MySQL at this time has problems around XA support - I got repeatable crashes and reported these here even if this is very likely an upstream bug.  I have been recommended to enable binary logging to work around this problem, and it actually worked fine when I did.

Another issue I stumbled into while trying to set up this small test is a bug regarding the auto_increment_increment setting, which I accidentally had set to 2 (we use this setup in clusters).  You can check this particular issue here. Hey, turns out I have been quite lucky with my previous use case :-)


# INNO
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 10G

innodb_file_format=barracuda
innodb_log_buffer_size = 64M


# ROCKS

core-file
rocksdb
#skip-innodb
default-storage-engine=rocksdb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin
binlog_format                  = ROW
transaction-isolation = READ-COMMITTED
log_bin                        = /storage/rockstat/binlog/rocks
expire_logs_days               = 1
sync_binlog=1
rocksdb_disable_2pc=OFF


rocksdb_max_open_files=-1
rocksdb_base_background_compactions=1
rocksdb_max_background_compactions=8
rocksdb_max_total_wal_size=4G
rocksdb_max_background_flushes=4
rocksdb_block_size=16384
rocksdb_block_cache_size=16G
rocksdb_table_cache_numshardbits=6
rocksdb_default_cf_options=write_buffer_size=128m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;max_write_buffer_number=4;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true



Well, I launched my script and waited a little while. While waiting I was curious to compare the insert speed between InnoDB and RocksDB:


mysql> select count(*) as innodb from some_innodb_table; select count(*) as rocksdb from some_rocksdb_table;
+---------+
| innodb  |
+---------+
| 4544377 |
+---------+
1 row in set (1.28 sec)

+---------+
| rocksdb |
+---------+
| 4361642 |
+---------+
1 row in set (1.85 sec)


I was very happy to verify that RocksDB was able to keep up with InnoDB in this little insert speed test :-) Actually, InnoDB was about 10% faster, but remember that we are running with MyRock's 2 phase commit enabled here (where it is normally OFF). Also, InnoDB insert performance will weaken over time, where MyRocks should hold just fine.

Anyways, I went and killed the mysqld process with kill -9.  My script exited and left 20 log files for my perusal:

bash-4.1$ ls -l *log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_1.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_10.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_2.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_3.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_4.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_5.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_6.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_7.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_8.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_9.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_1.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_10.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_2.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_3.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_4.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_5.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_6.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_7.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_8.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_9.log

Each file contained the row corresponding to the statement that each client was executing, for example:

-bash-4.1$ cat rocksdb_1.log 
Warning: Using a password on the command line interface can be insecure.
ERROR 2013 (HY000) at line 476838: Lost connection to MySQL server during query

In this example, the client had inserted 476837 rows successfully and died on the 476838th.

Another small script helped me gather the final information that I needed to compare with a count(*) later:

#!/bin/bash 
#
c=$(echo $(fgrep ERROR innodb*log | cut -d " " -f 6 | cut -d":" -f 1 | tr "\n" "+" | sed -e "s/+$//") | bc)
echo "InnoDB count: $((c-1))"
c=$(echo $(fgrep ERROR rocksdb*log | cut -d " " -f 6 | cut -d":" -f 1 | tr "\n" "+" | sed -e "s/+$//") | bc)
echo "RocksDB count: $((c-1))"


Here's what I got after the kill:

InnoDB count: 4985188
RocksDB count: 4769893

Cool!
Now, it was time to start MySQL again and let it do crash recovery, then compare the # of rows in each table with the above numbers, hoping for a match.

Unfortunately the fun stopped here,  with the following assertion:

mysqld: /home/rpizzi/mysql-fb/rocksdb/db/version_builder.cc:142: void rocksdb::VersionBuilder::Rep::CheckConsistency(rocksdb::VersionStorageInfo*): Assertion `f1->largest_seqno > f2->largest_seqno || (f1->largest_seqno == 0 && f2->largest_seqno == 0)' failed.
12:40:42 UTC - mysqld got signal 6 ;
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x50000
/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0x9baa7c]
/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x3a7)[0x6ff7f7]
/lib64/libpthread.so.0(+0xf7e0)[0x7fa8e07d07e0]
/lib64/libc.so.6(gsignal+0x35)[0x7fa8de77b5e5]
/lib64/libc.so.6(abort+0x175)[0x7fa8de77cdc5]
/lib64/libc.so.6(+0x2b70e)[0x7fa8de77470e]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x7fa8de7747d0]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb14VersionBuilder6SaveToEPNS_18VersionStorageInfoE+0x143)[0xdb6493]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb10VersionSet11LogAndApplyEPNS_16ColumnFamilyDataERKNS_16MutableCFOptionsERKNS_10autovectorIPNS_11VersionEditELm8EEEPNS_17InstrumentedMutexEPNS_9DirectoryEbPKNS_19ColumnFamilyOptionsE+0x3fe)[0xc838fe]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb10VersionSet11LogAndApplyEPNS_16ColumnFamilyDataERKNS_16MutableCFOptionsEPNS_11VersionEditEPNS_17InstrumentedMutexEPNS_9DirectoryEbPKNS_19ColumnFamilyOptionsE+0x44)[0xc0dc54]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb6DBImpl15RecoverLogFilesERKSt6vectorImSaImEEPmb+0x1b8b)[0xc0545b]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb6DBImpl7RecoverERKSt6vectorINS_22ColumnFamilyDescriptorESaIS2_EEbbb+0x8dc)[0xc062bc]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb2DB4OpenERKNS_9DBOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_22ColumnFamilyDescriptorESaISD_EEPSC_IPNS_18ColumnFamilyHandleESaISJ_EEPPS0_+0x784)[0xc06ee4]
/usr/local/mysql/bin/mysqld(_ZN7rocksdb13TransactionDB4OpenERKNS_9DBOptionsERKNS_20TransactionDBOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_22ColumnFamilyDescriptorESaISG_EEPSF_IPNS_18ColumnFamilyHandleESaISM_EEPPS0_+0x819)[0xd2af19]
/usr/local/mysql/bin/mysqld[0xb95a80]
/usr/local/mysql/bin/mysqld(_Z24ha_initialize_handlertonP13st_plugin_int+0x48)[0x619908]
/usr/local/mysql/bin/mysqld[0x79e76b]
/usr/local/mysql/bin/mysqld(_Z11plugin_initPiPPci+0x5e0)[0x79f3c0]
/usr/local/mysql/bin/mysqld[0x60a5e7]
/usr/local/mysql/bin/mysqld(_Z11mysqld_mainiPPc+0x4d9)[0x60d8d9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa8de767d1d]
/usr/local/mysql/bin/mysqld[0x5fd329]


Basically, while InnoDB recovered  just fine, crash recovery for MyRocks failed with an assertion. I have been unable to recover from this situation (even using suggested recovery modes on MyRocks wiki) and had to delete all data :-(

I have raised an issue and hope to be able to repeat this test soon....