Monday, February 6, 2017

How to expand a striped LVM database volume in Amazon AWS without downtime

This procedure can be used to expand an LVM  database volume on Amazon AWS (but also apply to any storage area network environment equally).
Let me start with this assumption: when you create volumes for database use in AWS using EBS, you stripe data across them in order to enhance performance.  If you aren't doing this... well, you should :-)
Under this assumption, when you need to add more disk space to an existing database volume, you can't just add the disk(s) to the volume, as this would make the added space non striped, and would eventually create hotspots in the dataset. The correct approach in this situation is to create a number of new EBS disks enough to contain entire dataset plus the desired added space,so that you can grow the existing dataset while re-striping properly.

To make this clear, let's suppose you have a dataset volume of  3 TB,  made of 3 1TB EBS volumes which are striped across, but space is running out. You are in a hurry, so you quickly add a new 1 TB EBS volume to the AWS server, add it to the dataset LVM, and expand the volume, as follows:

# add the new EBS disk to LVM
pvcreate /dev/xvdl
# add the new EBS disk to the dataset volume
vgextend dataset /dev/xvdl
# extend the dataset volume using the new EBS disk
lvextend -i 1 -r /dev/dataset/db /dev/xvdl

This will add space to the dataset volume, however, the fourth EBS volume is just appended to the original striped dataset volume, not striped, thus creating possible hotspots.

The proper way to expand the dataset so that it remains striped and therefore with optimum performance, is to add 4 new EBS disks, migrate the entire dataset on them, then release the old 4 disks.

The good news is that this can be done without downtime,  if you follow the steps below.

IMPORTANT: before proceeding, make sure you have valid backups of the database!!

Now, let's suppose you have added 4 EBS 1 TB disks  to the server already, with following path:  


The original volume, after you added the "emergency" 1 TB EBS disk above, is using:


The procedure involves migrating the dataset from one group of EBS disks to the other by setting up a mirror, let it sync, then dropping the old mirror side to release the old disks. The important step here is to create the mirror so that it will stripe across the EBS disks properly.

# add new EBS disks to LVM
pvcreate /dev/xvd[ponm]
# add new EBS disks to dataset volume
vgextend dataset /dev/xvd[ponm]
# create a mirror out of existing dataset volume
lvconvert -b --stripes 4 -m1 --mirrorlog core /dev/dataset/db /dev/xvd[ponm]

The above commands just add the new disks to LVM, then to the volume, and then the mirror process is started.
You have to wait until the mirror sync is complete, this may take some time, even days depending on the dataset size. You can verify it is synced using  the following command and output:

# lvs -a -o +devices
  LV                     VG      Attr             LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                                            
  bck                   backup  -wi-ao----     3.00t                                                                                       /dev/xvdi(0),/dev/xvdj(0),/dev/xvdk(0)             
  db                      dataset mwi-aom--- 4.00t                                                                  100.00           db_mimage_0(0),db_mimage_1(0)                      
  [db_mimage_0] dataset iwi-aom---   4.00t                                                                                        /dev/xvdf(0),/dev/xvdg(0),/dev/xvdh(0)             
  [db_mimage_0] dataset iwi-aom---   4.00t                                                                                        /dev/xvdl(0)                                       
  [db_mimage_1] dataset iwi-aom---   4.00t                                                                                        /dev/xvdm(0),/dev/xvdn(0),/dev/xvdo(0),/dev/xvdp(0)

You need to check that he copy sync column shows 100% synced.  You can run this command to see progress at any time. The syncing is done at very low priority and there is little to no overhead, and zero impact on database.   In the example above you can clearly see that the db_mimage0 logical volume, which includes the original side of the mirror,  is actually made by 3 striped disks (f, g, h) and one single striped disk (l), this is what we want to avoid and if you look at db_mimage1 (the new mirror side) you can clearly see it is properly striped over 4 disks now.

Once the mirror is in sync we can safely drop the original side of the mirror, then remove the physical disks from LVM and destroy them on the EBS side:

# removes old side of the mirror
lvconvert -m0 /dev/dataset/db /dev/xvd[fghl]
# removes EBS disks from the dataset VG
vgreduce dataset /dev/xvd[fghl]
# removes EBS disks from LVM
pvremove /dev/xvd[fghl]

Done!! You can safely go to the AWS console and detach the EBS disks you just removed from LVM.
Be careful not to remove the wrong disks, or you will destroy your dataset and database!

Tuesday, January 10, 2017

Galera Cluster: adding a primary key to a MySQL table which lacks it... without downtime

OK, let me start with saying that a table without a primary key shouldn't be something that a DBA should ever stumble into.  Sure, InnoDB will secretly add one for us without telling - but that's not something we can use at application or administration level, it's just to make the engine happy.

So let's suppose you find some tables that lack a primary key, of course you need to do something about it, right?  Now, put Galera Cluster in the mix - Galera does not support tables without a primary key and will, secretly again, make that table inconsistent at cluster level.

You need to fix the damn table(s)!! And this is where the fun begins...  as you can't afford downtime for the operation so you need to resort to an online schema change of some type.

Galera cluster is a very nice HA solution and personally I have embraced it since few years now. I really love it, especially its ability to self-heal when something happens (and when you're in the cloud, there's  always something happening). Galera is very different from the classic MySQL replication, and the biggest difference is the fact that it replicates synchronously instead of asynchronously. Now, I will not get into details about how Galera works, there are plenty of blog posts and very good documentation online.  But there is something important that you need to be aware of, when running Galera:  in the standard operating mode of the cluster, any DDL that you will execute will freeze the entire cluster for the entire duration of the statement. They call this "Total Order Isolation" (TOI) and it is the default setting for the cluster.

Guess what. Months after migrating one of our clusters from classic master-master topology to Galera Cluster, I suddenly found myself in need to clean up some old and barely used schemas, which contained quite a few tables lacking a primary key (or any index, for that matter). My bad, I spent a lot of time checking the main schemas and tables and making sure they were optimized for Galera, but I left the unimportant ones behind and never went back checking on them.

So, here I am with my apparently simple (although boring) task - I need to add a primary key to a set of tables and I don't want downtime and/or block the cluster as a whole in the process.
Preliminary checks with DEVs confirmed that all INSERTs going into the tables I needed to fix were using an explicit field list, so the change was compatible (it is important to remember to check this, in order to avoid application failures).

So, first thing that I tried was to just add the PK using the great pt-online-schema-change tool from the Percona toolkit, which supports Galera Cluster just fine.  Unfortunately, I was quickly remembered (by an error message) that the tool cannot operate if the table lacks a primary key or unique index, except when the column that you want to be primary key is already present in the table, which was not my case. D'oh!

Failing that, I thought about running the ALTER statement directly, without using an online schema change tool.  Of course, this would have to be done on all nodes, one node at a time, after changing the wsrep_OSU_method global variable of the cluster to 'RSU'.  RSU stands for "Rolling Schema Upgrade" - more info here. In this mode, any DDL that you execute will only apply to the current node and will not be propagated to other nodes; the node will be evicted from the cluster during the operation, and will rejoin afterwards, but again, the DDL will NOT replicate. 
So you have to run it on each node in order to complete the change.  Boring, and not very practical, right. May be good to add an index on a write-heavy critical table, but of course not to change a table definition... especially if you got many tables to change! But I was prepared to do it at any cost anyways (I eventually felt guilty for not checking this before the migration), so I went to one of the nodes, changed the OSU method to RSU, and executed the following DDL:


The operation completed  OK, but shortly thereafter, I lost connection to the node (infamous MySQL has gone away.....)  WTF?!?    I verified and the mysqld process was dead.  While it is not a big deal for a 3 node cluster to lose one node in our setup, I did not expect the node to die as a consequence of a DDL. But a quick look at the error log, and I immediately realized what the problem was:

2017-01-10 13:53:06 16667 [ERROR] Slave SQL: Column 0 of table 'myschema.mytable' cannot be converted from type 'char(3)' to type 'int(11) unsigned', Error_code: 1677
2017-01-10 13:53:06 16667 [ERROR] WSREP: Failed to apply trx 760719957 4 times
2017-01-10 13:53:06 16667 [ERROR] WSREP: Node consistency compromized, aborting...

What did just happen? Galera replicates between nodes using ROW based replication - this means that the row data gets replicated, instead of the statement that inserts or updates it. This is done for many good reasons (compatibility with non deterministic statements, for example). But in my situation, this was biting me hard - a row inserted by the application in one of the other cluster nodes reached my node where the table structure had been altered, the local node found that the first field of the row was of a different type, and failed as a result. In order to preserve consistency, after some unsuccessful retries, Galera aborted the MySQL server on the node.

Epic fail....

But I still needed to fix the damn tables. While the node was being (automatically) rebuilt, I spent some time thinking about a way to complete this task.  Replication had failed because the first column of the modified table was found to be different type vs. the first column of the table on other nodes. Hmmm, what about moving the new PK column to the end of the table instead of keeping it at the top, I thought. Ugly for sure, but I was curious to see if replication would work correctly in this case. In theory, the columns are matched by position in RBR, so replication should not complain after all.
Pronto, as soon as the node came back in service I set it once again in RSU mode and executed my DDL again:


Note that the only difference here is that I omitted the FIRST clause at the end this time. This will cause the new column to be added to the bottom of the table, as last column.
After executing the DDL I anxiously watched the table waiting for a replicated row to arrive... and I was really happy to see that rows were entering without problems now. Yay!

But wait, how could I now repeat these steps on the other nodes in a consistent way? The new shiny primary key was an auto_increment one (and it had to be, because the application couldn't populate it for me, it didn't even know this column existed),  but if I was to repeat the above DDL on other nodes, I would get different PK values inserted in each row - Galera Cluster uses different auto_increment values on each node by design (this is something we also use in classic master-master setups by the way).  One option could have been to play with the global variables and change the increment manually on each node before executing my DDL, then put it back to what it was.... but I felt this too dangerous and error prone, and also a bit complicated to repeat for each and every table that I had to fix.

And, last but not least, I wanted the damn PK to be first column in the table (yes I'm picky).

But wait again, now that I've got a PK on this table - pt-online-schema-change could do the hard work for me maybe? pt-osc normally works on the master, creating a copy of the original table with the desired changes, and populates it by writing chunks of rows read from the original table. The original table is never written to, and is not checked on the other nodes.  It shouldn't matter that the original table does not have a PK on the other nodes. I thought the tool should be able to complete my change.

I modified the wsrep_OSU_method and put it back to the default value of 'TOI', so that all changes would be propagated (remember that pt-osc executes DDLs to swap tables at the end of the copy process).  I fired the tool with the following --alter option:


This basically told the tool to alter the table, and just move the id column on top.
pt-osc diligently created a copy of the table with the PK moved to 1st column, and filled it with all data from the original table that I had previously altered; then swapped the original and the new table.
That did the trick, since all the operations performed by pt-osc got propagated to all nodes of the cluster via replication. When pt-osc completed, the table was consistent on all nodes, and with a brand new auto_increment PK on top.

Mission accomplished!!

Bottom line: remember to always add primary keys to your tables...


There is an important step that I forgot to mention in the above blog post (please bear with me) .  Of course the node that you set to RSU, and  where you run the ALTER statement(s), should not be receiving any write traffic during the operation, to avoid stalls.

But the very important step is: after you ALTER all your tables on the above node, and *before* you start running your pt-online-schema-change session(s), you need to make this node your only master (all writes need to go to this node!) until you're done with pt-osc. 
The reason is that the CREATE TRIGGER statements that pt-osc uses to keep your new table up to date while it runs, will fail on the other nodes, because the triggers needs to be created on the column you have  just added, and that column is not present on the other nodes yet.

Beware, if you don't follow the above step, all the changes performed by the applications to the table(s)  that happen while you run the pt-osc will be lost.
Make sure these changes only come to the node where you run pt-osc, and everything will be safe.

Wednesday, December 14, 2016

In place transparent compression of MySQL binary logs

Any DBA who has administered a busy master knows how fast the disk space occupied by binary logs may grow.  DBAs have no control on this: the growth depends on the workload, and the workload depends on many factors, e.g.:

- application changes (the applications start writing more due to code changes)
- traffic changes (the peak season arrive, your workload doubles in size)
- infrastructure changes (the devops add more servers)
- business changes (new business flows adds to existing workload)

So either you have being thoughtful and have planned in advance for a large enough storage space (to handle the increase in number of binary logs), or,  sooner or later, you will face the usual dilemma - how many retention days dare you give up to accommodate for the binlog growth?

In my case, I was very thoughtful, but the boss didn't listen and gave me servers with very limited binlog storage space and, more important, using internal disks so no way to add more disk space on demand. Pity, as I had previously found a way to move binlogs to a different filesystem without downtime. But unfortunately, there is no "different filesystem" this time. I'm stuck with what I have.

So, the obvious answer at first was to reduce the retention of binary logs via the
expire_logs_days variable. Not the ideal option, as in my situation, I had to reduce it to a mere 2 days, not even enough to survive a weekend. At the same time I didn't like the idea to keep more days and be waken up at 3am by a filesystem full alert.

So the idea of compressing the binary logs came to my mind, but I wanted to do it in a transparent way, preserving the existing binlog retention mechanisms (expire_logs_days variable and PURGE BINARY LOGS command should continue to work as before) and without breaking the MySQL server and/or replication. Ideally, the timestamps of the files themselves should also preserved.

I decided to try and compress the binary logs in place. In place means that when you look at your binlogs folder, you don't notice anything unusual.  Here's (a part of ) the listing of one of our servers with compression active:

-rw-rw---- 1 mysql mysql  13631245 Dec 14 14:17 dbmd01.079826
-rw-rw---- 1 mysql mysql  13909321 Dec 14 14:23 dbmd01.079827
-rw-rw---- 1 mysql mysql  13656948 Dec 14 14:30 dbmd01.079828
-rw-rw---- 1 mysql mysql  13915222 Dec 14 14:37 dbmd01.079829
-rw-rw---- 1 mysql mysql 104883640 Dec 14 14:44 dbmd01.079830
-rw-rw---- 1 mysql mysql 104898032 Dec 14 14:50 dbmd01.079831
-rw-rw---- 1 mysql mysql 104861122 Dec 14 14:57 dbmd01.079832

You can spot which files are compressed by looking at the file size, but everything else is preserved. Looking with file, here's what's in each of them:

dbmd01.079826: gzip compressed data
dbmd01.079827: gzip compressed data
dbmd01.079828: gzip compressed data
dbmd01.079829: gzip compressed data
dbmd01.079830: MySQL replication log
dbmd01.079831: MySQL replication log
dbmd01.079832: MySQL replication log

Basically, after compression, the compressed file retains the original file name. This is to keep MySQL happy - you know, these files are tracked in the index file and read when mysqld starts.
Speaking of which, here's what happens when you start mysqld  with in-place compression:

2016-10-26 03:06:13 0 [Note] /usr/sbin/mysqld (mysqld 5.6.34-79.1-log) starting as process 1407 ...
[ ... ]
2016-10-26 03:06:13 1407 [Note] InnoDB:  Percona XtraDB ( 5.6.34-79.1 started; log sequence number 2722915
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL
2016-10-26 03:06:13 1407 [ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL

Although these entries are flagged as ERROR, actually mysqld just ignores the compressed files, and more importantly, does not touch them.  

This way, compressed binlog files can be managed exactly like uncompressed, standard ones. Of course, I need to remember that if I need to peruse their content, I have to uncompress them before feeding them to the mysqlbinlog utility.

Last but not least, in order to ensure that we do not inadvertently compress binlogs that haven't been downloaded by all slaves yet,  the following extra safety rules are enforced:

- do not compress any binlog which is not at least 1 hour old 
- only compress binlogs that have a sequence number which is less than the current sequence number  of the most lagging slave

The script runs every 10 minutes, extracts the list of the binary logs which aren't compressed yet, applies the above rules, then it proceeds to compress them.

I have set expire_logs_days back to 7 days and have plenty of disk space for extra growth now....

You can find the which I crafted for the above on my GitHub page.

One caveat  - script hasn't been tested with binlog sequence numbers greater than 999,999. I am actually curious to see what happens to the binlog file name when we reach one million - need to either be very patient, or have a look at the source code... :-)



Thursday, November 3, 2016

Can RocksDB and InnoDB play along in MySQL?

My recent post about importing a big dataset from InnoDB into MyRocks has attracted quite a lot of attention (thank you Mark!) and also it has been pointed out that what I wrote about coexistence of MyRocks and InnoDB was incorrect.

In fact, I had been asking about it at Percona Live last month, but got a negative answer.... plus I had tried it at first but got a mysqld crash during crash recovery once, so since it was not important for my use case I went ahead and disabled InnoDB entirely.

But of course, as I have written previously, using both engines in the same server is something that I would really, really love to have, so I thought to give it a try with a very simple use case.

So I created the following two tables, each one with a different engine:

Create Table: CREATE TABLE `some_innodb_table` (
  `descr` char(10) COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`id`)


Create Table: CREATE TABLE `some_rocksdb_table` (
  `descr` char(10) COLLATE latin1_bin DEFAULT NULL,
  PRIMARY KEY (`id`)

The idea was to fill the two tables with some data, then kill the mysqld process and verify that the crash recovery was able to recover both engines fully.

Speaking with my friend, the great Ren√© Canna√≤, he suggested a very simple way to set up a quick test for the above. Basically, it would be enough to prepare a file with the needed insert statement in them  (one file per engine) and then execute many parallel mysql clients, feeding the file with statements into them.

So I quickly created the two files :

$ for i in $(seq 1 1000000); do echo "insert into some_rocksdb_table values (NULL, '$i');"; done > rocksdb_testfile.sql
$ for i in $(seq 1 1000000); do echo "insert into some_innodb_table values (NULL, '$i');"; done > innodb_testfile.sql

The generated files would look like this:

insert into some_innodb_table values (NULL, '1');
insert into some_innodb_table values (NULL, '2');
insert into some_innodb_table values (NULL, '3');
insert into some_innodb_table values (NULL, '4');
insert into some_innodb_table values (NULL, '5');

Now, I needed to instrument the parallel launch of several mysql clients. The idea is that when the mysql server gets killed, each of the clients will spit out the row number for the statement that it was executing when the server died; so I could just get that number for each of the clients, subtract one, and easily get the number of committed transactions that I should find on each engine. This should work because we are running in auto-commit mode, so basically each statement that we send is within its own transaction.

The trouble here was to capture the information spit out by the mysql client  in an ordered manner, as all the clients would be launched in background from the shell simultaneously. So I crafted a small script that captures each output and saves it to a dedicated log (I had to be able to differentiate the sums of the two engines).

echo -n "Password: "
stty -echo
read pass
stty echo
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_1.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_1.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_2.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_2.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_3.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_3.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_4.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_4.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_5.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_5.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_6.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_6.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_7.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_7.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_8.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_8.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_9.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_9.log &
cat innodb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> innodb_10.log &
cat rocksdb_testfile.sql | mysql -u rpizzi -p$pass -S /storage/rockstat/data/mysql.sock rocks_db 2> rocksdb_10.log &

The script will launch 20 mysql clients in parallel (by running them in background), 10 for each engine, and capture their output for later perusal. Each client will be fed one million inserts, so if you let this run to completion, it will insert 20 million rows, 10 million per table.

Before launching the above script I restarted the server with the proper setting for a crash safe master, following the recommendations from the Wiki. The config used is the one below; notable options here are sync_binlog=1, innodb_flush_log_at_trx_commit=1 for  (InnoDB) and rocksdb_disable_2pc=OFF (for MyRocks).

Please note that the coexistence of any two engines in MySQL at this time has problems around XA support - I got repeatable crashes and reported these here even if this is very likely an upstream bug.  I have been recommended to enable binary logging to work around this problem, and it actually worked fine when I did.

Another issue I stumbled into while trying to set up this small test is a bug regarding the auto_increment_increment setting, which I accidentally had set to 2 (we use this setup in clusters).  You can check this particular issue here. Hey, turns out I have been quite lucky with my previous use case :-)

innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 10G

innodb_log_buffer_size = 64M


binlog_format                  = ROW
transaction-isolation = READ-COMMITTED
log_bin                        = /storage/rockstat/binlog/rocks
expire_logs_days               = 1


Well, I launched my script and waited a little while. While waiting I was curious to compare the insert speed between InnoDB and RocksDB:

mysql> select count(*) as innodb from some_innodb_table; select count(*) as rocksdb from some_rocksdb_table;
| innodb  |
| 4544377 |
1 row in set (1.28 sec)

| rocksdb |
| 4361642 |
1 row in set (1.85 sec)

I was very happy to verify that RocksDB was able to keep up with InnoDB in this little insert speed test :-) Actually, InnoDB was about 10% faster, but remember that we are running with MyRock's 2 phase commit enabled here (where it is normally OFF). Also, InnoDB insert performance will weaken over time, where MyRocks should hold just fine.

Anyways, I went and killed the mysqld process with kill -9.  My script exited and left 20 log files for my perusal:

bash-4.1$ ls -l *log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_1.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_10.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_2.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_3.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_4.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_5.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_6.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_7.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_8.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 innodb_9.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_1.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_10.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_2.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_3.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_4.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_5.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_6.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_7.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_8.log
-rw-r--r-- 1 root root 153 Nov  3 12:45 rocksdb_9.log

Each file contained the row corresponding to the statement that each client was executing, for example:

-bash-4.1$ cat rocksdb_1.log 
Warning: Using a password on the command line interface can be insecure.
ERROR 2013 (HY000) at line 476838: Lost connection to MySQL server during query

In this example, the client had inserted 476837 rows successfully and died on the 476838th.

Another small script helped me gather the final information that I needed to compare with a count(*) later:

c=$(echo $(fgrep ERROR innodb*log | cut -d " " -f 6 | cut -d":" -f 1 | tr "\n" "+" | sed -e "s/+$//") | bc)
echo "InnoDB count: $((c-1))"
c=$(echo $(fgrep ERROR rocksdb*log | cut -d " " -f 6 | cut -d":" -f 1 | tr "\n" "+" | sed -e "s/+$//") | bc)
echo "RocksDB count: $((c-1))"

Here's what I got after the kill:

InnoDB count: 4985188
RocksDB count: 4769893

Now, it was time to start MySQL again and let it do crash recovery, then compare the # of rows in each table with the above numbers, hoping for a match.

Unfortunately the fun stopped here,  with the following assertion:

mysqld: /home/rpizzi/mysql-fb/rocksdb/db/ void rocksdb::VersionBuilder::Rep::CheckConsistency(rocksdb::VersionStorageInfo*): Assertion `f1->largest_seqno > f2->largest_seqno || (f1->largest_seqno == 0 && f2->largest_seqno == 0)' failed.
12:40:42 UTC - mysqld got signal 6 ;
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x50000

Basically, while InnoDB recovered  just fine, crash recovery for MyRocks failed with an assertion. I have been unable to recover from this situation (even using suggested recovery modes on MyRocks wiki) and had to delete all data :-(

I have raised an issue and hope to be able to repeat this test soon....

Monday, October 31, 2016

MyRocks: migrating a large MySQL dataset from InnoDB to RocksDB to reduce footprint

I have been following Facebook's MyRocks project (and Mark Callaghan's blog) for a long time. The idea of an LSM based engine for MySQL is actually a great idea.

We all know that InnoDB sucks at INSERTs.  BTree in general sucks when it's about insertion speed, and the more rows you insert, the more it sucks at it. There are many blog posts on the web that shows the insert speed degradation in InnoDB when the amount of rows in the table grows. Things get much worse faster if your primary key is a random key, for example an UUID.

We hit this problem with our caching servers (yes, we do caching with MySQL!), and in order to be able to scale these servers up we moved since a couple years to the TokuDB engine with great success. TokuDB is based on fractal tree technology, and guarantees the same insert speed, no matter the number of rows you have in the table; furthermore, it has no problems with random INSERTs (inserting rows with random PK values) and it also achieves great compression when compared to InnoDB.

Enter RocksDB and the LSM (Log Structured Merge) technology: we are entering a new planet as far as data storage goes. The idea of an "append only" type of approach for writing database rows is something I never heard before. Databases usually are the most random writer ever! And although SSD will not suffer from write randomness, there are other factors that need to be taken into consideration. I'll not go into more details about how LSM or RocksDB work, or why LSM is good for your SSD and your workload, as there is plenty of information about it on Mark's blog (see above) or on the official page for RocksDB and for MyRocks (the RocksDB engine for MySQL) on GitHub; just check the Wiki there.

For quite some time I dreamed  to give MyRocks a try with our giant (at least by our standards) write-mostly tracking database, currently occupying about 2.1T of disk space and growing about 200G per month (and yes we DO have data retention in place).  This cluster has a very write intensive workload and the written data is  very seldom read, usually to extract statistical reports, or for application or incident post mortem analysis. At the time of this post, it contains approximately 830 tables.
So I was really excited to hear at the recent Percona Live AMS  Yoshinori Matsunobu announcing that MyRocks was ready for production use.  I explicitly asked him if partitioning was supported by MyRocks (as we use range partitioning to prune old data) and got a positive response. Yay!!

Even if production ready, MyRocks is still in early stages and therefore I needed to clone the git repository and compile the server from source code. I also had to compile the GCC compiler first, as it is a good idea to use one of the GCC versions that were tested by the Facebook team so that I can report issues that may happen later. Peter Z announced that MyRocks will be included in Percona Server soon, but if you want to give it a try now, this is the only way.

Next step was to obtain a consistent dump of the cluster, along with master position in order to add the MyRocks slave to the cluster after the data import. I knew this dump was going to take a long time, so instead of using mysqldump  I decided to use mydumper, the parallel dump-and-load tool that I have used many times in the past, especially to migrate out of Amazon RDS (anyone?). 

Also, I knew that some editing of the dump was required in order to accomplish the following:

- remove references to InnoDB and/or replace them with ROCKSDB
- introduce binary collation required by MyRocks  (insert  "COLLATE latin1_bin" or "COLLATE utf8_bin" in the table definitions)
- remove InnoDB compression where enabled (many of the larger tables are compressed)

So I downloaded, compiled and installed mydumper and after a whopping 44 hours, my dump was complete (and occupying about 2T of disk space uncompressed).

MyDumper creates two files for each table, one contains the table creation command and the other one contains the actual dump of the table's data.  I crafted a simple script that goes through all the table creation files and applies the changes above in them:

for f in bigdump/*schema*sql
  fn=$(basename $f)
  sed -e "s/InnoDB/ROCKSDB/g" -e "s/ DEFAULT CHARSET=latin1/ DEFAULT CHARSET=latin1 COLLATE=latin1_bin/g" -e "s/ DEFAULT CHARSET=utf8/ DEFAULT CHARSET=utf8 COLLATE=utf8_bin/g" -e "s/ ROW_FORMAT=COMPRESSED/ /g" < bigdump/$fn > bigdump/$fn.mangled
  mv bigdump/$fn tmp/$fn.orig  
  mv bigdump/$fn.mangled bigdump/$fn

So I was ready to import this 2T of stuff into MyRocks. Of course, this is also a very time consuming operation, when you have more than a few G's to import. So it is really important to use all the available optimizations in order to speed up the bulk load

Turns out that MyRocks contains quite a few tweaks to help with this; the most notable ones being rocksdb_skip_unique_check, which disables checking on unique constraints, and rocksdb_commit_in_the_middle which only commits every 1,000 rows or so. More info about options suitable for bulk loading data can be found on the Wiki page here.   Beware that these settings cannot be used in regular production mode, they can cause corruption of your data! Therefore it is best to create a dedicated config for bulk data loading purposes only, and use the standard "production" config after the import is completed.

Side note: before you attempt to load large datasets into MyRocks, make sure you set your open file limit high enough! I have learned it the hard way, that MyRocks keeps one file descriptor open for each SST file you have in your database, and the number of SST files can grow very high when you are importing lots of data. In my case, I have topped 34,000 open file descriptors when the load was completed, which are then kept open even during normal use. I had originally reported this issue and following that a better default was recommended to me for the variable target_file_size_base in rocksdb_default_cf_options. Changing the recommended value from 32m to 64m made a big change regarding the amount of file descriptors required, dropping the above number to 12,500.

After a couple failed attempt at importing the data (mostly due to me being too lazy to properly read the available documentation) I finally managed to complete the import with satisfying results, using the configuration below.  You will notice by reading the config that MyRocks cannot at this time coexist with InnoDB. I have been inquiring with Yoshinori about this, and the reason is that at this time, the crash recovery mechanism of the two engines will conflict. I can't wait for the day when we will be able to use RocksDB and InnoDB in the same server!


# config for bulk load


During the import I have been watching the disk space (footprint) occupied by MyRocks' data directory. MyRocks creates a .rocksdb subfolder in your datadir and stores all its files in there, so I ran a script to monitor both the # of file descriptor in use (with lsof) and the footprint (with du) every 60 seconds:

223G /storage/rockstat/data/.rocksdb
229G /storage/rockstat/data/.rocksdb
219G /storage/rockstat/data/.rocksdb
220G /storage/rockstat/data/.rocksdb
226G /storage/rockstat/data/.rocksdb
211G /storage/rockstat/data/.rocksdb

You can clearly see the compaction of data taking place while the data loads. Compaction pushes data to lower levels, packing and compressing it in the process and therefore the footprint decrease while the import progresses.  Cool!!

But what was really cool is finding out that MyRocks was able to pack 2.1T worth of data into a mere 611G  - reducing an already compressed dataset by more than 3 times! This is even more amazing if you consider that about a dozen of the larger tables in this particular dataset contain BLOBs which have  compressed data in them. In the steps above I had disabled compression at the engine level where it was configured in the source dataset, but of course compressed data in BLOBs will remain compressed and MyRocks did a very good job at compressing regardless.

Importing the dump took about 17  hours. During the import I ran into some stalls that I reported here. Once the import completed I shut the server down and switched the my.cnf file used for bulk load with the "production" one below. Please note that I had to keep rocksdb_block_cache_size low due to other stuff using memory on the test server at the same time;  you should raise this accordingly to the amount of memory available, keeping in mind that MyRocks, similarly to TokuDB, also uses filesystem cache extensively. I didn't find a recommendation on the MyRocks wiki, so I decided to set this parameter to 50% of the available server memory, which is the recommended value for Toku



Using the information saved by mydumper in the "metadata" file I set up replication from the actual InnoDB based slave to the MyRocks slave and let it catch up. I used MTS (Multi Threaded Slave) to speed up the catch up (parallel replication happens to work very well for this cluster's workload) and the slave caught up pretty quickly. After the catch up, I stopped the slave and disabled parallel replication, as it is not compatible with relay_log_recovery in case of a slave crash, and the latter option is needed by MyRocks to guarantee a crash safe slave.

I quickly found out that having a slave which uses an engine different than the master needs also some other tweaks in the configuration. Remember that I had to disable InnoDB in order to make MyRocks work properly.  So what happens when on the master some DDL is executed which explicitly references ENGINE=InnoDB?

Here's the answer:

               Last_SQL_Errno: 1286

               Last_SQL_Error: Worker 7 failed executing transaction '' at master log dbstat02.034383, end_log_pos 99530607; Error 'Unknown storage engine 'innodb'' on query. Default database: 'rick_rocks'. Query: 'create table ttable2 (id int not null auto_increment primary key) engine=innodb'

D'oh! Of course, this is not what I wanted. I quickly remembered about a SQL_MODE option that I always wondered what  could be used forNO_ENGINE_SUBSTITUTION

From the manual page:

With NO_ENGINE_SUBSTITUTION disabled, for CREATE TABLE the default engine is used and a warning occurs if the desired engine is unavailable. For ALTER TABLE, a warning occurs and the table is not altered.

Hey, sounds like exactly what I want!! We never use ALTER TABLE anyways as we do everything  via pt-online-schema-change, so this option should be all we need on the slave.

This option (NO_ENGINE_SUBSTITUTIONis enabled by default since MySQL 5.6.6, so I needed to revert the sql_mode to a blank string, by setting in my.cnf the following:


To my dismay, after restarting the MyRocks slave, the behaviour was unchanged.  A bit of digging in the MySQL source code and I discovered that the slave uses the master's SQL_MODE, and not its own! The mode gets written to the binary log for each and every transaction, and is used by the slave when applying events.
So, I went to the master and changed the SQL_MODE there. After restarting replication, now the behaviour is correct and a table created with Engine=INNODB is created with engine ROCKSDB on the slave.

Finally, here's a comparison of the disk usage between a classic "InnoDB" slave and our new "RocksDB" one,  over the last 48 hours:



Can you spot without using the caption which one of the two is our MyRocks slave?  The compaction algorithm can be clearly spotted in the graph.  The two big drops around 6am is when the older partitions are dropped every night to apply retention.

The graphs above show that the MyRocks slave grew from 47.75% to 48%, while the InnoDB slave grew from 77% to 78%.  What  they don't show is that the MyRocks storage is  a mere 1.4T where the InnoDB one is double that, at 2.8T!

So, bottom line is, MyRocks grew 0.25% of 1.4T = 3.5G while InnoDB grew 1% of 2.8T = 28G. This is a 7x improvement!

We will be monitoring this new slave for some time trying to compare CPU metrics and to observe general behaviour, like lag and general reliability.

But for now, I would say that MyRocks really ... rocks!!!  Well done FB Team!