Monday, December 11, 2017

MySQL 5.7 reads all your binlog files more often than you think

After upgrading some of our slaves to the latest 5.7, I found what looks like a serious regression introduced in MySQL 5.7.
A couple of weeks ago I noticed that the error log of one of our clusters, where I had implemented my in-place transparent compression of binary logs, was literally flooded with the following error:

[ERROR] Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL

In the above setup this is a harmless error, and it should only happen at server startup, when mysqld opens and reads all available binary log files. The error is simply due to the fact that compressed files are no longer recognized by mysqld as valid binary logs - not an issue, as only older files are compressed, and only after ensuring all slaves have already received them.

But I couldn't understand why the error was now hitting every minute or so, after upgrading from 5.6 to 5.7.

After a short investigation I was able to relate the errors to the exact moment when a binary log was closed and a new one created; this normally happens when the binary log file size reaches the value set in the max_binlog_size variable. In other words, mysqld was trying to open and read all the available binary logs every time a file was rotated.
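
It was easy enough to confirm the correlation by forcing a rotation by hand and watching the error log; a minimal sketch (any session with the RELOAD privilege will do):

FLUSH BINARY LOGS;   -- closes the current binlog and creates a new one, just like a size-based rotation
SHOW BINARY LOGS;    -- one row per binary log file still on disk

If the regression is at play, the flood of "bad magic number" errors shows up in the error log right after the flush.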

Ouch! This wasn't a nice finding.

On this server we have in excess of 3,500 binary log files, and I know of installations with many more (mind you, we only have a retention of 4 days here). We like to keep binlog files small - we size them at 100 MB as this makes it easier to peruse them when troubleshooting with mysqlbinlog. Not that it would change anything here: 3,500 files of 100 MB or 350 files of 1 GB generate exactly the same amount of I/O when you need to read them all.

Anyways, I had just found that every time a binary log is rotated, mysqld opens all of the files. What a waste of I/O and performance, I thought. So I decided to have a look at the code to understand what mysqld was actually trying to do with those binlogs.

It wasn't too difficult to narrow down the binlog scan to the function purge_logs_before_date(), which expires logs based on age, as set by the expire_logs_days variable. But actually reading the binary log files shouldn't be needed when you only want to expire them, so this looked weird to me.
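
For context, this is the machinery involved: time-based expiry driven by expire_logs_days, which can also be triggered on demand with a manual purge (the 4-day value mirrors the retention mentioned above):

SHOW VARIABLES LIKE 'expire_logs_days';
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 4 DAY;   -- on-demand equivalent of the automatic expiry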

Further digging into the source code shows that mysqld is actually looking for "lost GTIDs" in the binlog files, and it does this every time a binary log is closed for rotation. I haven't looked deeper into what a "lost" GTID means in the code; however, whatever the reason for this search, it surely shouldn't be repeated at every binlog rotation.

The weird thing is, we aren't even using GTIDs!
But mysqld still opens and reads ALL the available binary log files all the time. This is a serious regression compared to 5.6, where this wasn't happening, and it can hit you hard when you have a large number of binary log files. Caveat emptor.

I have raised a bug for the above, and really hope Oracle gives it the needed attention.


Monday, November 6, 2017

When xtrabackup fails with 'read' returned OS error 0 after upgrading to 5.7

Here's something that has puzzled me for several weeks.
Right after migrating  MySQL from 5.6 to 5.7, we started experiencing random xtrabackup failures on some, but not all, of our slaves.
The failures only happened when taking an incremental backup, and the backup would always fail on the same table on each slave, with errors similar to the following:

171106 13:00:33 [01] Streaming ./gls/C_GLS_IDS_AUX.ibd
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: Retry attempts for reading partial data failed.
InnoDB: Tried to read 262144 bytes at offset 0, but was only able to read 114688
InnoDB: File (unknown): 'read' returned OS error 0. Cannot continue operation
InnoDB: Cannot continue operation.


Searching the web didn't yield any useful result.
The error was happening on one slave but not on another, even though the OS, MySQL and xtrabackup versions were the same:
  • CentOS release 6.8 (Final)
  • xtrabackup version 2.4.8 based on MySQL server 5.7.13 Linux (x86_64) (revision id: 97330f7)
  • mysqld  Ver 5.7.19-17 for Linux on x86_64 (Percona Server (GPL), Release 17, Revision e19a6b7b73f)
To properly identify the table where the backup was failing, I had to set parallelism to 1 (the --parallel=1 option of xtrabackup). Examining the failing tables with SHOW CREATE TABLE didn't show anything peculiar: the backup was failing on different tables and schemas, the tables were all different from each other, and none of them had anything special - or at least nothing that would give me a hint.
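
For reference, a minimal sketch of an incremental run with parallelism forced to 1 - paths and credentials are placeholders, not our actual setup:

# take an incremental backup based on a previous full, with a single file-copy thread
xtrabackup --backup \
  --target-dir=/backups/inc1 \
  --incremental-basedir=/backups/full \
  --parallel=1 \
  --user=backup --password=secret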

I even tried to dump the offending schema from a failing slave and import it into a non-failing slave, but there the same table backed up just fine.

This went on for a while, during which I simply ran without incremental backups on the migrated slaves - until today, when it happened again on yet another slave we had just migrated, and I had an intuition: since the incremental backup only started failing after the migration, it had to be related to the migration itself. So I started suspecting some incompatibility between 5.7 and/or xtrabackup on one side, and tablespaces created with MySQL 5.6 on the other.

Verifying my suspicion was as easy as running:

  ALTER TABLE C_GLS_IDS_AUX Engine=InnoDB

Et voilà!! The rebuilt tablespace was not crashing the incremental backup anymore.
Eventually, another table(space) would crash the backup again later, but running a NULL alter table on it and relaunching the backup would fix that. Wash, rinse, repeat until the incremental completes, and you're done.
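
If you prefer to rebuild all candidate tablespaces proactively instead of chasing them one by one, a hypothetical helper like this generates the NULL ALTERs for a whole schema (the schema name is a placeholder):

SELECT CONCAT('ALTER TABLE `', table_schema, '`.`', table_name, '` ENGINE=InnoDB;') AS ddl
  FROM information_schema.tables
 WHERE table_schema = 'gls'
   AND engine = 'InnoDB';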

Hope this helps someone save some time.
And if you are looking for xtrabackup based backup scripts, feel free to check my own set here on my GitHub page.


ADDENDUM
 
Looks like I have been a little overenthusiastic about this.
In fact, the problem reappeared the day after writing this blog post - apparently, the null alter table only fixes it temporarily.

Turns out this has to be some incompatibility between xtrabackup 2.4.x and MySQL 5.7.x, as I was able to reproduce the issue. Please see this bug if you are interested in the details.

Monday, July 3, 2017

SET PASSWORD will lock you out of your slave in a mixed 5.6/5.7 MySQL topology

Long time no post.... :-)
Here's something interesting.

Last week I decided to give MySQL 5.7 a try (yes, I am a kinda conservative DBA...) and the very same day I installed my first 5.7 replica I noticed that, after changing my own password on the 5.6 master, I could no longer connect to the 5.7 slave.

Very annoying, to say the least! So I went and dug out the root password (which we do not normally use) and when I connected to the slave I was surprised to see that my password's hash on the 5.7 slave was different from the hash on the 5.6 master. No wonder I couldn't connect....

A bit of research in the MySQL documentation and I understood that 5.7 introduced a few changes to the way you work with users' passwords. SET PASSWORD is now deprecated in favour of ALTER USER: see the MySQL 5.7 Reference Manual page.
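
In practice, where on 5.6 you would change a password with SET PASSWORD, on 5.7 the recommended statement is ALTER USER; a quick sketch (user and password are placeholders):

SET PASSWORD FOR 'app'@'%' = PASSWORD('new_password');   -- 5.6 style, now deprecated
ALTER USER 'app'@'%' IDENTIFIED BY 'new_password';       -- 5.7 style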

So far so good, but it's obvious that when Oracle implemented those changes, they didn't consider upgrade paths, or mixed 5.6/5.7 topologies that are commonly found  during a major MySQL version upgrade, like in this case.

Turns out there is a bug, first reported by Gillian Gunson back in September 2016 (SET PASSWORD replicates incorrectly between 5.6 and 5.7), which Oracle acknowledged but had not fixed at the time of this blog post. Sorry for the rant, but I really feel this is a showstopper for any migration where there are many database users and password changes happen quite often.

As Gillian pointed out in the above bug, what really happens is that the password is hashed twice, and on the replica we get a hash of the already hashed password - no wonder it doesn't work :-)
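
You can easily verify the mismatch by comparing the stored hash on both servers; note that 5.6 keeps it in the Password column of mysql.user, while 5.7 stores it in authentication_string (the user name is a placeholder):

SELECT user, host, password FROM mysql.user WHERE user = 'myuser';                 -- on the 5.6 master
SELECT user, host, authentication_string FROM mysql.user WHERE user = 'myuser';    -- on the 5.7 slave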

The good news is that this is really an easy fix. I was able to patch this myself in a few minutes, and the patch is one line of code.... you can find the diff below; it was applied to the 5.7.18 Percona source code, but I believe it will be the same in Oracle MySQL, regardless of the minor version.

Basically, I added a check to avoid hashing the password if the thread is a slave thread and the command is a SET PASSWORD. This check may break the use of SET PASSWORD once the master is on 5.7 - I didn't verify that - but hey, SET PASSWORD is deprecated anyways... and I will only use my patched version of 5.7 during the migration.

Enjoy!

*** sql/auth/sql_user.cc Mon Jul  3 16:18:55 2017
--- sql/auth/sql_user.cc.patch Mon Jul  3 16:07:01 2017
***************
*** 565,570 ****
--- 565,574 ----
    */
    if ((Str->uses_identified_by_clause ||
        (Str->auth.length == 0 && !user_exists)) 
+      // Rick Pizzi: don't change to hash if thread is a replication thread and a 
+      // SET PASSWORD was issued on master, to avoid double hash
+  // (fix for https://bugs.mysql.com/bug.php?id=82979)
+  && (!thd->slave_thread || thd->lex->sql_command != SQLCOM_SET_OPTION))
    {
      st_mysql_auth *auth= (st_mysql_auth *) plugin_decl(plugin)->info;
      inbuf= Str->auth.str;

Tuesday, May 9, 2017

Slides from Percona Live 2017: Caching your application data with MySQL and TokuDB

Here are the slides from my talk at Percona Live 2017 conference in Santa Clara, CA.
This has been my first opportunity to speak to a public audience and it has been really great!
Looking forward to speaking about MySQL again soon. Thank you!!

Wednesday, April 5, 2017

Caching your application data with MySQL and TokuDB @ Percona Live 2017

The great Percona Live 2017 database conference is approaching, and I am very excited to be there this year - not only as an attendee, but also as a speaker!!
By the way, this will be my first public talk, so please bear with me if I'm a bit nervous :-)

If you are attending, come see me and my colleague Andrea speaking about how we leverage the TokuDB engine to cache our application data using MySQL! We will show you some interesting benchmarks, and will describe our design, setup, and configuration in detail.

See you there!!

Tuesday, March 7, 2017

Using xtrabackup with MariaDB and multi source replication

Quick post today, because I wanted to share a snippet (actually, a patch) that allows the great xtrabackup tool to take backups of MariaDB when multi source replication is in use.

Granted, this is already supported, except that the slave(s) position is not saved when the backup is taken, and this somewhat defeats the purpose of the backup, which should allow you to rebuild the server and resync it with its master(s) in case of disaster.

A workaround for using xtrabackup with MariaDB exists, but only works when you have a single master: just set the default_master_connection variable in your my.cnf to the name of your master connection,  and you are set - in fact, in this situation, SHOW SLAVE STATUS will work exactly the same as in stock MySQL, and xtrabackup is happy.

The trouble begins when you have more than one master.
Each master connection has its own name, and in order to get the slave position you have to issue a SHOW SLAVE 'connectionname' STATUS.

An easier way to get information for all of them is to issue SHOW ALL SLAVES STATUS, which reports the same information as SHOW SLAVE STATUS for each connection you have defined.
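
To make the difference concrete, a quick sketch of the two statements (the connection name 'master1' is a placeholder):

SHOW SLAVE 'master1' STATUS\G   -- position of a single named connection
SHOW ALL SLAVES STATUS\G        -- positions of all defined connections at once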

The bad news is that this syntax is not understood by xtrabackup, so if you happen to use MariaDB with multiple masters, you have no way to save the slave position of each connection when you take a backup.

So here comes the little snippet below - actually a UNIX-style diff that works against the innobackupex script which is part of xtrabackup, tested with 2.2.12 and above. It patches the script so that the information for each connection is properly saved into the xtrabackup_slave_info file when the backup is taken.

This will not work on version 2.3 because the innobackupex script is now a hard link to the xtrabackup binary, so no easy patching is possible.

Please note - this is only tested with old-school replication, so if you are using GTIDs you'll need to modify the patch, but it shouldn't be too hard if you follow the changes I made.

Hope this helps. Enjoy!

$ diff -u /usr/bin/innobackupex /usr/bin/innobackupex.rick
--- /usr/bin/innobackupex    2016-02-02 11:20:55.000000000 +0100
+++ /usr/bin/innobackupex.rick    2017-03-07 16:29:19.000000000 +0100
@@ -3263,12 +3263,27 @@

     if (!defined($master) || !defined($filename) || !defined($position)) {
         my $now = current_time();
+    my $mspos = '';

-        print STDERR "$now  $prefix Failed to get master binlog coordinates " .
-            "from SHOW SLAVE STATUS\n";
-        print STDERR "$now  $prefix This means that the server is not a " .
-            "replication slave. Ignoring the --slave-info option\n";
+    # Rick
+        print STDERR "$now  trying multisource replication setup\n";
+     my $rows = $con->{dbh}->selectall_hashref("SHOW ALL SLAVES STATUS", 'Connection_name');
+    foreach my $conn_id (keys %$rows) {
+            $mspos = $mspos
+                        . "Connection $rows->{$conn_id}{Connection_name} "
+                        . "Master Host $rows->{$conn_id}{Master_Host}, "
+                        . "File $rows->{$conn_id}{Relay_Master_Log_File}, "
+                        . "Position $rows->{$conn_id}{Exec_Master_Log_Pos}\n";
+    }
+        write_to_backup_file("$slave_info", $mspos);
+    if ($mspos eq '') {
+
+            print STDERR "$now  $prefix Failed to get master binlog coordinates " .
+                "from SHOW SLAVE STATUS\n";
+            print STDERR "$now  $prefix This means that the server is not a " .
+                "replication slave. Ignoring the --slave-info option\n";

+    }
         return;
     }


Thursday, February 23, 2017

When order of appearance of indexes matters in MySQL

Sometimes MySQL surprises you in ways you would have never imagined.

Would you think that the order in which the indexes appear in a table matters?
It does. Mind you, not the order of the columns - the order of the indexes.
The MySQL optimizer can, in specific circumstances, take different paths depending on it, sometimes with nefarious effects.


Please consider the following table:

CREATE TABLE `mypartitionedtable` (
  `HASH_ID` char(64) NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `EXPIRE_DATE` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`),
  KEY `HASH_ID_IX` (`HASH_ID`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1 ROW_FORMAT=TOKUDB_UNCOMPRESSED
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(EXPIRE_DATE))
(PARTITION p2005 VALUES LESS THAN (1487847600) ENGINE = TokuDB,
 PARTITION p2006 VALUES LESS THAN (1487851200) ENGINE = TokuDB,
 PARTITION p2007 VALUES LESS THAN (1487854800) ENGINE = TokuDB,
 PARTITION p2008 VALUES LESS THAN (1487858400) ENGINE = TokuDB,
[ ... ]


This is a table where we store key-value data for a cache. The table is partitioned by range on the cache expire date, so we can actually apply retention to the cache data.

Now, consider the following DML:


DELETE FROM mypartitionedtable WHERE HASH_ID = 'somehash' AND EXPIRE_DATE > NOW() ;

One would think that the plan for the query above is optimal: it goes through the HASH_ID_IX index and also prunes partitions, only looking at partitions with non-expired data.

In reality, it depends. 
Statistics for partitioned tables in MySQL are only computed on the largest partition. 
This works well most of the time, *except* when the largest partition becomes one of the expired ones.
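
You can check which partition the statistics are most likely coming from by looking at the per-partition row counts; a quick sketch (the schema name is a placeholder, the table name matches the examples below):

SELECT PARTITION_NAME, TABLE_ROWS
  FROM information_schema.PARTITIONS
 WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'BIG_STORAGE_INNO'
 ORDER BY TABLE_ROWS DESC;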

Let's see this in detail...

We have several partitions; each partition contains a different day's worth of data, based on the values of the EXPIRE_DATE column. For the purpose of this post, let's assume partition p654 contains yesterday's data, and partition p666 contains data whose expire date falls tomorrow.

If we explain a query for a partition that contains valid cache data, i.e. a partition that has all rows with EXPIRE_DATE in the future, everything looks good: you can see that the cost for a search by HASH_ID is much less than the cost for a search by EXPIRE_DATE, as highlighted by the value of the "rows" column:


mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
| id | select_type | table            | type  | possible_keys  | key            | key_len | ref  | rows   | Extra                    |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5       | NULL | 291978 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)

mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | ref  | HASH_ID       | HASH_ID | 64      | const |    1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.01 sec)


But what happens if we explain the same query for a partition which has all rows with an EXPIRE_DATE in the past?


mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | ref  | HASH_ID       | HASH_ID | 64      | const |    1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
| id | select_type | table            | type  | possible_keys  | key            | key_len | ref  | rows | Extra                    |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5       | NULL |    1 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
1 row in set (0.01 sec)


Whoa, cost is the same for both queries now!

This has a very bad effect, as the optimizer will choose to use EXPIRE_DATE to select our rows and disregard HASH_ID completely.  And it will do this for all partitions, not just this one (where the plan actually would be ok).

Our DELETE above now becomes a locking scan of all partitions having EXPIRE_DATE > NOW(), which means (in our case) almost the entire table.

Ouch!!

This is in my opinion a bug - the fact that the optimizer only looks at the largest partition is... sub-optimal, if you ask me. At the very least, in this case the optimizer should recognize that EXPIRE_DATE is the partitioning key and choose the other index for the query, which would be correct.



Now for the fun part of this post.  

I was puzzled about what was causing the optimizer to pick one index over the other, considering they had identical costs. Unfortunately, EXPLAIN wasn't telling me anything useful, so I decided to instrument some debugging to find out.

Debugging the MySQL server involves either recompiling from source with the debugging option enabled, or (for lazy DBAs like myself) installing the debug package from your vendor of choice.

I was glad to find out that the stock Oracle distribution (Community edition) already includes the debugging binary of the server, called mysqld-debug. So I stopped the server, replaced mysqld with mysqld-debug, and restarted.
This was not enough, however, to get any debug output. A quick look at the manual and I found that there is a variable that needs to be set, simply called debug. More information about debugging MySQL is here; however, all that's needed is to set this global variable as follows:

mysql>set global debug='d:L:F:O';

This will cause debugging output to go to the MySQL error log, prefixed by source code file and line number - useful if you want to look at the source code to understand what's going on.
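
The whole round trip looks roughly like this - it only produces output on a debug build, and the output is extremely verbose, so remember to switch it off when done:

SET GLOBAL debug = 'd:L:F:O';
-- run the statement(s) you want to trace, e.g. the EXPLAIN below
SET GLOBAL debug = '';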

So I ran the EXPLAIN for the DELETE on the partition with yesterday's data, and here is what I got (uninteresting debugging output removed for clarity):

 opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 17:28:51 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f58c00b1fd8 (ROR scans)  scans: HASH_ID
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key EXPIRE_DATE_IX, cost 5.81, records 4

The output above confirmed that the optimizer was giving the two indexes the exact same cost: 5.81. The index on EXPIRE_DATE was selected (chosen: 1) and the other one discarded (cause: "cost").

To me, this looked quite like an arbitrary decision to pick the first index of the set, all costs being equal. 

Makes sense? 

Well, it does, if the cost is really identical; but in my case, I think the optimizer should have considered that one of the two indexes is the partitioning key, and picked the other.

I said this was an arbitrary decision. 
But, the same decision could have been: pick the *last* index of the set, instead of the first. 
Hey, let's patch the source code, I thought. But then an idea came to my mind.  What if the order of appearance of the indexes in the table was reversed? After all, the code is simply scanning all available indexes, one at a time, evaluating them and assigning each of them a cost.

I quickly created another identical table, but with the order of the two indexes reversed:

CREATE TABLE `BIG_STORAGE_INNO` (
  `HASH_ID` char(64) NOT NULL,
  `SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
  `COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `LAST_UPDATE` datetime NOT NULL,
  `EXPIRE_DATE` datetime NOT NULL,
  KEY `HASH_ID_IX` (`HASH_ID`),
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p654 VALUES LESS THAN (736741) ENGINE = InnoDB,
 PARTITION p655 VALUES LESS THAN (736742) ENGINE = InnoDB,
 PARTITION p656 VALUES LESS THAN (736743) ENGINE = InnoDB,
 PARTITION p657 VALUES LESS THAN (736744) ENGINE = InnoDB,
 PARTITION p658 VALUES LESS THAN (736745) ENGINE = InnoDB,
 [ .... ]


You can see that the order of the columns is the same as before, but the order of the two indexes is reversed - now HASH_ID_IX comes first. 

Guess what happens now in the optimizer?

opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 18:20:05 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f7da4095838 (ROR scans)  scans: HASH_ID_IX
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key HASH_ID_IX, cost 5.81, records 4


Bingo!!  The right index comes first now, and is chosen in every situation.
My problem is solved - well, I call this a workaround, although an interesting one.
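
If recreating the table is not an option, the same reordering can in principle be done in place by dropping and re-adding the keys in the desired order - a hypothetical sketch, worth verifying with SHOW CREATE TABLE afterwards (and worth running through pt-online-schema-change on a big table):

ALTER TABLE BIG_STORAGE_INNO
  DROP KEY EXPIRE_DATE_IX,
  DROP KEY HASH_ID_IX,
  ADD KEY HASH_ID_IX (HASH_ID),
  ADD KEY EXPIRE_DATE_IX (EXPIRE_DATE);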

I have raised a bug on the subject; I really believe that in this situation, the fact that one of the two identically expensive indexes is the partitioning key should be taken into account.

For sure this has been an amusing troubleshooting session... and an interesting finding!


ADDENDUM:

I have just discovered that the order of appearance of indexes also matters (a lot) when using row-based replication and tables without a primary key!

The SQL thread will use the first index in order of appearance to scan the table when it needs to search for rows to update.... this may result in a very long time spent searching for rows if the first index in the table is not very selective. You can spot this because the slave SQL thread will spend a lot of time in the Update_rows_log_event::find_row state for each update statement on the table without a PK.
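
A quick way to spot the condition while it is happening is to look at the state of the slave SQL thread; something along these lines:

SELECT ID, TIME, STATE, INFO
  FROM information_schema.PROCESSLIST
 WHERE STATE LIKE '%find_row%';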

The fix is to make sure a very selective index appears first in the table. This may be a life saver (it was in my case!)

Hope this helps!

Rick



Monday, February 6, 2017

How to expand a striped LVM database volume in Amazon AWS without downtime

This procedure can be used to expand an LVM database volume on Amazon AWS (but it applies equally to any storage area network environment).
Let me start with this assumption: when you create volumes for database use in AWS using EBS, you stripe data across them in order to enhance performance.  If you aren't doing this... well, you should :-)
Under this assumption, when you need to add more disk space to an existing database volume, you can't just add the disk(s) to the volume, as this would leave the added space non-striped and would eventually create hotspots in the dataset. The correct approach in this situation is to create enough new EBS disks to contain the entire dataset plus the desired added space, so that you can grow the existing dataset while re-striping it properly.

To make this clear, let's suppose you have a 3 TB dataset volume, made of three 1 TB EBS volumes striped across, but space is running out. You are in a hurry, so you quickly add a new 1 TB EBS volume to the AWS server, add it to the dataset LVM, and expand the volume, as follows:


# add the new EBS disk to LVM
pvcreate /dev/xvdl
# add the new EBS disk to the dataset volume
vgextend dataset /dev/xvdl
# extend the dataset volume using the new EBS disk
lvextend -i 1 -r -l +100%FREE /dev/dataset/db /dev/xvdl

This will add space to the dataset volume; however, the fourth EBS volume is just appended to the original striped volume, not striped with it, thus creating possible hotspots.

The proper way to expand the dataset so that it remains striped, and therefore performs optimally, is to add 4 new EBS disks, migrate the entire dataset onto them, then release the 4 old disks.

The good news is that this can be done without downtime,  if you follow the steps below.

IMPORTANT: before proceeding, make sure you have valid backups of the database!!

Now, let's suppose you have already added four 1 TB EBS disks to the server, with the following paths:

/dev/xvd[ponm]

The original volume, after you added the "emergency" 1 TB EBS disk above, is using:

/dev/xvd[fghl]
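
Before touching anything, it is worth double checking which device is which; a quick sanity check could look like this:

# which physical volumes already belong to which volume group
pvs -o pv_name,vg_name,pv_size
# confirm the new, still unused disks are the ones you think they are
lsblk /dev/xvd[ponm]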

The procedure involves migrating the dataset from one group of EBS disks to the other by setting up a mirror, letting it sync, then dropping the old mirror side to release the old disks. The important step here is to create the mirror so that it stripes across the new EBS disks properly.

# add new EBS disks to LVM
pvcreate /dev/xvd[ponm]
# add new EBS disks to dataset volume
vgextend dataset /dev/xvd[ponm]
# create a mirror out of existing dataset volume
lvconvert -b --stripes 4 -m1 --mirrorlog core /dev/dataset/db /dev/xvd[ponm]

The above commands just add the new disks to LVM, then to the volume, and then the mirror process is started.
You have to wait until the mirror sync is complete; this may take some time, even days, depending on the dataset size. You can verify it is synced using the following command and output:

# lvs -a -o +devices
  LV            VG      Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  bck           backup  -wi-ao---- 3.00t                                                     /dev/xvdi(0),/dev/xvdj(0),/dev/xvdk(0)
  db            dataset mwi-aom--- 4.00t                                    100.00           db_mimage_0(0),db_mimage_1(0)
  [db_mimage_0] dataset iwi-aom--- 4.00t                                                     /dev/xvdf(0),/dev/xvdg(0),/dev/xvdh(0)
  [db_mimage_0] dataset iwi-aom--- 4.00t                                                     /dev/xvdl(0)
  [db_mimage_1] dataset iwi-aom--- 4.00t                                                     /dev/xvdm(0),/dev/xvdn(0),/dev/xvdo(0),/dev/xvdp(0)


You need to check that the Cpy%Sync column shows 100.00 (fully synced). You can run this command at any time to see the progress. The syncing is done at very low priority, with little to no overhead and zero impact on the database. In the example above you can clearly see that the db_mimage_0 logical volume, which is the original side of the mirror, is actually made of 3 striped disks (f, g, h) plus one single appended disk (l) - exactly what we want to avoid - while db_mimage_1 (the new mirror side) is properly striped over 4 disks.
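
If you prefer not to re-run the command by hand, you can keep an eye on the sync with something like this (the fields are standard lvs column names):

# refresh the sync percentage every minute until it reaches 100.00
watch -n 60 'lvs -a -o lv_name,copy_percent,devices dataset'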

Once the mirror is in sync we can safely drop the original side of the mirror, then remove the physical disks from LVM and destroy them on the EBS side:

# removes old side of the mirror
lvconvert -m0 /dev/dataset/db /dev/xvd[fghl]
# removes EBS disks from the dataset VG
vgreduce dataset /dev/xvd[fghl]
# removes EBS disks from LVM
pvremove /dev/xvd[fghl]


Done!! You can safely go to the AWS console and detach the EBS disks you just removed from LVM.
Be careful not to remove the wrong disks, or you will destroy your dataset and database!

Tuesday, January 10, 2017

Galera Cluster: adding a primary key to a MySQL table which lacks it... without downtime

OK, let me start by saying that a table without a primary key shouldn't be something a DBA ever stumbles into. Sure, InnoDB will secretly add one for us without telling - but that's not something we can use at the application or administration level, it's just there to make the engine happy.

So let's suppose you find some tables that lack a primary key; of course you need to do something about it, right? Now, put Galera Cluster in the mix - Galera does not support tables without a primary key and will, again secretly, make such a table inconsistent at the cluster level.

You need to fix the damn table(s)!! And this is where the fun begins... as you can't afford downtime for the operation, you need to resort to an online schema change of some kind.

Galera Cluster is a very nice HA solution and I have personally embraced it for a few years now. I really love it, especially its ability to self-heal when something happens (and when you're in the cloud, there's always something happening). Galera is very different from classic MySQL replication, and the biggest difference is the fact that it replicates synchronously instead of asynchronously. Now, I will not go into the details of how Galera works - there are plenty of blog posts and very good documentation online. But there is something important that you need to be aware of when running Galera: in the standard operating mode of the cluster, any DDL you execute will freeze the entire cluster for the whole duration of the statement. This is called "Total Order Isolation" (TOI) and it is the default setting for the cluster.

Guess what. Months after migrating one of our clusters from a classic master-master topology to Galera Cluster, I suddenly found myself needing to clean up some old and barely used schemas, which contained quite a few tables lacking a primary key (or any index, for that matter). My bad: I spent a lot of time checking the main schemas and tables and making sure they were optimized for Galera, but I left the unimportant ones behind and never went back to check on them.
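
Finding the offenders is the easy part; a query along these lines lists every table without a primary key (adjust the schema filter to taste):

SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints tc
         ON tc.table_schema = t.table_schema
        AND tc.table_name = t.table_name
        AND tc.constraint_type = 'PRIMARY KEY'
 WHERE t.table_type = 'BASE TABLE'
   AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
   AND tc.constraint_name IS NULL;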

So, here I am with my apparently simple (although boring) task - I need to add a primary key to a set of tables, and I don't want downtime or to block the cluster as a whole in the process.
Preliminary checks with DEVs confirmed that all INSERTs going into the tables I needed to fix were using an explicit field list, so the change was compatible (it is important to remember to check this, in order to avoid application failures).

So, the first thing I tried was to just add the PK using the great pt-online-schema-change tool from the Percona Toolkit, which supports Galera Cluster just fine. Unfortunately, I was quickly reminded (by an error message) that the tool cannot operate if the table lacks a primary key or unique index, except when the column that you want as primary key is already present in the table, which was not my case. D'oh!

Failing that, I thought about running the ALTER statement directly, without an online schema change tool. Of course, this would have to be done on all nodes, one node at a time, after changing the wsrep_OSU_method global variable of the cluster to 'RSU'. RSU stands for "Rolling Schema Upgrade" - more info here. In this mode, any DDL you execute only applies to the current node and is not propagated to the other nodes; the node is evicted from the cluster during the operation and rejoins afterwards, but again, the DDL will NOT replicate.
So you have to run it on each node in order to complete the change. Boring, and not very practical, right? It may be fine for adding an index to a write-heavy critical table, but certainly not for changing a table definition... especially if you have many tables to change! But I was prepared to do it at any cost anyways (I felt guilty for not checking this before the migration), so I went to one of the nodes, changed the OSU method to RSU, and executed the following DDL:


ALTER TABLE myschema.mytable ADD id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
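
For completeness, the RSU switch around the DDL is just a matter of flipping that global variable on the node being altered; a minimal sketch:

SET GLOBAL wsrep_OSU_method = 'RSU';
-- run the ALTER above on this node only, then go back to the default
SET GLOBAL wsrep_OSU_method = 'TOI';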

The operation completed OK, but shortly thereafter I lost the connection to the node (the infamous "MySQL has gone away")..... WTF?!? I checked, and the mysqld process was dead. While losing one node is not a big deal in our 3-node cluster, I did not expect the node to die as a consequence of a DDL. But a quick look at the error log and I immediately realized what the problem was:

2017-01-10 13:53:06 16667 [ERROR] Slave SQL: Column 0 of table 'myschema.mytable' cannot be converted from type 'char(3)' to type 'int(11) unsigned', Error_code: 1677
2017-01-10 13:53:06 16667 [ERROR] WSREP: Failed to apply trx 760719957 4 times
2017-01-10 13:53:06 16667 [ERROR] WSREP: Node consistency compromized, aborting...

What had just happened? Galera replicates between nodes using ROW-based replication - this means that the row data gets replicated, instead of the statement that inserts or updates it. This is done for many good reasons (compatibility with non-deterministic statements, for example). But in my situation, this was biting me hard: a row inserted by the application on one of the other cluster nodes reached my node, where the table structure had been altered; the local node found that the first field of the row was of a different type, and failed as a result. In order to preserve consistency, after some unsuccessful retries, Galera aborted the MySQL server on the node.

Epic fail....

But I still needed to fix the damn tables. While the node was being (automatically) rebuilt, I spent some time thinking about a way to complete this task. Replication had failed because the first column of the modified table was of a different type than the first column of the table on the other nodes. Hmmm, what about moving the new PK column to the end of the table instead of putting it at the top, I thought. Ugly for sure, but I was curious to see whether replication would work correctly in this case. In theory, columns are matched by position in RBR, so replication should not complain after all.
Pronto, as soon as the node came back in service I set it once again in RSU mode and executed my DDL again:

ALTER TABLE myschema.mytable ADD id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;

Note that the only difference here is that this time I omitted the FIRST clause. This causes the new column to be added at the bottom of the table, as the last column.
After executing the DDL I anxiously watched the table waiting for a replicated row to arrive... and I was really happy to see that rows were entering without problems now. Yay!

But wait, how could I now repeat these steps on the other nodes in a consistent way? The shiny new primary key was an auto_increment one (and it had to be, because the application couldn't populate it for me - it didn't even know this column existed), but if I were to repeat the above DDL on the other nodes, I would get different PK values inserted in each row - Galera Cluster uses different auto_increment offsets on each node by design (something we also use in classic master-master setups, by the way). One option could have been to play with the global variables and change the increment manually on each node before executing my DDL, then put it back to what it was... but I felt this was too dangerous and error-prone, and also a bit complicated to repeat for each and every table I had to fix.
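
This behaviour comes from Galera managing the auto-increment settings per node; you can see it with the variables below (wsrep_auto_increment_control is ON by default):

SHOW GLOBAL VARIABLES LIKE 'wsrep_auto_increment_control';
SHOW GLOBAL VARIABLES LIKE 'auto_increment_increment';
SHOW GLOBAL VARIABLES LIKE 'auto_increment_offset';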

And, last but not least, I wanted the damn PK to be the first column in the table (yes, I'm picky).

But wait again - now that I've got a PK on this table, maybe pt-online-schema-change could do the hard work for me? pt-osc normally works on the master, creating a copy of the original table with the desired changes and populating it by writing chunks of rows read from the original table. The original table is never written to, and is not checked on the other nodes. It shouldn't matter that the original table does not have a PK on the other nodes - I thought the tool should be able to complete my change.

I put the wsrep_OSU_method back to the default value of 'TOI', so that all changes would be propagated (remember that pt-osc executes DDLs to swap the tables at the end of the copy process). I fired the tool with the following --alter option:


modify column id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT FIRST

This basically told the tool to alter the table and just move the id column to the top.
pt-osc diligently created a copy of the table with the PK moved to the first column, filled it with all the data from the original table I had previously altered, and then swapped the original and the new table.
That did the trick, since all the operations performed by pt-osc were propagated to all nodes of the cluster via replication. When pt-osc completed, the table was consistent on all nodes, with a brand new auto_increment PK on top.
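
For reference, the full invocation would look something like this - connection options, schema and table names are placeholders, and a --dry-run first is always a good idea:

pt-online-schema-change \
  --alter "modify column id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT FIRST" \
  D=myschema,t=mytable \
  --execute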

Mission accomplished!!

Bottom line: remember to always add primary keys to your tables...


ADDENDUM:

There is an important step that I forgot to mention in the above blog post (please bear with me). Of course, the node that you set to RSU, and where you run the ALTER statement(s), should not be receiving any write traffic during the operation, to avoid stalls.

But the very important step is: after you ALTER all your tables on the above node, and *before* you start running your pt-online-schema-change session(s), you need to make this node your only master (all writes need to go to this node!) until you're done with pt-osc. 
The reason is that the CREATE TRIGGER statements that pt-osc uses to keep your new table up to date while it runs will fail on the other nodes, because the triggers need to reference the column you have just added, and that column is not present on the other nodes yet.

Beware: if you don't follow the above step, all the changes performed by the applications to the table(s) while pt-osc runs will be lost.
Make sure these changes only come to the node where you run pt-osc, and everything will be safe.