Monday, November 6, 2017

when xtrabackup fails with 'read' returned OS error 0 after upgrading to 5.7

Here's something that has puzzled me for several weeks.
Right after migrating  MySQL from 5.6 to 5.7, we started experiencing random xtrabackup failures on some, but not all, of our slaves.
The failures were only happening when taking an incremental backup, and it would always fail on the same table on each slave, with errors similar to the following:

171106 13:00:33 [01] Streaming ./gls/C_GLS_IDS_AUX.ibd
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: Retry attempts for reading partial data failed.
InnoDB: Tried to read 262144 bytes at offset 0, but was only able to read 114688
InnoDB: File (unknown): 'read' returned OS error 0. Cannot continue operation
InnoDB: Cannot continue operation.


Searching the web didn't yield any useful result.
The error was happening on one slave but not on  another, OS, MySQL and Xtrabackup versions being the same:
  • CentOS release 6.8 (Final)
  • xtrabackup version 2.4.8 based on MySQL server 5.7.13 Linux (x86_64) (revision id: 97330f7)
  • mysqld  Ver 5.7.19-17 for Linux on x86_64 (Percona Server (GPL), Release 17, Revision e19a6b7b73f)
To properly identify the table where the backup was failing I had to set parallelism to 1 (option --parallel=1 of xtrabackup). Examining the table with SHOW CREATE TABLE didn't show anything particular, the backup was failing on different tables and schemas, and tables were all different from each other and didn't have anything special,  or at least nothing that would  give me an hint.

I even tried to dump the offending schema from a failing slave and import it into a non failing slave, but on the non-failing slave the same table in that schema backed up just fine.

This has been going on for a while, where I used to run without incremental backups on the migrated slaves, until today, when it happened again on yet another slave that we had just migrated, and I had  an intuition:  since the failure of the incremental backup was only happening after the migration, of course it had to be related to the migration itself. So I started suspecting some incompatibility between 5.7 and/or xtrabackup, and tablespaces created with MySQL 5.6.

Verifying my suspect was as easy as running:

  ALTER TABLE C_GLS_IDS_AUX Engine=InnoDB

Et voilĂ !! The rebuilt tablespace was not crashing the incremental backup anymore.
Eventually, another table(space) would crash the backup again later, but running a NULL alter table on it and relaunching the backup would fix that. Wash, rinse, repeat until the incremental completes, and you're done.

Hope this helps someone saving some time..
And if you are looking for xtrabackup based backup scripts, feel free to check my own set here on my GitHub page.


ADDENDUM
 
Looks like I have been a little overenthusiastic about this.
In fact, the problem reappeared the day after writing this blog post - apparently, the null alter table only fixes it temporarily.

Turns out this has to be some incompatibility between Xtrabackup 2.4.x and MySQL 5.7.x as I was able to reproduce the issue. Please see this bug if you are interested in the details.

Monday, July 3, 2017

SET PASSWORD will lock you out of your slave in a mixed 5.6/5.7 MySQL topology

Long time no post.... :-)
Here's something interesting.

Last week I decided to give MySQL 5.7 a try (yes, I am kinda conservative DBA...) and the very same day that I installed my first 5.7 replica I noticed that, after changing my own password on the 5.6 master, I could no longer connect to the 5.7 slave.

Very annoying, to say the least! So I went and dug out the root password (which we do not normally use) and when I connected to the slave I was surprised to see that my password's hash on the 5.7 slave was different than the hash on the 5.6 master. No wonder I couldn't connect....

A bit of research on the MySQL documentation and I understood that 5.7 introduced few changes around the way you work with users' passwords.  SET PASSWORD is now deprecated in favour of ALTER USER: see MySQL 5.7 Reference Manual page.

So far so good, but it's obvious that when Oracle implemented those changes, they didn't consider upgrade paths, or mixed 5.6/5.7 topologies that are commonly found  during a major MySQL version upgrade, like in this case.

Turns out there is a bug first reported by Gillian Gunson back in September 2016 (SET PASSWORD replicates incorrectly between 5.6 and 5.7) which Oracle acknowledged but didn't care to fix at the time of this blog post. Sorry for the rant, but I really feel this to be a real showstopper to any migration where there are many database users and where a password change may happen quite often.

As Gillian pointed out in the above bug, what really happens is that the password is hashed twice, and on the replica we get an hash of the already hashed password - no wonder it doesn't work :-)

The good news is that this is really an easy fix. I was able to patch this myself in few minutes, and the patch is one line of code....  you can find the diff below, this was applied to 5.7.18 Percona source code  but I believe it will  be the same in Oracle MySQL, regardless the minor version.

Basically, I added a check to avoid hashing the password if the thread is a slave thread and the command is  a SET PASSWORD. This check may break the use of SET PASSWORD when master will be on 5.7, I didn't check that - but hey, SET PASSWORD is deprecated anyways... and I will only use my patched version of 5.7 during the migration.

Enjoy!

*** sql/auth/sql_user.cc Mon Jul  3 16:18:55 2017
--- sql/auth/sql_user.cc.patch Mon Jul  3 16:07:01 2017
***************
*** 565,570 ****
--- 565,574 ----
    */
    if ((Str->uses_identified_by_clause ||
        (Str->auth.length == 0 && !user_exists)) 
+      // Rick Pizzi: don't change to hash if thread is a replication thread and a 
+      // SET PASSWORD was issued on master, to avoid double hash
+  // (fix for https://bugs.mysql.com/bug.php?id=82979)
+  && (!thd->slave_thread || thd->lex->sql_command != SQLCOM_SET_OPTION))
    {
      st_mysql_auth *auth= (st_mysql_auth *) plugin_decl(plugin)->info;
      inbuf= Str->auth.str;

Tuesday, May 9, 2017

Slides from Percona Live 2017: Caching your application data with MySQL and TokuDB

Here are the slides from my talk at Percona Live 2017 conference in Santa Clara, CA.
This has been my first opportunity to speak to a public audience and it has been really great!
Looking forward to speak about MySQL again soon. Thank you!!

Wednesday, April 5, 2017

Caching your application data with MySQL and TokuDB @ Percona Live 2017

The great Percona Live 2017 database conference is approaching, and I am very excited to be there this year - not only as an attendee, but also as a speaker!!
By the way, this will be my first public talk so please bear with me if I'll be a bit nervous :-)

If you are attending, come see me and my colleague Andrea speaking about how we leverage the TokuDB engine to cache our application data using MySQL! We will show you some interesting benchmarks, and will describe our design, setup, and configuration in detail.

See you there!!

Tuesday, March 7, 2017

Using xtrabackup with MariaDB and multi source replication

Quick post today, because I wanted to share a snippet (actually, a patch) that allows the great xtrabackup tool to take backups of MariaDB when multi source replication is in use.

Granted, this is supported already, except the slave(s) position is not saved when the backup is taken, and this somewhat defeats the purpose of the backup, which should be used to rebuild the server and resync it with its master(s) in case of disaster.

A workaround for using xtrabackup with MariaDB exists, but only works when you have a single master: just set the default_master_connection variable in your my.cnf to the name of your master connection,  and you are set - in fact, in this situation, SHOW SLAVE STATUS will work exactly the same as in stock MySQL, and xtrabackup is happy.

The trouble begins when you have more than one master.
Each master connection has its own name, and in order to get the slave position you have to issue a SHOW SLAVE 'connectionname' STATUS.

A more easy way to get information for all of them is to issue a SHOW ALL SLAVES STATUS, which will report the same information as SHOW SLAVE STATUS for each connection that you have defined.

The bad news is that this syntax is not understood by xtrabackup, so if you happen to use MariaDB with multiple masters, you have no way to save the slave position of each connection when you take a backup.

So here comes the little snippet below - actually a UNIX style diff, that will work against the innobackupex script which is part of xtrabackup, tested with 2.2.12 and above. It will patch the script so that information for each connection is properly saved into the xtrabackup_slave_info file when backup is taken.

This will not work on version 2.3 because the innobackupex script is now an hard link to the xtrabackup binary, so no easy patching is possible.

Please note - only tested with old school replication, so if you are using GTIDs you'll need to modify the patch, but shouldn't be too hard if you follow the changes I did.

Hope this helps. Enjoy!

$ diff -u /usr/bin/innobackupex /usr/bin/innobackupex.rick
--- /usr/bin/innobackupex    2016-02-02 11:20:55.000000000 +0100
+++ /usr/bin/innobackupex.rick    2017-03-07 16:29:19.000000000 +0100
@@ -3263,12 +3263,27 @@

     if (!defined($master) || !defined($filename) || !defined($position)) {
         my $now = current_time();
+    my $mspos = '';

-        print STDERR "$now  $prefix Failed to get master binlog coordinates " .
-            "from SHOW SLAVE STATUS\n";
-        print STDERR "$now  $prefix This means that the server is not a " .
-            "replication slave. Ignoring the --slave-info option\n";
+    # Rick
+        print STDERR "$now  trying multisource replication setup\n";
+     my $rows = $con->{dbh}->selectall_hashref("SHOW ALL SLAVES STATUS", 'Connection_name');
+    foreach my $conn_id (keys %$rows) {
+            $mspos = $mspos
+                        . "Connection $rows->{$conn_id}{Connection_name} "
+                        . "Master Host $rows->{$conn_id}{Master_Host}, "
+                        . "File $rows->{$conn_id}{Relay_Master_Log_File}, "
+                        . "Position $rows->{$conn_id}{Exec_Master_Log_Pos}\n";
+    }
+        write_to_backup_file("$slave_info", $mspos);
+    if ($mspos eq '') {
+
+            print STDERR "$now  $prefix Failed to get master binlog coordinates " .
+                "from SHOW SLAVE STATUS\n";
+            print STDERR "$now  $prefix This means that the server is not a " .
+                "replication slave. Ignoring the --slave-info option\n";

+    }
         return;
     }









Thursday, February 23, 2017

When order of appearance of indexes matters in MySQL

Sometimes MySQL surprises you in ways you would have never imagined.

Would you think that the order in which the indexes appear in a table matters?
It does. Mind you, not the order of the columns - the order of the indexes.
MySQL optimizer can, in specific circumstances, take different paths, sometimes with nefarious effects.


Please consider the following table:

CREATE TABLE `mypartitionedtable ` (
  `HASH_ID` char(64) NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `EXPIRE_DATE` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`),
  KEY `HASH_ID_IX` (`HASH_ID`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1 ROW_FORMAT=TOKUDB_UNCOMPRESSED
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(EXPIRE_DATE))
(PARTITION p2005 VALUES LESS THAN (1487847600) ENGINE = TokuDB,
 PARTITION p2006 VALUES LESS THAN (1487851200) ENGINE = TokuDB,
 PARTITION p2007 VALUES LESS THAN (1487854800) ENGINE = TokuDB,
 PARTITION p2008 VALUES LESS THAN (1487858400) ENGINE = TokuDB,
[ ... ]


This is a table where we store key-value data for a cache. The table is partitioned by range on the cache expire date, so we can actually apply retention to the cache data.

Now, consider the following DML:

DELETE FROM mypartitionedtable WHERE HASH_ID = 'somehash' AND EXPIRE_DATE > NOW() ;

One would think that the plan for the query above is optimal: it goes by HASH_ID_IX index and also prunes partitions, by only looking at partitions with non-expired data.

In reality, it depends. 
Statistics for partitioned tables in MySQL are only computed on the largest partition. 
This works well most of the times, *except* when the largest partition becomes one of the expired ones.

Let's see this in detail...

We have several partitions, each partition contains a different day worth of data, based on values of the EXPIRE_DATE column. For the purpose of this post, let's assume partition p654 contains yesterday's data, and partition p666 contains data with an expire date which falls tomorrow. 

If we explain a query for a partition that contains valid cache data, i.e. a partition that has all rows with EXPIRE_DATE in the future, everything looks good: you can see that the cost for a search by HASH_ID is much less than the cost for a search by EXPIRE_DATE, as highlighted by the value of the "rows" column:


mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
| id | select_type | table            | type  | possible_keys  | key            | key_len | ref  | rows   | Extra                    |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5       | NULL | 291978 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)

mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | ref  | HASH_ID       | HASH_ID | 64      | const |    1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.01 sec)


But what happens if we explain the same query for a partition which has all rows with an EXPIRE_DATE in the past?


mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | ref  | HASH_ID       | HASH_ID | 64      | const |    1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
| id | select_type | table            | type  | possible_keys  | key            | key_len | ref  | rows | Extra                    |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5       | NULL |    1 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
1 row in set (0.01 sec)


Whoa, cost is the same for both queries now!

This has a very bad effect, as the optimizer will choose to use EXPIRE_DATE to select our rows and disregard HASH_ID completely.  And it will do this for all partitions, not just this one (where the plan actually would be ok).

Our DELETE above becomes now a locking scan of all partitions having EXPIRE_DATE > NOW(), this means (in our case) almost the entire table.

Ouch!!

This is in my opinion a bug - the fact that the optimizer only looks at the largest partition is ... sub-optimal, if you ask me. But in this case, the optimizer should also evaluate that EXPIRE_DATE is a partitioning key and choose the other index for the query, which would be correct.



Now for the fun part of this post.  

I was puzzled about what was causing the optimizer to pick one index and not the other, considering that they had identical cost. Unfortunately, explain wasn't telling me anything useful, so I decided to instrument some debugging to find out.

Debugging the MySQL server actually involves either recompiling from source with debugging option enabled, or (for lazy DBAs like myself) installing the debug package from your vendor of choice.

I was glad to find out that the stock Oracle distribution (Community edition) already includes the debugging binary of the server, called mysqld-debug. So I stopped the server, replaced mysqld with mysqld-debug, and restarted.
This was not enough, however, to have any debug output.  A quick look at the manual and I found that there is a variable that needs instrumented, called simply debug. More information about debugging MySQL is here, however all it's needed is to set this global variable as follows:

mysql>set global debug='d:L:F:O';

This will cause debugging output to go to the MySQL error log. It will print debugging output prefixed by source code file and line , useful if you want to look at source code to understand what's going on. 

So I ran the explain for the DELETE on the partition with yesterday data, and here is what I got (uninteresting debugging output removed for clarity):

 opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 17:28:51 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f58c00b1fd8 (ROR scans)  scans: HASH_ID
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key EXPIRE_DATE_IX, cost 5.81, records 4

By looking at the output above, it was confirmed that the optimizer was giving the two indexes the same exact cost: 5.81.  The index on EXPIRE_DATE was selected (chosen: 1) and the other one discarded (cause: "cost").

To me, this looked quite like an arbitrary decision to pick the first index of the set, all costs being equal. 

Makes sense? 

Well, it does, if cost is really identical, but in my case, I think the optimizer should have considered that one of the two indexes is a partitioning key, and pick the other. 

I said this was an arbitrary decision. 
But, the same decision could have been: pick the *last* index of the set, instead of the first. 
Hey, let's patch the source code, I thought. But then an idea came to my mind.  What if the order of appearance of the indexes in the table was reversed? After all, the code is simply scanning all available indexes, one at a time, evaluating them and assigning each of them a cost.

I quickly created another identical table, but with the order of the two indexes reversed:

CREATE TABLE `BIG_STORAGE_INNO` (
  `HASH_ID` char(64) NOT NULL,
  `SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
  `COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `LAST_UPDATE` datetime NOT NULL,
  `EXPIRE_DATE` datetime NOT NULL,
  KEY `HASH_ID_IX` (`HASH_ID`),
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p654 VALUES LESS THAN (736741) ENGINE = InnoDB,
 PARTITION p655 VALUES LESS THAN (736742) ENGINE = InnoDB,
 PARTITION p656 VALUES LESS THAN (736743) ENGINE = InnoDB,
 PARTITION p657 VALUES LESS THAN (736744) ENGINE = InnoDB,
 PARTITION p658 VALUES LESS THAN (736745) ENGINE = InnoDB,
 [ .... ]


You can see that the order of the columns is the same as before, but the order of the two indexes is reversed - now HASH_ID_IX comes first. 

Guess what happens now in the optimizer?

opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 18:20:05 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f7da4095838 (ROR scans)  scans: HASH_ID_IX
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key HASH_ID_IX, cost 5.81, records 4


Bingo!!  The right index comes first now, and is chosen in every situation.
My problem is solved - well, I call this a workaround, although an interesting one.

I have raised a bug on the subject, I really believe that in this situation, the fact that one of the two identically expensive indexes is a partitioning key should be taken into account.

For sure this has been an amusing troubleshooting... and an interesting finding!






Monday, February 6, 2017

How to expand a striped LVM database volume in Amazon AWS without downtime

This procedure can be used to expand an LVM  database volume on Amazon AWS (but also apply to any storage area network environment equally).
Let me start with this assumption: when you create volumes for database use in AWS using EBS, you stripe data across them in order to enhance performance.  If you aren't doing this... well, you should :-)
Under this assumption, when you need to add more disk space to an existing database volume, you can't just add the disk(s) to the volume, as this would make the added space non striped, and would eventually create hotspots in the dataset. The correct approach in this situation is to create a number of new EBS disks enough to contain entire dataset plus the desired added space,so that you can grow the existing dataset while re-striping properly.

To make this clear, let's suppose you have a dataset volume of  3 TB,  made of 3 1TB EBS volumes which are striped across, but space is running out. You are in a hurry, so you quickly add a new 1 TB EBS volume to the AWS server, add it to the dataset LVM, and expand the volume, as follows:


# add the new EBS disk to LVM
pvcreate /dev/xvdl
# add the new EBS disk to the dataset volume
vgextend dataset /dev/xvdl
# extend the dataset volume using the new EBS disk
lvextend -i 1 -r /dev/dataset/db /dev/xvdl

This will add space to the dataset volume, however, the fourth EBS volume is just appended to the original striped dataset volume, not striped, thus creating possible hotspots.

The proper way to expand the dataset so that it remains striped and therefore with optimum performance, is to add 4 new EBS disks, migrate the entire dataset on them, then release the old 4 disks.

The good news is that this can be done without downtime,  if you follow the steps below.

IMPORTANT: before proceeding, make sure you have valid backups of the database!!

Now, let's suppose you have added 4 EBS 1 TB disks  to the server already, with following path:  

/dev/xvd[ponm]

The original volume, after you added the "emergency" 1 TB EBS disk above, is using:

/dev/xvd[fghl]

The procedure involves migrating the dataset from one group of EBS disks to the other by setting up a mirror, let it sync, then dropping the old mirror side to release the old disks. The important step here is to create the mirror so that it will stripe across the EBS disks properly.

# add new EBS disks to LVM
pvcreate /dev/xvd[ponm]
# add new EBS disks to dataset volume
vgextend dataset /dev/xvd[ponm]
# create a mirror out of existing dataset volume
lvconvert -b --stripes 4 -m1 --mirrorlog core /dev/dataset/db /dev/xvd[ponm]

The above commands just add the new disks to LVM, then to the volume, and then the mirror process is started.
You have to wait until the mirror sync is complete, this may take some time, even days depending on the dataset size. You can verify it is synced using  the following command and output:

# lvs -a -o +devices
  LV                     VG      Attr             LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                                            
  bck                   backup  -wi-ao----     3.00t                                                                                       /dev/xvdi(0),/dev/xvdj(0),/dev/xvdk(0)             
  db                      dataset mwi-aom--- 4.00t                                                                  100.00           db_mimage_0(0),db_mimage_1(0)                      
  [db_mimage_0] dataset iwi-aom---   4.00t                                                                                        /dev/xvdf(0),/dev/xvdg(0),/dev/xvdh(0)             
  [db_mimage_0] dataset iwi-aom---   4.00t                                                                                        /dev/xvdl(0)                                       
  [db_mimage_1] dataset iwi-aom---   4.00t                                                                                        /dev/xvdm(0),/dev/xvdn(0),/dev/xvdo(0),/dev/xvdp(0)


You need to check that he copy sync column shows 100% synced.  You can run this command to see progress at any time. The syncing is done at very low priority and there is little to no overhead, and zero impact on database.   In the example above you can clearly see that the db_mimage0 logical volume, which includes the original side of the mirror,  is actually made by 3 striped disks (f, g, h) and one single striped disk (l), this is what we want to avoid and if you look at db_mimage1 (the new mirror side) you can clearly see it is properly striped over 4 disks now.

Once the mirror is in sync we can safely drop the original side of the mirror, then remove the physical disks from LVM and destroy them on the EBS side:

# removes old side of the mirror
lvconvert -m0 /dev/dataset/db /dev/xvd[fghl]
# removes EBS disks from the dataset VG
vgreduce dataset /dev/xvd[fghl]
# removes EBS disks from LVM
pvremove /dev/xvd[fghl]


Done!! You can safely go to the AWS console and detach the EBS disks you just removed from LVM.
Be careful not to remove the wrong disks, or you will destroy your dataset and database!