Monday, December 11, 2017
After upgrading some of our slaves to the latest 5.7, I found what looks like a serious regression introduced in MySQL 5.7.
A couple of weeks ago I noticed that the error log of one of our clusters, where I had implemented my in-place transparent compression of binary logs, was literally flooded with the following error:
[ERROR] Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL
In the above setup this is a harmless error, and it should only happen at server startup, when mysqld opens and reads all available binary log files. The error occurs because the files are now compressed, so mysqld doesn't recognize them as valid - not an issue, since only older files are compressed, and only after ensuring that all slaves have already received them.
But I couldn't understand why the error was now hitting every minute or so, after upgrading from 5.6 to 5.7.
After a short investigation I was able to relate the errors to the exact moment when a binary log was closed and a new one created; this normally happens when the binary log file size reaches the value set in the max_binlog_size variable. This indicated that mysqld was trying to open and read all the available binary logs every time a file was rotated.
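The correlation is easy to verify, since a rotation can be forced at will instead of waiting for max_binlog_size to be reached; a minimal sketch of what I mean, assuming a test replica and enough privileges to run FLUSH:

SHOW GLOBAL VARIABLES LIKE 'max_binlog_size';   -- rotation threshold (100 MB in our setup)
SHOW GLOBAL VARIABLES LIKE 'expire_logs_days';  -- time-based retention (4 days here)
FLUSH BINARY LOGS;   -- closes the current binlog and opens a new one, triggering the scan
SHOW BINARY LOGS;    -- lists every binlog file mysqld will try to open and read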
Ouch! This wasn't a nice finding.
On this server we have in excess of 3,500 binary log files, and I know of installations with many more (mind you, we only have a retention of 4 days here). We like to keep binlog files small, sized at 100 MB, as this makes it easier to peruse them when troubleshooting with mysqlbinlog. Not that it would change anything: having 3,500 files of 100 MB or 350 files of 1 GB generates exactly the same amount of I/O when you have to read them all....
Anyway, I had just found that every time a binary log is rotated, mysqld needs to open all of the files. What a waste of I/O and performance, I thought. So I decided to have a look at the code to understand what mysqld was actually trying to do with those binlogs.
It wasn't too difficult to narrow the binlog scan down to the function purge_logs_before_date(), which expires logs based on time, as set by the expire_logs_days variable.
But reading the binary log files shouldn't really be needed when you're only looking to expire them, so this looked weird to me.
Further digging into the source code showed that mysqld is actually looking for "lost GTIDs" in the binlog files, and it does so every time a binary log is closed for rotation. I haven't looked deeper into what a "lost" GTID means in the code, but whatever the reason for this search, it surely shouldn't be repeated at every binlog rotation.
The weird thing is, we aren't even using GTIDs!
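(For the record, confirming that GTIDs are off is a one-liner; both variables below should report OFF on a GTID-less setup like ours.)

SHOW GLOBAL VARIABLES LIKE 'gtid_mode';
SHOW GLOBAL VARIABLES LIKE 'enforce_gtid_consistency';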
But mysqld still opens and reads ALL the available binary log files, all the time. This is a serious regression compared to 5.6, where this wasn't happening, and it can hit you hard when you have a large number of binary log files. Caveat emptor.
I have raised a bug for the above, and really hope Oracle gives it the needed attention.
Monday, November 6, 2017
When xtrabackup fails with 'read' returned OS error 0 after upgrading to 5.7
Here's something that has puzzled me for several weeks.
Right after migrating MySQL from 5.6 to 5.7, we started experiencing random xtrabackup failures on some, but not all, of our slaves.
The failures only happened when taking an incremental backup, and the backup would always fail on the same table on each slave, with errors similar to the following:
171106 13:00:33 [01] Streaming ./gls/C_GLS_IDS_AUX.ibd
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: Retry attempts for reading partial data failed.
InnoDB: Tried to read 262144 bytes at offset 0, but was only able to read 114688
InnoDB: File (unknown): 'read' returned OS error 0. Cannot continue operation
InnoDB: Cannot continue operation.
Searching the web didn't yield any useful results.
The error was happening on one slave but not on another, with the OS, MySQL and xtrabackup versions being the same:
- CentOS release 6.8 (Final)
- xtrabackup version 2.4.8 based on MySQL server 5.7.13 Linux (x86_64) (revision id: 97330f7)
- mysqld Ver 5.7.19-17 for Linux on x86_64 (Percona Server (GPL), Release 17, Revision e19a6b7b73f)
I even tried to dump the offending schema from a failing slave and import it into a non-failing slave, but on the non-failing slave the same table in that schema backed up just fine.
This went on for a while, and I simply stopped running incremental backups on the migrated slaves. Then today it happened again on yet another slave we had just migrated, and I had an intuition: since the incremental backup only started failing after the migration, it had to be related to the migration itself. So I started suspecting some incompatibility between 5.7 and/or xtrabackup and tablespaces created with MySQL 5.6.
Verifying my suspicion was as easy as running:
ALTER TABLE C_GLS_IDS_AUX ENGINE=InnoDB;
Et voilà!! The rebuilt tablespace no longer crashed the incremental backup.
Eventually, another table(space) would crash the backup again later, but running a NULL alter table on it and relaunching the backup would fix that. Wash, rinse, repeat until the incremental completes, and you're done.
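If many tablespaces turn out to be affected, the null rebuilds can be generated in bulk from information_schema instead of hunting them down one by one; a quick sketch (the gls schema name comes from the log above, adjust as needed):

-- Generate a null ALTER for every InnoDB table in the affected schema;
-- review the output, then run the statements between backup attempts.
SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` ENGINE=InnoDB;')
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'gls'
  AND ENGINE = 'InnoDB';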
Hope this helps someone save some time.
And if you are looking for xtrabackup-based backup scripts, feel free to check out my own set here on my GitHub page.
ADDENDUM
Looks like I have been a little overenthusiastic about this.
In fact, the problem reappeared the day after writing this blog post - apparently, the null alter table only fixes it temporarily.
Turns out this has to be some incompatibility between Xtrabackup 2.4.x and MySQL 5.7.x, as I was able to reproduce the issue. Please see this bug if you are interested in the details.
Monday, July 3, 2017
SET PASSWORD will lock you out of your slave in a mixed 5.6/5.7 MySQL topology
Long time no post.... :-)
Here's something interesting.
Last week I decided to give MySQL 5.7 a try (yes, I am a kinda conservative DBA...) and the very same day I installed my first 5.7 replica I noticed that, after changing my own password on the 5.6 master, I could no longer connect to the 5.7 slave.
Very annoying, to say the least! So I went and dug out the root password (which we do not normally use) and, when I connected to the slave, I was surprised to see that my password's hash on the 5.7 slave was different from the hash on the 5.6 master. No wonder I couldn't connect....
A bit of research in the MySQL documentation and I understood that 5.7 introduced a few changes around the way you work with users' passwords. SET PASSWORD is now deprecated in favour of ALTER USER: see the MySQL 5.7 Reference Manual page.
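For reference, here is the old statement next to its 5.7 replacement (the account name below is just an example):

-- Deprecated in 5.7 (and the statement that bit me here):
SET PASSWORD FOR 'rick'@'%' = PASSWORD('new_password');

-- The 5.7 way:
ALTER USER 'rick'@'%' IDENTIFIED BY 'new_password';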
So far so good, but it's obvious that when Oracle implemented those changes, they didn't consider upgrade paths, or mixed 5.6/5.7 topologies that are commonly found during a major MySQL version upgrade, like in this case.
Turns out there is a bug first reported by Gillian Gunson back in September 2016 (SET PASSWORD replicates incorrectly between 5.6 and 5.7) which Oracle acknowledged but didn't care to fix at the time of this blog post. Sorry for the rant, but I really feel this to be a real showstopper to any migration where there are many database users and where a password change may happen quite often.
As Gillian pointed out in the above bug, what really happens is that the password is hashed twice, and on the replica we get a hash of the already hashed password - no wonder it doesn't work :-)
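Comparing the stored hashes on both sides makes the double hashing easy to spot; a sketch of the checks I mean (the password column was merged into authentication_string in 5.7):

-- On the 5.6 master:
SELECT User, Host, Password FROM mysql.user WHERE User = 'rick';

-- On the 5.7 slave (a different string here means the value was hashed twice):
SELECT User, Host, authentication_string FROM mysql.user WHERE User = 'rick';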
The good news is that this is really an easy fix. I was able to patch it myself in a few minutes, and the patch is one line of code.... you can find the diff below; it was applied to the 5.7.18 Percona source code, but I believe it will be the same in Oracle MySQL, regardless of the minor version.
Basically, I added a check to avoid hashing the password if the thread is a slave thread and the command is a SET PASSWORD. This check may break the use of SET PASSWORD when the master is also on 5.7 - I didn't verify that - but hey, SET PASSWORD is deprecated anyway... and I will only use my patched version of 5.7 during the migration.
Enjoy!
*** sql/auth/sql_user.cc	Mon Jul 3 16:18:55 2017
--- sql/auth/sql_user.cc.patch	Mon Jul 3 16:07:01 2017
***************
*** 565,570 ****
--- 565,574 ----
      */
    if ((Str->uses_identified_by_clause ||
        (Str->auth.length == 0 && !user_exists))
+       // Rick Pizzi: don't change to hash if thread is a replication thread and a
+       // SET PASSWORD was issued on master, to avoid double hash
+       // (fix for https://bugs.mysql.com/bug.php?id=82979)
+       && (!thd->slave_thread || thd->lex->sql_command != SQLCOM_SET_OPTION))
    {
      st_mysql_auth *auth= (st_mysql_auth *) plugin_decl(plugin)->info;
      inbuf= Str->auth.str;
Tuesday, May 9, 2017
Slides from Percona Live 2017: Caching your application data with MySQL and TokuDB
Here are the slides from my talk at Percona Live 2017 conference in Santa Clara, CA.
This has been my first opportunity to speak to a public audience and it has been really great!
Looking forward to speaking about MySQL again soon. Thank you!!
Wednesday, April 5, 2017
Caching your application data with MySQL and TokuDB @ Percona Live 2017
The great Percona Live 2017 database conference is approaching, and I am very excited to be there this year - not only as an attendee, but also as a speaker!!
By the way, this will be my first public talk so please bear with me if I'll be a bit nervous :-)
If you are attending, come see me and my colleague Andrea speaking about how we leverage the TokuDB engine to cache our application data using MySQL! We will show you some interesting benchmarks, and will describe our design, setup, and configuration in detail.
See you there!!
Tuesday, March 7, 2017
Using xtrabackup with MariaDB and multi-source replication
Quick post today, because I wanted to share a snippet (actually, a patch) that allows the great xtrabackup tool to take backups of MariaDB when multi-source replication is in use.
Granted, this is already supported, except that the slave(s) position is not saved when the backup is taken, which somewhat defeats the purpose of the backup: it should let you rebuild the server and resync it with its master(s) in case of disaster.
A workaround for using xtrabackup with MariaDB exists, but it only works when you have a single master: just set the default_master_connection variable in your my.cnf to the name of your master connection, and you are set - in this situation, SHOW SLAVE STATUS works exactly the same as in stock MySQL, and xtrabackup is happy.
The trouble begins when you have more than one master.
Each master connection has its own name, and in order to get the slave position you have to issue a SHOW SLAVE 'connectionname' STATUS.
An easier way to get information for all of them is to issue SHOW ALL SLAVES STATUS, which reports the same information as SHOW SLAVE STATUS for each connection you have defined.
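In practice the two forms look like this ('master1' is just an example connection name):

-- One connection at a time:
SHOW SLAVE 'master1' STATUS\G

-- All connections at once, same columns plus Connection_name:
SHOW ALL SLAVES STATUS\G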
The bad news is that this syntax is not understood by xtrabackup, so if you happen to use MariaDB with multiple masters, you have no way to save the slave position of each connection when you take a backup.
So here comes the little snippet below - actually a UNIX-style diff against the innobackupex script, which is part of xtrabackup, tested with 2.2.12 and above. It patches the script so that the information for each connection is properly saved into the xtrabackup_slave_info file when the backup is taken.
This will not work on version 2.3, because there the innobackupex script is a hard link to the xtrabackup binary, so no easy patching is possible.
Please note - this is only tested with old-school replication, so if you are using GTIDs you'll need to modify the patch, but that shouldn't be too hard if you follow the changes I made.
Hope this helps. Enjoy!
$ diff -u /usr/bin/innobackupex /usr/bin/innobackupex.rick
--- /usr/bin/innobackupex 2016-02-02 11:20:55.000000000 +0100
+++ /usr/bin/innobackupex.rick 2017-03-07 16:29:19.000000000 +0100
@@ -3263,12 +3263,27 @@
if (!defined($master) || !defined($filename) || !defined($position)) {
my $now = current_time();
+ my $mspos = '';
- print STDERR "$now $prefix Failed to get master binlog coordinates " .
- "from SHOW SLAVE STATUS\n";
- print STDERR "$now $prefix This means that the server is not a " .
- "replication slave. Ignoring the --slave-info option\n";
+ # Rick
+ print STDERR "$now trying multisource replication setup\n";
+ my $rows = $con->{dbh}->selectall_hashref("SHOW ALL SLAVES STATUS", 'Connection_name');
+ foreach my $conn_id (keys %$rows) {
+ $mspos = $mspos
+ . "Connection $rows->{$conn_id}{Connection_name} "
+ . "Master Host $rows->{$conn_id}{Master_Host}, "
+ . "File $rows->{$conn_id}{Relay_Master_Log_File}, "
+ . "Position $rows->{$conn_id}{Exec_Master_Log_Pos}\n";
+ }
+ write_to_backup_file("$slave_info", $mspos);
+ if ($mspos eq '') {
+
+ print STDERR "$now $prefix Failed to get master binlog coordinates " .
+ "from SHOW SLAVE STATUS\n";
+ print STDERR "$now $prefix This means that the server is not a " .
+ "replication slave. Ignoring the --slave-info option\n";
+ }
return;
}
Thursday, February 23, 2017
When order of appearance of indexes matters in MySQL
Sometimes MySQL surprises you in ways you would have never imagined.
Would you think that the order in which the indexes appear in a table matters?
It does. Mind you, not the order of the columns - the order of the indexes.
The MySQL optimizer can, in specific circumstances, take different paths, sometimes with nefarious effects.
Please consider the following table:
CREATE TABLE `mypartitionedtable` (
`HASH_ID` char(64) NOT NULL,
`RAW_DATA` mediumblob NOT NULL,
`EXPIRE_DATE` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`),
KEY `HASH_ID_IX` (`HASH_ID`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1 ROW_FORMAT=TOKUDB_UNCOMPRESSED
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(EXPIRE_DATE))
(PARTITION p2005 VALUES LESS THAN (1487847600) ENGINE = TokuDB,
PARTITION p2006 VALUES LESS THAN (1487851200) ENGINE = TokuDB,
PARTITION p2007 VALUES LESS THAN (1487854800) ENGINE = TokuDB,
PARTITION p2008 VALUES LESS THAN (1487858400) ENGINE = TokuDB,
[ ... ]
This is a table where we store key-value data for a cache. The table is partitioned by range on the cache expire date, so we can actually apply retention to the cache data.
Now, consider the following DML:
DELETE FROM mypartitionedtable WHERE HASH_ID = 'somehash' AND EXPIRE_DATE > NOW();
One would think that the plan for the query above is optimal: it goes by the HASH_ID_IX index and also prunes partitions, only looking at those containing non-expired data.
In reality, it depends.
Statistics for partitioned tables in MySQL are only computed on the largest partition.
This works well most of the time, *except* when the largest partition becomes one of the expired ones.
Let's see this in detail...
We have several partitions, each partition contains a different day worth of data, based on values of the EXPIRE_DATE column. For the purpose of this post, let's assume partition p654 contains yesterday's data, and partition p666 contains data with an expire date which falls tomorrow.
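One way to peek at what the optimizer has to work with is the persistent statistics that InnoDB keeps per partition and per index; a sketch against the InnoDB copy of the table used in the EXPLAINs below (the database name is a placeholder):

-- Per-partition, per-index estimate of distinct values on the first key column:
SELECT table_name, index_name, stat_value
FROM mysql.innodb_index_stats
WHERE database_name = 'mydb'
  AND table_name LIKE 'BIG_STORAGE_INNO#P#p6%'
  AND stat_name = 'n_diff_pfx01';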
If we explain a query for a partition that contains valid cache data, i.e. a partition that has all rows with EXPIRE_DATE in the future, everything looks good: you can see that the cost for a search by HASH_ID is much less than the cost for a search by EXPIRE_DATE, as highlighted by the value of the "rows" column:
mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
| 1 | SIMPLE | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5 | NULL | 291978 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)
mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| 1 | SIMPLE | BIG_STORAGE_INNO | ref | HASH_ID | HASH_ID | 64 | const | 1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.01 sec)
But what happens if we explain the same query for a partition which has all rows with an EXPIRE_DATE in the past?
mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| 1 | SIMPLE | BIG_STORAGE_INNO | ref | HASH_ID | HASH_ID | 64 | const | 1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)
mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
| 1 | SIMPLE | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5 | NULL | 1 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
1 row in set (0.01 sec)
Whoa, cost is the same for both queries now!
This has a very bad effect: the optimizer will choose EXPIRE_DATE to select our rows and disregard HASH_ID completely. And it will do this for all partitions, not just this one (where the plan would actually be fine).
Our DELETE above now becomes a locking scan of all partitions having EXPIRE_DATE > NOW(), which means (in our case) almost the entire table.
Ouch!!
This is in my opinion a bug - the fact that the optimizer only looks at the largest partition is ... sub-optimal, if you ask me. But in this case, the optimizer should also evaluate that EXPIRE_DATE is a partitioning key and choose the other index for the query, which would be correct.
Now for the fun part of this post.
I was puzzled about what was causing the optimizer to pick one index and not the other, considering that they had an identical cost. Unfortunately, EXPLAIN wasn't telling me anything useful, so I decided to instrument some debugging to find out.
Debugging the MySQL server involves either recompiling from source with the debugging option enabled, or (for lazy DBAs like myself) installing the debug package from your vendor of choice.
I was glad to find out that the stock Oracle distribution (Community edition) already includes the debugging binary of the server, called mysqld-debug. So I stopped the server, replaced mysqld with mysqld-debug, and restarted.
This was not enough, however, to get any debug output. A quick look at the manual and I found that there is a variable that needs to be set, simply called debug. More information about debugging MySQL is here; however, all that's needed is to set this global variable as follows:
mysql>set global debug='d:L:F:O';
This will cause debugging output to go to the MySQL error log. The output is prefixed with the source code file and line, which is useful if you want to look at the source code to understand what's going on.
So I ran the explain for the DELETE on the partition with yesterday's data, and here is what I got (uninteresting debugging output removed for clarity):
opt_range.cc: 264: get_key_scans_params: opt: range_scan_alternatives: starting struct
opt_range.cc: 264: get_key_scans_params: opt: (null): starting struct
opt_range.cc: 299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
opt_range.cc: 9541: check_quick_select: exit: Records: 4
opt_range.cc: 264: get_key_scans_params: opt: ranges: starting struct
opt_range.cc: 299: get_key_scans_params: opt: (null): "2017-02-22 17:28:51 < EXPIRE_DATE"
opt_range.cc: 279: get_key_scans_params: opt: ranges: ending struct
opt_range.cc: 313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
opt_range.cc: 313: get_key_scans_params: opt: rowid_ordered: 0
opt_range.cc: 313: get_key_scans_params: opt: using_mrr: 0
opt_range.cc: 313: get_key_scans_params: opt: index_only: 0
opt_range.cc: 336: get_key_scans_params: opt: rows: 4
opt_range.cc: 347: get_key_scans_params: opt: cost: 5.81
opt_range.cc: 313: get_key_scans_params: opt: chosen: 1
opt_range.cc: 279: get_key_scans_params: opt: (null): ending struct
opt_range.cc: 264: get_key_scans_params: opt: (null): starting struct
opt_range.cc: 299: get_key_scans_params: opt: index: "HASH_ID"
opt_range.cc: 9541: check_quick_select: exit: Records: 4
opt_range.cc: 264: get_key_scans_params: opt: ranges: starting struct
opt_range.cc: 299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
opt_range.cc: 279: get_key_scans_params: opt: ranges: ending struct
opt_range.cc: 313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
opt_range.cc: 313: get_key_scans_params: opt: rowid_ordered: 1
opt_range.cc: 313: get_key_scans_params: opt: using_mrr: 0
opt_range.cc: 313: get_key_scans_params: opt: index_only: 0
opt_range.cc: 336: get_key_scans_params: opt: rows: 4
opt_range.cc: 347: get_key_scans_params: opt: cost: 5.81
opt_range.cc: 313: get_key_scans_params: opt: chosen: 0
opt_range.cc: 299: get_key_scans_params: opt: cause: "cost"
opt_range.cc: 279: get_key_scans_params: opt: (null): ending struct
opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f58c00b1fd8 (ROR scans) scans: HASH_ID
opt_range.cc: 5676: get_key_scans_params: info: Returning range plan for key EXPIRE_DATE_IX, cost 5.81, records 4
By looking at the output above, it was confirmed that the optimizer was giving the two indexes the exact same cost: 5.81. The index on EXPIRE_DATE was selected (chosen: 1) and the other one discarded (cause: "cost").
To me, this looked quite like an arbitrary decision to pick the first index of the set, all costs being equal.
Makes sense?
Well, it does, if cost is really identical, but in my case, I think the optimizer should have considered that one of the two indexes is a partitioning key, and pick the other.
I said this was an arbitrary decision.
But the same decision could just as well have been: pick the *last* index of the set, instead of the first.
Hey, let's patch the source code, I thought. But then an idea came to mind: what if the order of appearance of the indexes in the table were reversed? After all, the code simply scans all available indexes, one at a time, evaluating them and assigning each a cost.
I quickly created another identical table, but with the order of the two indexes reversed:
CREATE TABLE `BIG_STORAGE_INNO` (
`HASH_ID` char(64) NOT NULL,
`SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
`COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
`RAW_DATA` mediumblob NOT NULL,
`LAST_UPDATE` datetime NOT NULL,
`EXPIRE_DATE` datetime NOT NULL,
KEY `HASH_ID_IX` (`HASH_ID`),
KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p654 VALUES LESS THAN (736741) ENGINE = InnoDB,
PARTITION p655 VALUES LESS THAN (736742) ENGINE = InnoDB,
PARTITION p656 VALUES LESS THAN (736743) ENGINE = InnoDB,
PARTITION p657 VALUES LESS THAN (736744) ENGINE = InnoDB,
PARTITION p658 VALUES LESS THAN (736745) ENGINE = InnoDB,
[ .... ]
You can see that the order of the columns is the same as before, but the order of the two indexes is reversed - now HASH_ID_IX comes first.
Guess what happens now in the optimizer?
opt_range.cc: 264: get_key_scans_params: opt: range_scan_alternatives: starting struct
opt_range.cc: 264: get_key_scans_params: opt: (null): starting struct
opt_range.cc: 299: get_key_scans_params: opt: index: "HASH_ID_IX"
opt_range.cc: 9541: check_quick_select: exit: Records: 4
opt_range.cc: 264: get_key_scans_params: opt: ranges: starting struct
opt_range.cc: 299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
opt_range.cc: 279: get_key_scans_params: opt: ranges: ending struct
opt_range.cc: 313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
opt_range.cc: 313: get_key_scans_params: opt: rowid_ordered: 1
opt_range.cc: 313: get_key_scans_params: opt: using_mrr: 0
opt_range.cc: 313: get_key_scans_params: opt: index_only: 0
opt_range.cc: 336: get_key_scans_params: opt: rows: 4
opt_range.cc: 347: get_key_scans_params: opt: cost: 5.81
opt_range.cc: 313: get_key_scans_params: opt: chosen: 1
opt_range.cc: 279: get_key_scans_params: opt: (null): ending struct
opt_range.cc: 264: get_key_scans_params: opt: (null): starting struct
opt_range.cc: 299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
opt_range.cc: 9541: check_quick_select: exit: Records: 4
opt_range.cc: 264: get_key_scans_params: opt: ranges: starting struct
opt_range.cc: 299: get_key_scans_params: opt: (null): "2017-02-22 18:20:05 < EXPIRE_DATE"
opt_range.cc: 279: get_key_scans_params: opt: ranges: ending struct
opt_range.cc: 313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
opt_range.cc: 313: get_key_scans_params: opt: rowid_ordered: 0
opt_range.cc: 313: get_key_scans_params: opt: using_mrr: 0
opt_range.cc: 313: get_key_scans_params: opt: index_only: 0
opt_range.cc: 336: get_key_scans_params: opt: rows: 4
opt_range.cc: 347: get_key_scans_params: opt: cost: 5.81
opt_range.cc: 313: get_key_scans_params: opt: chosen: 0
opt_range.cc: 299: get_key_scans_params: opt: cause: "cost"
opt_range.cc: 279: get_key_scans_params: opt: (null): ending struct
opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f7da4095838 (ROR scans) scans: HASH_ID_IX
opt_range.cc: 5676: get_key_scans_params: info: Returning range plan for key HASH_ID_IX, cost 5.81, records 4
Bingo!! The right index comes first now, and is chosen in every situation.
My problem is solved - well, I call this a workaround, although an interesting one.
I have raised a bug on the subject; I really believe that, in this situation, the fact that one of the two identically expensive indexes is a partitioning key should be taken into account.
For sure this has been an amusing troubleshooting... and an interesting finding!
ADDENDUM:
I have just discovered that the order of appearance of indexes also matters (a lot) when using row based replication and tables without a primary key!
The SQL thread will use the first index in order of appearance to scan the table if it needs to search for rows to update.... this may result in a very long time spent searching for rows if the first index in the table is not very selective. You can spot this because the slave thread will spend a lot of time in the Update_rows_log_event::find_row state for each update statement on the table without the PK.
The fix is to make sure a very selective index appears first in the table. This may be a life saver (it was in my case!)
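If adding a primary key is not immediately possible, reordering the indexes is just a matter of dropping the offending one and adding it back, which pushes it to the end of the list; a sketch against the table from this post (this rebuilds the index, so plan accordingly on a large table):

-- After this, HASH_ID_IX comes first in order of appearance,
-- so the SQL thread will use it to locate the rows to update:
ALTER TABLE BIG_STORAGE_INNO
  DROP KEY EXPIRE_DATE_IX,
  ADD KEY EXPIRE_DATE_IX (EXPIRE_DATE);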
Hope this helps!
Rick