Thursday, February 23, 2017

When order of appearance of indexes matters in MySQL

Sometimes MySQL surprises you in ways you would have never imagined.

Would you think that the order in which the indexes appear in a table matters?
It does. Mind you, not the order of the columns - the order of the indexes.
The MySQL optimizer can, in specific circumstances, take different paths depending on it, sometimes with nasty effects.


Please consider the following table:

CREATE TABLE `mypartitionedtable` (
  `HASH_ID` char(64) NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `EXPIRE_DATE` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`),
  KEY `HASH_ID_IX` (`HASH_ID`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1 ROW_FORMAT=TOKUDB_UNCOMPRESSED
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(EXPIRE_DATE))
(PARTITION p2005 VALUES LESS THAN (1487847600) ENGINE = TokuDB,
 PARTITION p2006 VALUES LESS THAN (1487851200) ENGINE = TokuDB,
 PARTITION p2007 VALUES LESS THAN (1487854800) ENGINE = TokuDB,
 PARTITION p2008 VALUES LESS THAN (1487858400) ENGINE = TokuDB,
[ ... ]


This is a table where we store key-value data for a cache. The table is partitioned by range on the cache expire date, so we can actually apply retention to the cache data.
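
Retention itself can then be a cheap metadata operation: dropping the oldest partition removes a whole range of expired entries at once, which is much cheaper than deleting the rows. A sketch (partition names and boundary values are illustrative, not the exact maintenance job we run):

-- Drop the oldest, fully expired range and add a new one at the top.
ALTER TABLE mypartitionedtable DROP PARTITION p2005;
ALTER TABLE mypartitionedtable ADD PARTITION
  (PARTITION p2009 VALUES LESS THAN (1487862000) ENGINE = TokuDB);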

Now, consider the following DML:


DELETE FROM mypartitionedtable WHERE HASH_ID = 'somehash' AND EXPIRE_DATE > NOW() ;

One would think that the plan for the query above is optimal: it uses the HASH_ID_IX index and also prunes partitions, looking only at partitions containing non-expired data.

In reality, it depends.
Statistics for partitioned tables in MySQL are only computed on the largest partition.
This works well most of the time, *except* when the largest partition happens to be one of the expired ones.
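
One way to see which partition is currently the largest, and therefore drives the statistics, is to look at the estimated per-partition row counts (a sketch; the table name is the InnoDB copy used in the EXPLAINs below, and TABLE_ROWS is itself only an estimate):

-- The top row is the partition the statistics are taken from,
-- per the behaviour described above.
SELECT PARTITION_NAME, TABLE_ROWS
  FROM INFORMATION_SCHEMA.PARTITIONS
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME = 'BIG_STORAGE_INNO'
 ORDER BY TABLE_ROWS DESC;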

Let's see this in detail...

We have several partitions, each containing a different day's worth of data, based on the values of the EXPIRE_DATE column. For the purpose of this post, let's assume partition p654 contains yesterday's data, and partition p666 contains data with an expire date that falls tomorrow.

If we explain a query for a partition that contains valid cache data, i.e. a partition where all rows have an EXPIRE_DATE in the future, everything looks good: you can see that the cost of a search by HASH_ID is much lower than the cost of a search by EXPIRE_DATE, as highlighted by the value of the "rows" column:


mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
| id | select_type | table            | type  | possible_keys  | key            | key_len | ref  | rows   | Extra                    |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5       | NULL | 291978 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)

mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | ref  | HASH_ID       | HASH_ID | 64      | const |    1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.01 sec)


But what happens if we explain the same query for a partition which has all rows with an EXPIRE_DATE in the past?


mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | ref  | HASH_ID       | HASH_ID | 64      | const |    1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
| id | select_type | table            | type  | possible_keys  | key            | key_len | ref  | rows | Extra                    |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5       | NULL |    1 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
1 row in set (0.01 sec)


Whoa, cost is the same for both queries now!

This has a very bad effect: the optimizer will choose to use EXPIRE_DATE to select our rows and disregard HASH_ID completely.  And it will do this for all partitions, not just this one (where the plan would actually be OK).

Our DELETE above now becomes a locking scan of all partitions having EXPIRE_DATE > NOW(), which means (in our case) almost the entire table.

Ouch!!
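
You can see how wide the scan gets by explaining the statement with the partition list (a sketch: on 5.6 the PARTITIONS keyword is needed, while 5.7 and later always print a partitions column):

-- The partitions column should list every non-expired partition; the key column
-- shows which index is used (with the skewed statistics above, EXPIRE_DATE_IX
-- instead of HASH_ID_IX).
EXPLAIN PARTITIONS
DELETE FROM mypartitionedtable
 WHERE HASH_ID = 'somehash' AND EXPIRE_DATE > NOW();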

This is, in my opinion, a bug - the fact that the optimizer only looks at the largest partition is ... sub-optimal, if you ask me. But in this case the optimizer should also take into account that EXPIRE_DATE is the partitioning key and choose the other index, which would give the correct plan.



Now for the fun part of this post.  

I was puzzled about what was causing the optimizer to pick one index and not the other, considering that they had identical costs. Unfortunately, EXPLAIN wasn't telling me anything useful, so I decided to turn on some debugging to find out.

Debugging the MySQL server involves either recompiling from source with the debugging option enabled, or (for lazy DBAs like myself) installing the debug package from your vendor of choice.

I was glad to find out that the stock Oracle distribution (Community edition) already includes the debugging binary of the server, called mysqld-debug. So I stopped the server, replaced mysqld with mysqld-debug, and restarted.
This was not enough, however, to get any debug output.  A quick look at the manual revealed that there is a variable that needs to be set, simply called debug. The manual has more information about debugging MySQL, but all that's needed is to set this global variable as follows:

mysql>set global debug='d:L:F:O';

This causes debugging output to go to the MySQL error log, with each line prefixed by source file and line number - useful if you want to look at the source code to understand what's going on.
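
One thing worth remembering: this output is extremely verbose, so once you are done, set the variable back to an empty string, which clears the debug settings (or simply restart with the stock mysqld binary):

mysql>set global debug='';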

So I ran the EXPLAIN for the DELETE on the partition with yesterday's data, and here is what I got (uninteresting debugging output removed for clarity):

 opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 17:28:51 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f58c00b1fd8 (ROR scans)  scans: HASH_ID
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key EXPIRE_DATE_IX, cost 5.81, records 4

The output above confirmed that the optimizer was giving the two indexes the exact same cost: 5.81.  The index on EXPIRE_DATE was selected (chosen: 1) and the other one discarded (cause: "cost").

To me, this looked very much like an arbitrary decision to pick the first index of the set, all costs being equal.

Makes sense? 

Well, it does, if the cost is really identical; but in my case, I think the optimizer should have considered that one of the two indexes is the partitioning key, and picked the other one.

I said this was an arbitrary decision.
But the decision could just as well have been: pick the *last* index of the set, instead of the first.
Hey, let's patch the source code, I thought. But then an idea came to mind.  What if the order of appearance of the indexes in the table were reversed? After all, the code is simply scanning all available indexes, one at a time, evaluating each of them and assigning it a cost.

I quickly created another identical table, but with the order of the two indexes reversed:

CREATE TABLE `BIG_STORAGE_INNO` (
  `HASH_ID` char(64) NOT NULL,
  `SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
  `COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `LAST_UPDATE` datetime NOT NULL,
  `EXPIRE_DATE` datetime NOT NULL,
  KEY `HASH_ID_IX` (`HASH_ID`),
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p654 VALUES LESS THAN (736741) ENGINE = InnoDB,
 PARTITION p655 VALUES LESS THAN (736742) ENGINE = InnoDB,
 PARTITION p656 VALUES LESS THAN (736743) ENGINE = InnoDB,
 PARTITION p657 VALUES LESS THAN (736744) ENGINE = InnoDB,
 PARTITION p658 VALUES LESS THAN (736745) ENGINE = InnoDB,
 [ .... ]


You can see that the order of the columns is the same as before, but the order of the two indexes is reversed - now HASH_ID_IX comes first. 

Guess what happens now in the optimizer?

opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 18:20:05 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f7da4095838 (ROR scans)  scans: HASH_ID_IX
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key HASH_ID_IX, cost 5.81, records 4


Bingo!!  The right index now comes first, and is chosen in every situation.
My problem is solved - well, I'd call this a workaround, although an interesting one.
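
If recreating the table is not practical, the same effect can in principle be obtained in place by dropping and re-adding the index that should come last, so that the other one moves to the front of the definition. A sketch I have not benchmarked (and on a big table this is still an expensive operation):

-- Re-adding EXPIRE_DATE_IX places it after HASH_ID_IX in the table definition,
-- so HASH_ID_IX is evaluated first when the two costs tie.
ALTER TABLE BIG_STORAGE_INNO
  DROP KEY EXPIRE_DATE_IX,
  ADD KEY EXPIRE_DATE_IX (EXPIRE_DATE);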

I have raised a bug on the subject: I really believe that, in this situation, the fact that one of the two equally expensive indexes is the partitioning key should be taken into account.

For sure this has been an amusing troubleshooting session... and an interesting finding!


ADDENDUM:

I have just discovered that the order of appearance of indexes also matters (a lot) when using row-based replication with tables that have no primary key!

The SQL thread will use the first index in order of appearance to scan the table when it needs to locate rows for an update... this may result in a very long time spent searching for rows if that first index is not very selective. You can spot this because the slave thread will spend a lot of time in the Update_rows_log_event::find_row state for each update statement on a table without a primary key.

The fix is to make sure a very selective index appears first in the table definition. This can be a life saver (it was in my case!).
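
As a sketch of that fix (table, index and column names here are purely illustrative), the reordering is again just a drop and re-add of the index that should move to the back:

-- Hypothetical names: push the unselective index behind the selective one,
-- so the replication applier scans by the selective index (per the behaviour above).
ALTER TABLE some_table_without_pk
  DROP KEY low_selectivity_ix,
  ADD KEY low_selectivity_ix (some_column);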

Hope this helps!

Rick



6 comments:

  1. Hi Rick,

    Good analysis. You are right that if cost is the same, the optimizer will pick the index that is defined first.

    Note that the information you get from your debugging output is also available through the optimizer trace. See http://oysteing.blogspot.no/2016/01/how-to-get-optimizer-trace-for-query.html for information on how to enable and inspect the optimizer trace.

    Replies
    1. Hey, great tip - I didn't know about it. I gave it a try and it gives more or less the same information (maybe a bit more verbose), useful on production systems where you can't enable debugging mode.
      Oddly enough, the cost reported by the trace for the exact same situation differs from what is shown by the debugging info... I got 5.81 while the optimizer trace reports 2.21. Anyway, very useful indeed. Thanks!!

      Rick

  2. Percona put together a patch for TokuDB partitioned tables, covering the optimizer and statistics, that may be of interest to you:
    https://www.percona.com/doc/percona-server/5.6/tokudb/tokudb_background_analyze_table.html#tokudb-background-analyze-table

    Replies
    1. Thanks for your comment. It is interesting to see the introduction of this feature; however, even if TokuDB can analyze tables in the background, I'm afraid this still would not help, as the issue happens at the MySQL layer and not at the engine layer. I tried running ANALYZE TABLE directly and it didn't change the outcome (mind you, the bug is triggered when stats are updated and doesn't hit before that). In any case, running ANALYZE TABLE on a write-heavy workload is very likely to cause stalls...

  3. Does the InnoDB engine behave like this, or is it just a TokuDB engine issue?

    Replies
    1. This is a MySQL issue - it happens with any engine. I found it with TokuDB first and then reproduced it with InnoDB as well. The bug report I filed is based on InnoDB.
