Thursday, February 23, 2017

When order of appearance of indexes matters in MySQL

Sometimes MySQL surprises you in ways you would have never imagined.

Would you think that the order in which the indexes appear in a table matters?
It does. Mind you, not the order of the columns - the order of the indexes.
The MySQL optimizer can, in specific circumstances, take different paths, sometimes with nasty effects.


Please consider the following table:

CREATE TABLE `mypartitionedtable` (
  `HASH_ID` char(64) NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `EXPIRE_DATE` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`),
  KEY `HASH_ID_IX` (`HASH_ID`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1 ROW_FORMAT=TOKUDB_UNCOMPRESSED
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(EXPIRE_DATE))
(PARTITION p2005 VALUES LESS THAN (1487847600) ENGINE = TokuDB,
 PARTITION p2006 VALUES LESS THAN (1487851200) ENGINE = TokuDB,
 PARTITION p2007 VALUES LESS THAN (1487854800) ENGINE = TokuDB,
 PARTITION p2008 VALUES LESS THAN (1487858400) ENGINE = TokuDB,
[ ... ]


This is a table where we store key-value data for a cache. The table is partitioned by range on the cache expire date, so we can actually apply retention to the cache data.

Now, consider the following DML:


DELETE FROM mypartitionedtable WHERE HASH_ID = 'somehash' AND EXPIRE_DATE > NOW() ;

One would think that the plan for the query above is optimal: it should use the HASH_ID_IX index and also prune partitions, looking only at partitions containing non-expired data.
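A quick sanity check (not part of the original scenario, just a sketch) is to ask EXPLAIN to show the partitions as well; on MySQL 5.6 this means the PARTITIONS keyword, while 5.7 and later print the partitions column by default:

mysql>EXPLAIN PARTITIONS DELETE FROM mypartitionedtable WHERE HASH_ID = 'somehash' AND EXPIRE_DATE > NOW();

The partitions column of the output then tells you which partitions the statement will actually touch.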

In reality, it depends.
Statistics for partitioned tables in MySQL are only computed on the largest partition.
This works well most of the time, *except* when the largest partition becomes one of the expired ones.
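If you want to see which partition is currently the largest - and therefore, per the above, where the statistics come from - a quick check against information_schema does the trick (the schema name here is hypothetical, and TABLE_ROWS is only an estimate):

SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = 'mydb'               -- your schema name here
  AND TABLE_NAME = 'mypartitionedtable'
ORDER BY TABLE_ROWS DESC
LIMIT 5;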

Let's see this in detail...

We have several partitions, each containing a different day's worth of data, based on the values of the EXPIRE_DATE column. For the purpose of this post, let's assume partition p654 contains yesterday's data, and partition p666 contains data whose expire date falls tomorrow.

If we explain a query against a partition that contains valid cache data, i.e. one where all rows have an EXPIRE_DATE in the future, everything looks good: the cost of a search by HASH_ID is much lower than the cost of a search by EXPIRE_DATE, as highlighted by the value of the "rows" column:


mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
| id | select_type | table            | type  | possible_keys  | key            | key_len | ref  | rows   | Extra                    |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5       | NULL | 291978 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)

mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p666) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | ref  | HASH_ID       | HASH_ID | 64      | const |    1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.01 sec)


But what happens if we explain the same query for a partition which has all rows with an EXPIRE_DATE in the past?


mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE HASH_ID = 'any hash value';
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table            | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | ref  | HASH_ID       | HASH_ID | 64      | const |    1 | Using where; Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql>EXPLAIN SELECT COUNT(*) FROM BIG_STORAGE_INNO PARTITION(p654) WHERE EXPIRE_DATE > NOW();
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
| id | select_type | table            | type  | possible_keys  | key            | key_len | ref  | rows | Extra                    |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX | EXPIRE_DATE_IX | 5       | NULL |    1 | Using where; Using index |
+----+-------------+------------------+-------+----------------+----------------+---------+------+------+--------------------------+
1 row in set (0.01 sec)


Whoa, cost is the same for both queries now!

This has a very bad effect, as the optimizer will choose to use EXPIRE_DATE to select our rows and disregard HASH_ID completely. And it will do this for all partitions, not just this one (where that plan would actually be fine).

Our DELETE above now becomes a locking scan of all partitions having EXPIRE_DATE > NOW(), which in our case means almost the entire table.

Ouch!!

This is, in my opinion, a bug - the fact that the optimizer only looks at the largest partition is... sub-optimal, if you ask me. But in this case the optimizer should also take into account that EXPIRE_DATE is the partitioning key and choose the other index, which would be the correct plan.
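As a stopgap, if you can touch the query, you can pin the plan yourself. Single-table DELETE does not accept index hints, but the multi-table form does, so something along these lines (a sketch, not what we ended up doing) should keep the optimizer on HASH_ID_IX:

DELETE t FROM mypartitionedtable AS t FORCE INDEX (HASH_ID_IX)
WHERE t.HASH_ID = 'somehash' AND t.EXPIRE_DATE > NOW();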



Now for the fun part of this post.  

I was puzzled about what was causing the optimizer to pick one index over the other, considering that they had identical costs. Unfortunately, EXPLAIN wasn't telling me anything useful, so I decided to instrument some debugging to find out.

Debugging the MySQL server usually involves either recompiling from source with debugging enabled, or (for lazy DBAs like myself) installing the debug package from your vendor of choice.

I was glad to find out that the stock Oracle distribution (Community edition) already includes a debug build of the server, called mysqld-debug. So I stopped the server, replaced mysqld with mysqld-debug, and restarted.
This was not enough, however, to get any debug output.  A quick look at the manual showed that there is a variable that needs to be set, simply called debug. More information about debugging MySQL is here; however, all that's needed is to set this global variable as follows:

mysql>set global debug='d:L:F:O';

This will send debugging output to the MySQL error log, with each line prefixed by source code file and line number - useful if you want to look at the source code to understand what's going on.
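When you are done, remember to switch the debug output back off (and to put the standard binary back in place), otherwise the error log will keep growing fast:

mysql>set global debug='';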

So I ran the explain for the DELETE on the partition with yesterday's data, and here is what I got (uninteresting debugging output removed for clarity):

 opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 17:28:51 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f58c00b1fd8 (ROR scans)  scans: HASH_ID
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key EXPIRE_DATE_IX, cost 5.81, records 4

The output above confirmed that the optimizer was giving the two indexes the exact same cost: 5.81.  The index on EXPIRE_DATE was selected (chosen: 1) and the other one discarded (cause: "cost").

To me, this looked quite like an arbitrary decision to pick the first index of the set, all costs being equal. 

Makes sense?

Well, it does, if the cost really is identical; but in my case, I think the optimizer should have considered that one of the two indexes is the partitioning key, and picked the other.

I said this was an arbitrary decision.
But the decision could just as well have been: pick the *last* index of the set, instead of the first.
Hey, let's patch the source code, I thought. But then an idea came to my mind.  What if the order of appearance of the indexes in the table was reversed? After all, the code simply scans all available indexes, one at a time, evaluating each of them and assigning it a cost.

I quickly created another identical table, but with the order of the two indexes reversed:

CREATE TABLE `BIG_STORAGE_INNO` (
  `HASH_ID` char(64) NOT NULL,
  `SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
  `COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `LAST_UPDATE` datetime NOT NULL,
  `EXPIRE_DATE` datetime NOT NULL,
  KEY `HASH_ID_IX` (`HASH_ID`),
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p654 VALUES LESS THAN (736741) ENGINE = InnoDB,
 PARTITION p655 VALUES LESS THAN (736742) ENGINE = InnoDB,
 PARTITION p656 VALUES LESS THAN (736743) ENGINE = InnoDB,
 PARTITION p657 VALUES LESS THAN (736744) ENGINE = InnoDB,
 PARTITION p658 VALUES LESS THAN (736745) ENGINE = InnoDB,
 [ .... ]


You can see that the order of the columns is the same as before, but the order of the two indexes is reversed - now HASH_ID_IX comes first. 

Guess what happens now in the optimizer?

opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 18:20:05 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f7da4095838 (ROR scans)  scans: HASH_ID_IX
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key HASH_ID_IX, cost 5.81, records 4


Bingo!!  The right index now comes first, and is chosen in every situation.
My problem is solved - well, I'd call it a workaround, although an interesting one.
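If rebuilding the table is not practical, the same effect should be achievable in place by dropping and re-adding EXPIRE_DATE_IX, which pushes it behind HASH_ID_IX in the table definition. This is just a sketch of the idea - verify the resulting order with SHOW CREATE TABLE afterwards:

ALTER TABLE BIG_STORAGE_INNO
  DROP INDEX EXPIRE_DATE_IX,
  ADD INDEX EXPIRE_DATE_IX (EXPIRE_DATE);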

I have raised a bug on the subject; I really believe that in this situation the fact that one of the two identically expensive indexes is the partitioning key should be taken into account.

For sure this has been an amusing bit of troubleshooting... and an interesting finding!


ADDENDUM:

I have just discovered that the order of appearance of indexes also matters (a lot) when using row-based replication with tables that have no primary key!

The SQL thread will use the first index in order of appearance to scan the table when it needs to search for rows to update... this may result in a very long time spent searching for rows if that first index is not very selective. You can spot this because the slave SQL thread will spend a lot of time in the Update_rows_log_event::find_row state for each update statement on a table without a PK.
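If you suspect this is biting you, a quick way to check (just a sketch against information_schema) is to look for a thread stuck in that state on the slave:

SELECT ID, STATE, TIME
FROM information_schema.PROCESSLIST
WHERE STATE LIKE 'Update_rows_log_event::find_row%';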

The fix is to make sure a very selective index appears first in the table. This may be a life saver (it was in my case!)

Hope this helps!

Rick



Monday, February 6, 2017

How to expand a striped LVM database volume in Amazon AWS without downtime

This procedure can be used to expand an LVM database volume on Amazon AWS (but it applies equally to any storage area network environment).
Let me start with this assumption: when you create volumes for database use in AWS using EBS, you stripe data across them in order to enhance performance.  If you aren't doing this... well, you should :-)
Under this assumption, when you need to add more disk space to an existing database volume, you can't just add the disk(s) to the volume, as the added space would not be striped and would eventually create hotspots in the dataset. The correct approach in this situation is to create enough new EBS disks to hold the entire dataset plus the desired extra space, so that you can grow the existing dataset while re-striping it properly.

To make this clear, let's suppose you have a 3 TB dataset volume made of three 1 TB EBS volumes striped across, but space is running out. You are in a hurry, so you quickly add a new 1 TB EBS volume to the AWS server, add it to the dataset volume group, and expand the logical volume, as follows:


# add the new EBS disk to LVM
pvcreate /dev/xvdl
# add the new EBS disk to the dataset volume
vgextend dataset /dev/xvdl
# extend the dataset volume using the new EBS disk
lvextend -i 1 -r -l +100%FREE /dev/dataset/db /dev/xvdl

This will add space to the dataset volume; however, the fourth EBS volume is just appended to the original striped volume, not striped into it, thus creating possible hotspots.
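You can see the problem directly in the segment layout (volume group and LV names as used in this post): the volume now has a 3-way striped segment plus a linear one on the new disk.

# show per-segment stripe count and backing devices
lvs --segments -o lv_name,stripes,devices dataset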

The proper way to expand the dataset so that it remains striped, and therefore keeps performing optimally, is to add 4 new EBS disks, migrate the entire dataset onto them, then release the old 4 disks.

The good news is that this can be done without downtime,  if you follow the steps below.

IMPORTANT: before proceeding, make sure you have valid backups of the database!!

Now, let's suppose you have already added 4 new 1 TB EBS disks to the server, with the following paths:

/dev/xvd[ponm]

The original volume, after you added the "emergency" 1 TB EBS disk above, is using:

/dev/xvd[fghl]

The procedure involves migrating the dataset from one group of EBS disks to the other by setting up a mirror, letting it sync, then dropping the old side of the mirror to release the old disks. The important step here is to create the mirror so that it stripes properly across the new EBS disks.

# add new EBS disks to LVM
pvcreate /dev/xvd[ponm]
# add new EBS disks to dataset volume
vgextend dataset /dev/xvd[ponm]
# create a mirror out of existing dataset volume
lvconvert -b --stripes 4 -m1 --mirrorlog core /dev/dataset/db /dev/xvd[ponm]

The above commands add the new disks to LVM, then to the volume group, and finally start the mirroring process.
You have to wait until the mirror sync is complete; this may take some time, even days, depending on the dataset size. You can verify it is synced using the following command and output:

# lvs -a -o +devices
  LV            VG      Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  bck           backup  -wi-ao---- 3.00t                                                     /dev/xvdi(0),/dev/xvdj(0),/dev/xvdk(0)
  db            dataset mwi-aom--- 4.00t                                    100.00           db_mimage_0(0),db_mimage_1(0)
  [db_mimage_0] dataset iwi-aom--- 4.00t                                                     /dev/xvdf(0),/dev/xvdg(0),/dev/xvdh(0)
  [db_mimage_0] dataset iwi-aom--- 4.00t                                                     /dev/xvdl(0)
  [db_mimage_1] dataset iwi-aom--- 4.00t                                                     /dev/xvdm(0),/dev/xvdn(0),/dev/xvdo(0),/dev/xvdp(0)


You need to check that the Cpy%Sync column shows 100.00. You can run this command at any time to see progress; the syncing is done at very low priority, with little to no overhead and zero impact on the database. In the example above you can clearly see that the db_mimage_0 logical volume, which is the original side of the mirror, is actually made of 3 striped disks (f, g, h) plus a single appended disk (l) - this is what we want to avoid - while db_mimage_1 (the new mirror side) is properly striped over 4 disks.
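If you'd rather not re-run the command by hand, something like this keeps an eye on the sync percentage for you (assuming the standard watch utility is available):

# poll mirror sync progress every 5 minutes
watch -n 300 "lvs -o lv_name,copy_percent dataset"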

Once the mirror is in sync, we can safely drop the original side of the mirror, then remove the old physical disks from LVM and destroy them on the EBS side:

# removes old side of the mirror
lvconvert -m0 /dev/dataset/db /dev/xvd[fghl]
# removes EBS disks from the dataset VG
vgreduce dataset /dev/xvd[fghl]
# removes EBS disks from LVM
pvremove /dev/xvd[fghl]


Done!! You can now safely go to the AWS console and detach the EBS disks you just removed from LVM.
Be careful not to detach the wrong disks, or you will destroy your dataset and database!
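A final sanity check before touching the AWS console doesn't hurt - the freed disks should no longer show up in either output:

# the old /dev/xvd[fghl] disks must be gone from the PV list...
pvs
# ...and the volume should only reference the new disks
lvs -a -o +devices dataset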