Tuesday, September 27, 2016

If enforcing data retention via DROP PARTITION, better watch your auto_increment counter

One of the main tasks that any DBA has to deal with is certainly data retention. Especially when your developers like to store logging and tracking information inside the database. This is not something that I would recommend (don't try it at home!), but when you're late to the party (read: someone took this path before you joined), then you'll have to live with it.

Data retention in MySQL is usually applied by partitioning the table by RANGE on a selected column, usually containing the row creation timestamp. New partitions are created in advance for the coming days, weeks or months, and a script runs nightly to drop partitions that are older than the configured retention period.
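To give an idea (table and partition names are made up), such a setup typically looks like the sketch below: a table partitioned by RANGE on the creation timestamp, with the nightly job adding partitions ahead of time and dropping the expired ones.

CREATE TABLE app_log (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  created_at DATETIME NOT NULL,
  message VARCHAR(255),
  PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p20160925 VALUES LESS THAN (TO_DAYS('2016-09-26')),
  PARTITION p20160926 VALUES LESS THAN (TO_DAYS('2016-09-27')),
  PARTITION p20160927 VALUES LESS THAN (TO_DAYS('2016-09-28'))
);

-- the nightly job adds new partitions in advance...
ALTER TABLE app_log ADD PARTITION
  (PARTITION p20160928 VALUES LESS THAN (TO_DAYS('2016-09-29')));

-- ...and drops the ones that fall outside the retention window
ALTER TABLE app_log DROP PARTITION p20160925;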

There are many tools that can be used to automate data retention; I chose the excellent pdb-parted, a nice and cozy Perl script that you can find in the PalominoDB repository on GitHub (since PalominoDB is no longer in existence, I have forked the repository under my own account, so you can also get the script here).

It goes without saying that when I first joined, although there were many tables already containing millions of rows of logging, statistical and tracking data, there were no pruners in effect, i.e. no retention at all was applied. Over the following weeks and months I managed to get the required information from the various dev teams and started pruning the bigger tables. After a couple of months of strict diet, I had recovered about 250 GB of space from our main cluster, or about 30% of the total dataset size at that time. Not bad! Backup and restore times improved, and their footprint was much smaller too.

Over the next several months the company kept growing very fast. A DWH team was set up and started ingesting data from MySQL into their own systems (Vertica based). Some of the ingested data was coming from these pruned tables.

Everything went well for some time, then I received a visit from one of the DWH guys with a strange issue: they were seeing duplicate keys in the data ingested from one of the partitioned tables.  I knew that, in theory, this should not be possible, as each and every table of that type has an auto increment primary key, and we had no reports of duplicate keys in production when data was first inserted.

At first I tried to blame the ETLs they use to import data from MySQL into Vertica, but I soon realised that the table they were having the issue with had a notable difference: it was really small in size, only a few hundred rows, while all the other tables contained hundreds of millions. So I asked them for the MAX value of the auto increment key they had in their systems and, to my surprise, it was higher than the MAX value we had in MySQL!!! D'oh.

After some investigation it turned out that this particular table was only used for debugging, therefore the application was not writing to it unless debug mode was activated. Usually, debug was only turned on for a short amount of time (e.g. a few hours) and then turned off again, and the saved data was used for troubleshooting.

Anyways, it was clear that somehow the auto_increment counter of this table was getting reset at some point; it was the only possible explanation for having a higher value in DWH than what we had in MySQL, as DWH was basically ingesting our rows directly, without any modification whatsoever.

It was easy enough for me to point the finger here, as the only statement that was issued against this table, apart from INSERT, was.... DROP PARTITION.

You will object that DROP PARTITION does not reset the auto_increment counter of a table. Of course I agree, but there was no other possibility here, so it had to be some kind of bug.

It was easy enough to reproduce the behaviour and to confirm that, if the partition you are dropping leaves no other rows in the table, the auto_increment counter is, in fact, reset to 1. You can find the steps in the bug I have submitted to Oracle (please see below).
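For reference, here is a minimal sketch along the lines of that test case (names are made up; the exact steps are in the bug report):

CREATE TABLE t_ai (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  created_at DATETIME NOT NULL,
  PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p201608 VALUES LESS THAN (TO_DAYS('2016-09-01')),
  PARTITION p201609 VALUES LESS THAN (TO_DAYS('2016-10-01'))
);

INSERT INTO t_ai (created_at) VALUES ('2016-08-15 10:00:00'), ('2016-08-16 10:00:00');
-- the auto_increment counter is now 3

-- all rows live in p201608, so dropping it leaves the table empty
ALTER TABLE t_ai DROP PARTITION p201608;

INSERT INTO t_ai (created_at) VALUES ('2016-09-15 10:00:00');
-- this row gets id 1 again instead of 3: the counter was reset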

In my opinion, the behaviour of auto_increment in this situation is not consistent:

  • TRUNCATE TABLE (DDL) empties the table, counter is reset => OK
  • DELETE FROM table (DML) empties the table, counter is NOT reset => OK
  • DROP PARTITION (DDL) does NOT empty the table, merely drops some data - counter is reset or not depending on the content of the other partitions => WEIRD
I really feel that the auto_increment counter should be left alone when DROP PARTITION is issued, regardless of what the content of the other partitions is.

I invite you to check the bug for the discussion I had with the Oracle bugs team, where at first they tried to convince me that this behaviour was correct, then acknowledged the issue, but there has not been much activity on it since.



Monday, September 26, 2016

SHOW WARNINGS can stall your MySQL server if compressed protocol is used

If there is something that I love about information technology, it is the incredible number of unpredictable ways that bugs can bite you where and when you least expect it.

The one I'm about to describe is a serious one, yet there has been a very quiet response from the Percona bug team at Launchpad, where I first submitted it (because we use Percona Server here). If you have read my other posts you know by now that I am a laz... err, busy guy, so bear with me if only today I have verified that this is, in fact, an upstream bug and raised it with Oracle too.
It affects 5.6.33, the latest version at the time of this article.

Anyways, the bug started to manifest itself in the form of complete stalls of our main cluster, which happened at different times and apparently in a way that was unrelated to the amount of traffic hitting the master. When stalling, system CPU time was topping 100% of the total available CPUs; no wonder the server was unresponsive...

Complete stalls are not nice: everything is frozen and you can't even check what's going on, as MySQL would not respond to any command. In some cases this lasted just a few seconds; in others we had stalls which lasted well over a minute.

We have the habit of running a script that takes snapshots of "SHOW ENGINE INNODB STATUS" every 10 seconds on each master, and saves them to files; we use these to do post-mortem analysis of incidents that hit production (traffic spikes and the like). But this time, the snapshots didn't give any useful information: the DB was just running smoothly, low traffic, no sign of anything suspicious. A few seconds later, total stall. Then smooth again. Weird.

I had to find a way to catch the bug while it was biting. I decided to craft a small script to enable the general log for 60 seconds, then turn it off again and run pt-query-digest on the generated log, printing a digest of the top offenders. Then I waited for the next occurrence, patiently watching sysstat output. As soon as system CPU time skyrocketed again, I ran my script and captured 60 seconds of full database traffic.
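For the record, the capture boils down to toggling the general log around a 60-second sleep; roughly the equivalent of the following (the log file path is just an example), with pt-query-digest run on the resulting file afterwards:

SET GLOBAL general_log_file = '/tmp/capture_60s.log';   -- example path
SET GLOBAL log_output = 'FILE';
SET GLOBAL general_log = ON;
-- the wrapper script sleeps 60 seconds here
SET GLOBAL general_log = OFF;
-- then: pt-query-digest --type genlog /tmp/capture_60s.log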

I was really surprised to find that in the sampled time, the top query by number of invocations was SHOW WARNINGS, with about 10,000  invocations per second.
Whoa!!! What the hell is that?  Why are production applications executing SHOW WARNINGS in the first place? And not only that, but the amount of them was INSANE!!

Well, it was now obvious that the server stalls were actually caused by a flood of SHOW WARNINGS. Not so obvious was the reason for such a large number of these statements arriving at our master.

Lo and behold, it turns out that there is a neat feature in Hibernate, the ORM that we use here, that automatically causes a SHOW WARNINGS to be issued after each and every statement that is executed. To enable it, it is sufficient to set the Hibernate log level to "warning". D'oh!!!

Of course, Hibernate's setting was the source of all troubles. We asked the devs for a quick fix and after bringing Hibernate's log level back to "error", the problem disappeared.

Still, the behaviour of MySQL when a large number of SHOW WARNINGS statements is received is not proper: it would be very easy for anyone to stall any server by just flooding it with these statements, which do not require any special privilege. I felt that this was a serious bug (after all, it had caused downtime in production for us), so I decided to find a way to reproduce it and to raise a bug as soon as I succeeded.

So I went to our 32-core test server and set up a sysbench test using this simple piece of Lua code:

pathtest = string.match(test, "(.*/)") or ""

dofile(pathtest .. "./sysbench/tests/db/common.lua")

function thread_init(thread_id)
   set_vars()
end

function event(thread_id)
   db_query("SHOW WARNINGS")

end

Then I ran sysbench in order to try and reproduce the stall.  The command line I used at first was the following:


./sysbench/sysbench --mysql-host=localhost  --mysql-user=sysbench --mysql-password=sbtest --mysql-db=sbtest --test=show_warnings.lua --max_requests=0 --max_time=20 --num-threads=30 --mysql-engine-trx=yes run

To my dismay, this did not produce the desired effect: there was no stall and the server was able to perform approximately 560,000 SHOW WARNINGS per second using 30 cores out of the 32 available on the server (keep in mind I was also running sysbench on the same machine).

This really had me scratching my head, as I was pretty sure that this statement was the root cause of our incidents. So I looked at the connection string used by the offending application and noticed that it had compression enabled. This was a common setup throughout the company, in place for historical reasons, to save bandwidth from the days when we were on the old non-gigabit infrastructure.

I decided to add compression to the sysbench test and try again:

./sysbench/sysbench --mysql-host=localhost  --mysql-user=sysbench --mysql-password=sbtest --mysql-db=sbtest --test=show_warnings.lua --max_requests=0 --max_time=20 --num-threads=30 --mysql-engine-trx=yes --mysql-compression=on  run

Needless to say, that did the trick!! System CPU time skyrocketed and used over 80% of the 32 cores of the test server. Now, this is absolutely weird - what can protocol compression possibly have to do with SHOW WARNINGS? Moreover, throughput fell from 560k to 48k qps.

I filed this as a bug on Percona's Launchpad long ago, but there has been very little activity on it, as you can see.

I feel that this is a serious bug, as Hibernate is widely used at MySQL-based shops, and there may be other ORMs with similar options that can flood poor MySQL with thousands of SHOW WARNINGS per second.

Since I have verified that this is an upstream bug, today I have filed it on bugs.mysql.com as well, hopefully it will receive the attention it deserves.

Sunday, September 25, 2016

Introducing BakaSQL, the MySQL web-based sandbox for query execution

BakaSQL: a bit of history

When I joined this new company I was already aware there was a lot of work to do. There was no DBA Team to speak of; any MySQL related task was handled by the sysadmins, and the company was growing as fast as you can imagine, with 15-20 new employees arriving each month, most of them in the IT department. For sure, there was much fun ahead.

During my first week in the new job I immediately felt that something was not right. I was receiving too many DMLs to execute each day, spanning a wide set of servers, schemas and tables. This was not something I had seen before, so I started asking around for the reason behind it. It turned out that there were multiple reasons:

  • there were known bugs in the applications that were never fixed
  • application configuration items were stored in MySQL, but there was no admin panel to modify them
  • accounting department had to apply corrections to the accounting records, which were also stored in MySQL, but again they had no tool to do this
  • lastly, most developers did not have write  privileges to production servers
You may ask yourself why, for example, bugs were not fixed, or admin panels were not built to allow modification of the config items, or why the accounting department did not have a tool to apply the needed corrections. Well, so did I. The answer lay in the start-up nature of the company, where all effort was spent on developing new features, and there was little or no time left for fixing bugs or building administrative tools. Although I did not agree at all with this approach, I had to live with it.

But spending most of the day executing DMLs on behalf of other teams was certainly not something I was happy with. Mind you, executing a DML can be a quick task, but backing up the old rows that were going to be changed by said DML was time consuming and rather frustrating. But of course having a rollback plan was mandatory, as it can happen that the DML you have executed is not what the requestor actually wanted to execute... In any case, in order to bring everything back under control, I had to revoke write privileges from the developers who had them, which of course added to the amount of DMLs that I had to handle on a daily basis. Ouch!

So there I was, in my third week at the new job, with DMLs still coming my way, while there was much more important stuff to work on: setting up proper backups, applying retention to data, consolidating, reviewing, improving performance... you name it. I had to find a way to speed up this DML execution thing, in order to save my time for something more important and, especially, more interesting and challenging.

Over the following days, I started working on a bash script meant to be called via the CGI interface, from a web browser. At the beginning it was a very basic script which accepted a database user and password and let you execute the DML that you pasted into the web form. The script would parse the DML and, before executing it, compute SQL statements that, if executed, would roll back the changes applied by the DML, saving these statements to a file. In case a rollback was needed, I could just open this file and execute the code contained therein to revert the change.
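To give an idea of the concept (a made-up example, not BakaSQL's actual output): for an UPDATE, the rows matching the WHERE clause are read first, and the inverse statements that restore the old values are written to the rollback file.

-- incoming DML
UPDATE customers SET status = 'inactive' WHERE id = 42;

-- before executing it, the script reads the current values...
SELECT status FROM customers WHERE id = 42;   -- say it returns 'active'

-- ...and saves the statement that would restore them
UPDATE customers SET status = 'active' WHERE id = 42;

-- similarly, an INSERT gets a DELETE by primary key as rollback,
-- and a DELETE gets an INSERT of the saved row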

Week after week, the small script improved, mainly because every time I received a DML that was not properly handled (read: the rollback code was not computed properly), I worked to improve it by having it handle the new case.

After a few months, the small script had become a large script, and was handling almost any DML that was sent to me just fine. Still, it was a bit on the slow side: parsing and building SQL using a bash script was not the best option; of course there are dozens of other languages that would be better suited for this task, but please bear with me, as I am not a dev, I am just a DBA, and although my development skills are there, they are rather old school. Anyways, I was receiving tickets with up to 2000 update statements, or 1500 inserts, or 3000 deletes. Crazy, I know, but I still had to do something in order to handle them without having to wait 20 minutes each time.

I decided to rewrite some of the functions of the script in C, in order to boost execution speed, which was especially important when there was a large number of statements like in the examples above. So I introduced a poor man's profiler into the script, which allowed me to see where the script was spending most of its time. After a week, I looked at the profiling data and rewrote some of the most frequently used and/or time consuming routines in C, so that invocations of those routines in the script were replaced by calls to an external executable containing the replacement routine.

The results were amazing, as now the script was able to execute even a large number of statements in a short time. Over the following months some bugs popped out here and there, most of them edge cases, and more features were added, for example proper handling of variables.

Fast forward to the present day, and the small-then-large script has become our everyday tool. I named it BakaSQL (after a notorious Indian demon well known for its insatiable appetite) and it is not only used by the DBA Team, but also by some developers, who can modify their config tables independently, in a quick and safe manner, without the need to raise a ticket to our team.

Welcome, BakaSQL!


So how does BakaSQL work? 

It connects to a MySQL instance using the supplied credentials, and executes the supplied DML statement(s), which should be delimited by the usual semicolon.

BakaSQL has a dry run feature built in, which should be used to check the DML statements before executing them and, especially, to verify that the generated rollback code is correct. You should first run your DML with the dry run checkbox ticked; then, if everything looks good, you remove the tick and execute again. When running in dry-run mode, the statements are effectively sent to the MySQL server; however, a ROLLBACK statement is issued at the end instead of a COMMIT. This ensures a proper validation of the entire DML, including duplicate keys, FK constraint violations, etcetera.
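In practice, what reaches the server is roughly equivalent to the following (a simplified sketch, with a made-up statement; the real tool does much more around it):

SET NAMES utf8;
START TRANSACTION;
-- ... your DML statements, executed one by one ...
UPDATE customers SET status = 'inactive' WHERE id = 42;
-- dry run: validate everything, then throw it all away
ROLLBACK;
-- real run: the tool issues COMMIT instead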

One word of caution: the DML statements are executed within a single transaction. This is to enforce an all-or-nothing modification of the data. If you have a large number of statements in your DML, e.g. 2000 updates, please keep in mind that all the affected rows will be locked for the entire duration of the execution, until the commit (or rollback), and this may have side effects if some other process wants to modify one of these rows in the meantime. This also applies when running in dry-run mode.

So, here is what BakaSQL can offer:
  • automatic generation of rollback scripts that will revert the changes applied by your DML when executed
  • confirmation that every query uses a proper index (avoid locking table scans!!)
  • dry run capability (check DML statements and compute rollback code without modifying the data)
  • logging of everything executed against the database, including reference ticket # if supplied
  • transaction consistency: your DML code is run in a single transaction so an all-or-nothing modification is enforced
Rollback code is generated and shown to you during dry run. Please see the screenshot below for reference. When you execute, the rollback code is saved to a file named after the username that is executing, the ticket information, and the BakaSQL instance pid number.

BakaSQL will only execute DMLs. No DDLs are supported (better use pt-osc for those!) and also you cannot run SELECTs through this tool, with the exception of variable initialization (see below).

BakaSQL enforces some restrictions on what you can pass as DML; some are intended to avoid situations that could cause harm to the database, while others are to avoid unnecessary complications and slowdowns in the script:
  • backticks (backquotes) are not allowed as these create trouble within bash, but there is a checkbox that you can use to wipe them out (check it when running in dry run mode)
  • USE statement is not allowed, table names should be qualified if more than one schema is in use in the DML
  • SET NAMES is not needed and should not be included, as the tool will run the transaction with SET NAMES utf8 by default
  • SHOW WARNINGS, COMMIT, ROLLBACK are not supported, we already run our own transaction anyways
  • comments (double dash at start of line) should be removed from the code 
  • strings should be enclosed in single quotes; double quotes are not supported as a string delimiter
  • INSERT IGNORE is not supported, so ask the requestor to make up his mind and decide whether he really wants that row inserted or not
  • auto_increment columns cannot be loaded with specific values, these are meant to be  populated by MySQL
  • WHERE conditions that have both an equality and an IN(), or multiple IN()s, are not supported (e.g.: where a = 1 and b in (2, 3) ) - these should be converted to all equalities
  • statements that use NOT in the WHERE condition are not accepted, as an index is rarely used in that case; these need to be rewritten to avoid the NOT operator
  • statements that use OR in the WHERE condition are not accepted for the same reason as above; these need to be converted to multiple statements instead
  • subqueries are not supported, so if there is a need to update some rows with the contents of some other table, CONCAT() can be used to generate updates by primary key that can be submitted to the tool (see the example right after this list)
  • setting values in UPDATE statements using functions is not supported
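As an example of the subquery workaround (made-up table and column names), instead of an UPDATE with a subquery you can let MySQL generate the equivalent per-primary-key statements and paste the generated output into BakaSQL:

-- not accepted by BakaSQL:
-- UPDATE orders o SET o.discount = (SELECT d.value FROM discounts d WHERE d.id = o.discount_id);

-- run this on the database instead, and submit its output to the tool:
SELECT CONCAT('UPDATE orders SET discount = ', d.value,
              ' WHERE id = ', o.id, ';')
FROM orders o
JOIN discounts d ON d.id = o.discount_id;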
There are also a few important limitations that you should be aware of when using the tool; some are listed above already, but I want to stress their importance, so...
  • BakaSQL only recognizes the single quote as a string delimiter, to make parsing simpler and faster; therefore, you need to properly escape single quotes in your strings if there are any
  • BakaSQL does not like backticks (backquotes), as these create problems within the bash script; however, there is a checkbox that you can use to quickly remove all of them from the DML before execution. If the damn developer likes to use MySQL reserved words as column names, you can check the backtick removal checkbox, run a dry run to wipe them all, then re-add the backticks only around the reserved words used as column names; that will work and allow you to proceed
  • sometimes, when you receive DML through a ticket, there is some weird character in it. I have verified that Microsoft Word, and a few other editors, like to let the user type UTF-8 characters that look exactly the same as their ASCII counterparts, but are actually different. Specifically, the double quote, the single quote and even the blank (space) character may have been replaced by some bizarre UTF-8 counterpart that BakaSQL will not digest. If you see weird behaviour when executing BakaSQL, check your DML code for those substitutes by copying it into a 7-bit editor first
  • if the last statement in the DML does not get executed (or if the only statement in a single-statement DML does not get executed), it is likely that the semicolon that should delimit the query (or queries) is missing; please check that out!
Also, it is worth mentioning that there are a couple of edge cases where BakaSQL is not able to properly create rollback code:
  • the tool will not support changing (part of) a primary key with an update statement, as rollback code will be incorrect in this case; you will get an error if your DML tries to do that
  • if, in an update statement, the WHERE condition contains one of the columns touched by the update, rollback may be incorrect; BakaSQL will try to fix simple cases for you, but remember to always check the rollback code.... if this situation is detected, you will see a warning right before the rollback code pertaining to the offending statement
Although these limitations may seem too much of a burden at first, I can assure you that once your dev team gets acquainted with them, everything will work quite smoothly.


Variables


BakaSQL supports variables in the DML that is passed to it. Variable substitution happens in a recursive way, so it is also possible to have variables that reference other variables.  

The syntax for variable initialization is similar to standard MySQL, but only a specific form is supported by BakaSQL. See the following examples:

set @myvar1 =  (select val from rick.t1 where val=12121 and id = 1021);
set @myvar2 =  (select val from rick.t1 where id = 1385);
set @myvar3 =  (select 666);
insert into t1 (val) values (@myvar1);
insert into t1 (val) values (@last_insert_id), (@myvar2), (@myvar3);

Basically, the only way to initialize a variable is:

set @variable = (select statement that returns one row);

but it is possible to assign one variable to another, using the same syntax, as well as to use a variable anywhere in a query that is part of your DML, as long as the variable has been initialized beforehand. Please note that you cannot reuse (reassign) variables, so you have to use a new variable name each time you initialize one.

A special variable exists, called @last_insert_id. This variable is automatically initialized when you insert a row into a table which has an auto increment primary key, and you can use it like any other variable in your DML. For this reason, the LAST_INSERT_ID() function is not supported by the tool: it is built in and automatically executed on your behalf if the tool detects that there is an auto increment column in the table that you are inserting into.

Installation


At this time, there is no packaging for BakaSQL (I think I already told you I'm a lazy guy), but its installation should be straightforward for anyone with basic Linux/UNIX skills. Basically, here is what you need to get it running:
  • a Linux/UNIX server (don't ask me about Windoze please)
  • Apache httpd (or nginx, or your preferred web server that supports CGI specs)
  • gcc compiler, to compile the BakaSQL utility
Installation requires that you know where your cgi-bin directory is. Place the bakasql.cgi script into it, make it executable by the web server, and edit it to tailor the following variables that you find at the top of the script:
  • HOSTFILE should point to a file that contains the list of the hostnames of your masters, one per line
  • BASE should point to a folder where BakaSQL will save logs, rollback code and other information; it should be writable by the user that the webserver runs under
  • MIN_REQ_CARDINALITY is the minimum cardinality, from 0 to 100, that you require for an index to be considered good by BakaSQL; sometimes, especially for small tables, you will get a false negative about the cardinality of an index; please see Ninja Mode below for a workaround
  • BAKAUTILS is the path to the bakautils C program, see below
There are a few other variables that refer to the MySQL profiling instance; you can safely ignore these and comment out the profiling code that you will find in the profile_out() routine (just comment out the insert that you'll find there).

You also need to compile the bakautils.c program that contains the rewritten routines. You will need gcc for that purpose, just use the following command to compile the program:

      gcc -O3 -o bakautils bakautils.c

Once compiled, place the executable in the $BAKAUTILS path and ensure it is executable by the user your web server is running under.

Each user that wants to use BakaSQL needs MySQL privileges granted from the IP address of the web server that BakaSQL runs on. The classic INSERT, DELETE, UPDATE, SELECT privileges are enough to use BakaSQL; no need to assign SUPER or other fancy privileges.
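For reference, a grant along these lines is enough (schema, user, password and web server IP are placeholders):

GRANT SELECT, INSERT, UPDATE, DELETE ON appdb.*
  TO 'jdoe'@'10.0.0.10'          -- the IP of the web server hosting BakaSQL
  IDENTIFIED BY 'secret';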

Make sure you create the $BASE/log and $BASE/rollback folders, as I don't think that BakaSQL will create these for you :-) These should be writable by the user the web server runs under. The former will contain log files (normal log and mysql debug log) and the latter will contain the rollback files for each DML successfully executed by BakaSQL. You will need to apply your own retention to these, as they are not deleted by BakaSQL.

When you are unsure what's going on (e.g. BakaSQL seems to be hung during execution of a DML), the mysql debug log is a good place to look. It will contain a copy of everything that is sent by BakaSQL to your MySQL server, including server responses. If it seems stuck, check your MySQL server for locks using SHOW ENGINE INNODB STATUS.

Ninja Mode


Sometimes BakaSQL will refuse to execute your DML code, saying that there is no valid index, but an option called Ninja Mode will appear, with a checkbox you can tick to override the limitation. This can be used in those situations where the cardinality of the index is low just because the table is really small. Just ensure that the table is actually a small one, then tick the Ninja Mode checkbox and run the execution again to override the cardinality check.


Where to get BakaSQL

You can always fetch the latest version of BakaSQL from my GitHub page here. Be sure to give me your feedback if you try it out!

Caveats


Although this tool has become an everyday tool for us, and is therefore stable and well tested, standard disclaimer applies. Your mileage may vary. Bugs may be there. Be sure to always check the rollback code using dry run before executing the changes. If you find bugs and/or have questions, I will be glad to help, just contact me using the form on the right side. Also, your comments are welcome!  I really hope this tool will help you as much as it helped me.

Enjoy!!





Saturday, September 24, 2016

Replication from MySQL 5.6 to MariaDB 10 may fail when using fractional seconds

One of the most interesting features introduced in MariaDB 10 was without doubt multi-source replication, i.e. the ability for a slave to have multiple masters.

Tired of hearing complaints from developers who couldn't join tables because they were on different servers, I decided to give MariaDB a try to see if I could leverage this neat feature.

At the time, we had 5 main clusters, classic multi-master active/standby configuration, with some slaves under each of them. I wanted to create a "super slave" that would contain the dataset from all the five clusters, so that developers could connect to it and join at will.

The initial creation of the MariaDB superslave was easy. After installing the binaries, I just bootstrapped it with an xtrabackup copy of our main cluster and set up replication. All went just fine as expected.

Suddenly I realized that I couldn't use xtrabackup to bring the datasets from the other masters into the superslave, as the neat xtrabackup tool (which I love) only works at the tablespace level. So, armed with lots of patience and time, I imported each dataset into the superslave using mysqldump (which can also save the master position) and, after approximately two weeks, I had the superslave running with an impressive (at least for me) 7 TB dataset.

Everything ran just fine for some days, then suddenly I started getting replication errors for queries coming from one particular master. Basically, the slave couldn't find a row that was deleted just fine on the master. My first aid response was to run pt-table-checksum (yes, you can run that with a multi-source slave if you are careful) and I was surprised to find hundreds of chunks with differences for that table, considering that we'd had no failures whatsoever since the superslave was bootstrapped.

I decided to sync the table anyways. 

To my surprise, another checksum run shortly after the sync still reported a large number of differences. This was really head scratching, especially considering that the problem only appeared on one specific table out of the thousands we had imported from the various clusters.
I started to suspect that some odd bug was biting me and that this table, for some reason, was the only one triggering it. It was your average small logging table, nothing fancy, but I noticed that there was a datetime column as part of the primary key. Still, by the look of it, I had no idea what could cause the inconsistency between the master and the slave for this particular table.

I decided to peruse the binary logs to see what was passed from master to slave. 
At that time, we were using MIXED replication format, which is by the way the default. 
Using mysqlbinlog -v I printed out INSERT statements from one of the most recent binary log files from the master, with the intention of comparing the values passed via replication with the actual values as found on the MariaDB slave.

I grepped the mysqlbinlog output for the INSERT statements on this table, and noticed that all of them contained fractional seconds. This is actually quite common with Java ORMs and libraries; they like to pass millisecond information. However, I knew that DATETIME and TIMESTAMP columns in MySQL do not save this information by default: you have to create the column with the desired precision for that, otherwise the fractional part is discarded.

At least, this is what I believed before actually comparing rows of this table between the master and the slave...

Comparing the MySQL master and the MariaDB slave, I found that, for some of the rows, there was a one-second difference in the datetime column, which was part of the multicolumn primary key.
That explained the replication errors: of course the PK values on master and slave were, in fact, different and, as a result, delete statements on the master couldn't find the row on the slave!

It was easy enough to correlate this data drift with a different behaviour between the MySQL master and the MariaDB slave regarding fractional seconds handling: by looking at some of the rows on both master and slave, it was clear that the difference only existed when the fractional part was greater than 500 milliseconds - the master was rounding up to the next second, while the slave was always truncating.

A quick test confirmed the above:
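(The test table is just a single DATETIME column without fractional precision; something along the lines of the following is enough to reproduce it.)

CREATE TABLE test_datetime (dt DATETIME);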

master>insert into test_datetime  values ('2015-10-05 10:00:00.501');
Query OK, 1 row affected (0.00 sec)
master>select * from test_datetime;
+---------------------+
| dt                  |
+---------------------+
| 2015-10-05 10:00:01 |
+---------------------+
1 row in set (0.00 sec)

MariaDB slave [rick]> select * from test_datetime;
+---------------------+
| dt                  |
+---------------------+
| 2015-10-05 10:00:00 |
+---------------------+
1 row in set (0.00 sec)

D'oh!

I realized then that the problem, although only showing up in this unfortunate situation, was actually much bigger than that, as Java was the main technology used in the Company and I was pretty sure that almost all inserted datetimes and timestamps contained the damn milliseconds!

Some googling suggested that, although MariaDB 10 had some new and neat features backported from 5.7, and almost all the features of 5.6, it was actually based on MySQL 5.5 - and fractional seconds rounding in MySQL was only introduced in 5.6.4.

Anyways, I raised a bug to MariaDB back in 2015 when I first discovered this, as it should at least be clearly documented that replicating from MySQL 5.6 into MariaDB 10 will lead to inconsistent slaves if fractional seconds are used.  You can check the status of the bug yourself...

I have actually found a workaround for this bug, which basically consists of switching to ROW-based replication. When using ROW-based replication, the fractional seconds issue is no longer in play, as the binary log contains the actual row content from the master instead of a statement to insert the row values.
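On the master this is just a matter of changing binlog_format (and persisting it in my.cnf so it survives a restart); something like:

-- on the MySQL 5.6 master; also add binlog_format = ROW to my.cnf
SET GLOBAL binlog_format = 'ROW';
-- note: sessions that are already connected keep the old format until they reconnect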

My experience is based on MariaDB 10.0.21, but judging by the attention that the above bug has received, I believe this may still be valid today. 

Of course, it goes without saying that I had to spend 2 weeks running table checksum/table sync on all 7 TB of data in order to bring the superslave back to consistency, after setting replication format to ROW.  But it has been running flawlessly since.


Friday, September 23, 2016

Moving binary logs to a different filesystem without downtime

Another day at the office...

"Whoa, the write workload on our statistical cluster has  suddendly increased by 20% and the filesystem that holds the binary logs is no longer large enough".

Of course, I had warned the boss about this possibility when I received those servers with that tiny 250G filesystem for binlogs, but my red flag was just ignored as usual.

So here we are: presto, I get this shiny new 600G LUN, but we need to stop the damn MySQL server in order to repoint the log_bin variable to the new storage area.

Dunno about you, but the idea of waking up at 2am to just perform a variable change is not something that makes me particularly happy. Not to mention the maintenance period that is needed around it....

So, I decided to investigate a bit about the possibilities to do such change without stopping the service.

As we all know, the log_bin variable is read-only and to change it, you need to restart mysqld. So I wanted to check whether there was a way to actually swap the current binlog directory with another one without making the server mad, losing data, or causing downtime.

First and foremost I wanted to check how the server would react to a rename of the binary log folder. So I just issued the following:

mv /db/binlog /db/binlog_orig

I was very happy to find out that mysqld kept running, and kept writing to the currently open file, without any complaints. I decided then to push the test further by linking a new folder to the old path and then switching to the next binlog.

So I created a symlink to the new binlog area first:

mkdir /db/binlog2
ln -s /db/binlog2 /db/binlog

I then connected to MySQL and issued a FLUSH BINARY LOGS. To my surprise, MySQL happily closed the current binary log (still being written in the original folder) and started a new one in the new folder. Oh,  and the index file was recreated there automatically too.  Cool :-)

Still, something was not right, as the operation resulted in the binary log file number being reset to 1. Not cool if you have slaves (which you likely do, since you have binary logging enabled).

So I had to guess how mysqld decides what the number of the next binlog file should be. My first guess was that it looks at the other files already in place. So I repeated the steps above, but before actually issuing the flush logs command, I created an empty file in the new binary log folder with the same number as the one currently being written to in the old folder.
Voilà, that did the trick! The new binlog file was now correctly numbered :-)
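From the MySQL side, the switch and the resulting numbering can be checked with a couple of statements (the file name here just follows the example further below):

FLUSH BINARY LOGS;
SHOW BINARY LOGS;   -- the newest file should continue the old sequence (e.g. dbstat01.016654)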

But what about slaves?  I did all these tests with no slave connected.
So I quickly set up a slave instance and set up replication, then I retried my binary logs path exchange trick.

Right after the binary log file switch, the slave stopped, complaining that it couldn't read the next file. Basically, it had finished reading the log written in the old path, but could not "see" the next one written in the new path. I was able to verify that the old log was entirely read, and just the switch to the next one had failed:


[ERROR] Error reading packet from server: could not find next log; the first event 'dbstat01.014676' at 103351600, the last event read from '/storage/binlog/dbstat01.016653' at 136744026, the last byte read from '/storage/binlog/dbstat01.016653' at 136744026. (server_errno=1236)


That position was exactly where the file, in the old path, was closed due to the flush:

-rw-rw---- 1 mysql mysql 136744026 Sep 23 14:30 dbstat01.016653

Okay, I thought, I can just repoint the slave to the next binlog, position 4.
But the fact that the slave complained means the IO thread was probably still looking in the old path. What if I just stopped the slave and started it again?
Pronto: STOP SLAVE; START SLAVE; ...and the slave resumed replication from the new binlog file in the new path just fine!

So happy that I saved myself a nightly maintenance window!
This was tested on Percona Server 5.6.31-77.0-log and previous versions, but I don't see why it shouldn't work on stock Oracle MySQL or even on MariaDB.

If you are lazy like me, you can use the simple script that I crafted for this test; you can find it on my GitHub page.