Tuesday, January 10, 2017

Galera Cluster: adding a primary key to a MySQL table which lacks it... without downtime

OK, let me start by saying that a table without a primary key shouldn't be something a DBA ever stumbles into.  Sure, InnoDB will secretly add one for us without telling us - but that's not something we can use at application or administration level, it's just there to keep the engine happy.

So let's suppose you find some tables that lack a primary key - of course you need to do something about it, right?  Now, put Galera Cluster in the mix: Galera does not support tables without a primary key and can, again silently, make such tables inconsistent at cluster level.
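
If you want to hunt these tables down, a query along these lines against information_schema will list the offenders (just a sketch - adjust the schema filter to your environment):

-- base tables with no PRIMARY KEY constraint (sketch, not from the original post)
SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints c
    ON c.table_schema = t.table_schema
   AND c.table_name = t.table_name
   AND c.constraint_type = 'PRIMARY KEY'
 WHERE t.table_type = 'BASE TABLE'
   AND c.constraint_type IS NULL
   AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');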

You need to fix the damn table(s)!! And this is where the fun begins... you can't afford downtime for the operation, so you need to resort to an online schema change of some sort.

Galera Cluster is a very nice HA solution and I personally have embraced it for a few years now. I really love it, especially its ability to self-heal when something happens (and when you're in the cloud, there's always something happening). Galera is very different from classic MySQL replication, the biggest difference being that it replicates synchronously instead of asynchronously. I won't get into the details of how Galera works, as there are plenty of blog posts and very good documentation online.  But there is something important you need to be aware of when running Galera: in the standard operating mode of the cluster, any DDL that you execute will freeze the entire cluster for the whole duration of the statement. This is called "Total Order Isolation" (TOI), and it is the default setting for the cluster.
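
If you're not sure which mode your nodes are running in, a quick look at the variable will tell you:

-- shows the current schema upgrade method ('TOI' is the default)
SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';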

Guess what. Months after migrating one of our clusters from a classic master-master topology to Galera Cluster, I suddenly found myself needing to clean up some old and barely used schemas, which contained quite a few tables lacking a primary key (or any index, for that matter). My bad: I had spent a lot of time checking the main schemas and tables and making sure they were optimized for Galera, but I left the unimportant ones behind and never went back to check on them.

So, here I am with my apparently simple (although boring) task - I need to add a primary key to a set of tables, without downtime and without blocking the cluster as a whole in the process.
Preliminary checks with the devs confirmed that all INSERTs going into the tables I needed to fix used an explicit field list, so the change was compatible (it is important to remember to check this, in order to avoid application failures).

So, the first thing I tried was to just add the PK using the great pt-online-schema-change tool from the Percona Toolkit, which supports Galera Cluster just fine.  Unfortunately, I was quickly reminded (by an error message) that the tool cannot operate if the table lacks a primary key or unique index, except when the column that you want to become the primary key is already present in the table - which was not my case. D'oh!

Failing that, I thought about running the ALTER statement directly, without using an online schema change tool.  Of course, this would have to be done on all nodes, one node at a time, after changing the wsrep_OSU_method global variable of the cluster to 'RSU'.  RSU stands for "Rolling Schema Upgrade": in this mode, any DDL that you execute only applies to the local node and is not propagated to the others; the node is evicted from the cluster for the duration of the operation and rejoins afterwards, but again, the DDL does NOT replicate.
So you have to run it on each node in order to complete the change.  Boring, and not very practical, right?  It may be fine for adding an index to a write-heavy critical table, but certainly not for changing a table definition... especially if you have many tables to change!  But I was prepared to do it at any cost anyway (I did feel guilty for not checking this before the migration), so I went to one of the nodes, changed the OSU method to RSU, and executed the following DDL:


ALTER TABLE myschema.mytable ADD id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
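
For completeness, the switch around the statement on that node is just this (a sketch of the per-node sequence - run it on one node at a time, and restore the default when you're done there):

-- on the node being altered only: the DDL below will NOT replicate
SET GLOBAL wsrep_OSU_method = 'RSU';
-- ... run the ALTER above ...
-- back to the default once the local change is done
SET GLOBAL wsrep_OSU_method = 'TOI';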

The operation completed OK but, shortly thereafter, I lost the connection to the node (the infamous MySQL has gone away...).  WTF?!?  I checked and the mysqld process was dead.  While losing one node is not a big deal in our 3-node setup, I did not expect the node to die as a consequence of a DDL.  A quick look at the error log, though, and I immediately realized what the problem was:

2017-01-10 13:53:06 16667 [ERROR] Slave SQL: Column 0 of table 'myschema.mytable' cannot be converted from type 'char(3)' to type 'int(11) unsigned', Error_code: 1677
2017-01-10 13:53:06 16667 [ERROR] WSREP: Failed to apply trx 760719957 4 times
2017-01-10 13:53:06 16667 [ERROR] WSREP: Node consistency compromized, aborting...

So what had just happened? Galera replicates between nodes using ROW based replication - this means that the row data gets replicated, instead of the statement that inserts or updates it. This is done for many good reasons (compatibility with non-deterministic statements, for example). But in my situation, this was biting me hard: a row inserted by the application on one of the other cluster nodes reached my node, where the table structure had been altered; the local node found that the first field of the row was of a different type, and failed to apply it. In order to preserve consistency, after a few unsuccessful retries, Galera aborted the MySQL server on the node.

Epic fail....

But I still needed to fix the damn tables. While the node was being (automatically) rebuilt, I spent some time thinking about a way to complete this task.  Replication had failed because the first column of the modified table was of a different type than the first column of the table on the other nodes. Hmmm, what about moving the new PK column to the end of the table instead of putting it at the top, I thought. Ugly for sure, but I was curious to see whether replication would work correctly in that case. In theory, columns are matched by position in RBR, so replication should not complain after all.
Pronto, as soon as the node came back in service I set it once again in RSU mode and executed my DDL again:

ALTER TABLE myschema.mytable ADD id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;

Note that the only difference here is that this time I omitted the FIRST clause. This causes the new column to be added at the bottom of the table, as the last column.
After executing the DDL I anxiously watched the table, waiting for a replicated row to arrive... and I was really happy to see that rows were coming in without problems now. Yay!
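
If you want to double-check where the column ended up on that node, a plain SHOW CREATE TABLE is enough (just a verification step on my part):

-- verify that the new id column is now the last one on this node
SHOW CREATE TABLE myschema.mytable\G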

But wait, how could I now repeat these steps on the other nodes in a consistent way? The shiny new primary key was an auto_increment one (and it had to be, because the application couldn't populate it for me - it didn't even know the column existed), but if I were to repeat the above DDL on the other nodes, I would get different PK values inserted in each row: Galera Cluster uses different auto_increment offsets on each node by design (something we also use in classic master-master setups, by the way).  One option could have been to play with the global variables and change the increment manually on each node before executing my DDL, then put it back to what it was... but that felt too dangerous and error prone, and also a bit complicated to repeat for each and every table that I had to fix.
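
For reference, these are the variables involved - Galera assigns a different increment/offset pair to each node automatically when wsrep_auto_increment_control is ON, which is the default (shown here just as a pointer, not something from the original workflow):

-- each node gets its own offset/increment, hence different PK values per node
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('wsrep_auto_increment_control', 'auto_increment_increment', 'auto_increment_offset');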

And, last but not least, I wanted the damn PK to be the first column in the table (yes, I'm picky).

But wait again - now that I've got a PK on this table, maybe pt-online-schema-change could do the hard work for me? pt-osc normally works on the master: it creates a copy of the original table with the desired changes, and populates it by writing chunks of rows read from the original table. The original table is never written to, and is not checked on the other nodes.  So it shouldn't matter that the original table lacks a PK on the other nodes - I figured the tool should be able to complete my change.

I modified the wsrep_OSU_method and put it back to the default value of 'TOI', so that all changes would be propagated (remember that pt-osc executes DDLs to swap tables at the end of the copy process).  I fired the tool with the following --alter option:


modify column id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT FIRST

This basically told the tool to alter the table, and just move the id column to the top.
pt-osc diligently created a copy of the table with the PK moved to the first column, and filled it with all the data from the original table that I had previously altered; then it swapped the original and the new table.
That did the trick, since all the operations performed by pt-osc got propagated to all nodes of the cluster via replication. When pt-osc completed, the table was consistent on all nodes, with a brand new auto_increment PK on top.
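
For reference, the full invocation looked more or less like this (just a sketch: host, user, schema and table names are placeholders, and a --dry-run first is always a good idea):

# sketch only - point it at the node taking the writes
pt-online-schema-change \
  --alter "modify column id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT FIRST" \
  --execute \
  h=localhost,u=dba_user,D=myschema,t=mytable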

Mission accomplished!!

Bottom line: remember to always add primary keys to your tables...


ADDENDUM:

There is an important step that I forgot to mention in the above blog post (please bear with me).  Of course, the node that you set to RSU, and where you run the ALTER statement(s), should not be receiving any write traffic during the operation, to avoid stalls.

But the really important step is this: after you ALTER all your tables on the above node, and *before* you start running your pt-online-schema-change session(s), you need to make this node your only master (all writes need to go to this node!) until you're done with pt-osc.
The reason is that the triggers pt-osc creates to keep your new table up to date while it runs will fail on the other nodes, because they reference the column you have just added, and that column is not present on the other nodes yet.

Beware: if you don't follow the above step, all the changes performed by the applications on the table(s) while pt-osc runs will be lost.
Make sure these changes only hit the node where you run pt-osc, and everything will be safe.