Monday, November 6, 2017

when xtrabackup fails with 'read' returned OS error 0 after upgrading to 5.7

Here's something that has puzzled me for several weeks.
Right after migrating MySQL from 5.6 to 5.7, we started experiencing random xtrabackup failures on some, but not all, of our slaves.
The failures were only happening when taking an incremental backup, and the backup would always fail on the same table on each slave, with errors similar to the following:

171106 13:00:33 [01] Streaming ./gls/C_GLS_IDS_AUX.ibd
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: 262144 bytes should have been read. Only 114688 bytes read. Retrying for the remaining bytes.
InnoDB: Retry attempts for reading partial data failed.
InnoDB: Tried to read 262144 bytes at offset 0, but was only able to read 114688
InnoDB: File (unknown): 'read' returned OS error 0. Cannot continue operation
InnoDB: Cannot continue operation.


Searching the web didn't yield any useful results.
The error was happening on one slave but not on another, even though the OS, MySQL and xtrabackup versions were identical:
  • CentOS release 6.8 (Final)
  • xtrabackup version 2.4.8 based on MySQL server 5.7.13 Linux (x86_64) (revision id: 97330f7)
  • mysqld  Ver 5.7.19-17 for Linux on x86_64 (Percona Server (GPL), Release 17, Revision e19a6b7b73f)
To properly identify the table where the backup was failing, I had to set parallelism to 1 (the --parallel=1 option of xtrabackup). Examining the failing table with SHOW CREATE TABLE didn't show anything in particular: the backup was failing on different tables and schemas, and the tables were all different from each other, with nothing special about them, or at least nothing that would give me a hint.
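For reference, the incremental run I was debugging looked roughly like the sketch below. The directories and credentials are placeholders for my own setup, not anything prescribed by xtrabackup, so adjust them to your environment:

  # Incremental backup with parallelism forced to 1, so the failing table
  # is the last one reported in the log right before the error.
  xtrabackup --backup \
      --user=backup --password=... \
      --target-dir=/backups/inc1 \
      --incremental-basedir=/backups/base \
      --parallel=1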

I even tried to dump the offending schema from a failing slave and import it into a non-failing slave, but on the non-failing slave the same table in that schema backed up just fine.

This went on for a while, and I simply ran without incremental backups on the migrated slaves, until today, when it happened again on yet another slave we had just migrated, and I had an intuition: since the incremental backup only started failing after the migration, it had to be related to the migration itself. So I started suspecting some incompatibility between MySQL 5.7 and/or xtrabackup and tablespaces created with MySQL 5.6.

Verifying my suspicion was as easy as running:

  ALTER TABLE C_GLS_IDS_AUX ENGINE=InnoDB;

Et voilà! The rebuilt tablespace was not crashing the incremental backup anymore.
Eventually, another table(space) would crash the backup again later, but running a null ALTER TABLE on it and relaunching the backup would fix that too. Wash, rinse, repeat until the incremental completes, and you're done.
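If you would rather not play whack-a-mole one table at a time, a small loop can rebuild every InnoDB table in a schema up front. This is only a sketch against a hypothetical gls schema, with credentials assumed to come from ~/.my.cnf; a rebuild rewrites and locks each table, so keep it for a maintenance window:

  # Null-alter (rebuild) every InnoDB table in the gls schema.
  for t in $(mysql -N -B -e "SELECT table_name FROM information_schema.tables
                             WHERE table_schema='gls' AND engine='InnoDB'"); do
    echo "Rebuilding gls.$t"
    mysql -e "ALTER TABLE gls.\`$t\` ENGINE=InnoDB"
  done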

Hope this helps someone save some time.
And if you are looking for xtrabackup-based backup scripts, feel free to check my own set here on my GitHub page.


ADDENDUM
 
Looks like I have been a little overenthusiastic about this.
In fact, the problem reappeared the day after writing this blog post - apparently, the null ALTER TABLE only fixes it temporarily.

It turns out this has to be some incompatibility between xtrabackup 2.4.x and MySQL 5.7.x, as I was able to reproduce the issue. Please see this bug if you are interested in the details.