Sunday, September 25, 2016

Introducing BakaSQL, the MySQL web-based sandbox for query execution

BakaSQL: a bit of history

When I joined this new company I was already aware there was a lot of work to do. There was no DBA Team to speak of; any MySQL related task was handled by the sysadmins, and the company was growing as fast as you can imagine, with 15-20 new employees arriving each month, most of them in the IT department. For sure, there was much fun ahead.

During my first week in the new job  I immediately felt that something was not right. I was receiving too much DMLs to execute each day, spanning a wide set of servers, schemas an tables. This is not something I had seen before, so I started asking around what the reason behind it was.  It turned out that there were multiple reasons behind it:

  • there were known bugs in the applications that were never fixed
  • application configuration items were stored in MySQL, but there was no admin panel to modify them
  • accounting department had to apply corrections to the accounting records, which were also stored in MySQL, but again they had no tool to do this
  • lastly, most developers did not have write  privileges to production servers
You may ask yourself why, for example, bugs were not fixed, or admin panels were not built to allow for modification of the config items, or why the accounting department did not have a tool to apply the needed corrections.  Well, so did I. The answer lied in the start-up nature of the company, where all effort were spent in developing new features, and there was little or no time left for fixing bugs or building administrative tools.  Although I did not agree at all with this approach, I had to live with it.

But spending most of the day executing DMLs on behalf of other teams certainly was not something I was happy with. Mind you, executing a DML can be a quick task, but backing up old rows that were going to be changed by said DML was time consuming and rather frustrating. But of course having a rollback plan was mandatory, as it can happen that the DML you have executed is not what the requestor actually wanted to execute... In any case, in order to bring everything back under control, I had to revoke write privileges from the developers who had them,  and this of course added up to the amount of DMLs that I had to handle on a daily basis. Ouch!

So I am in my 3rd week at the new job and the DMLs keep coming my way,  while there was much more important stuff to work on, like for example setting up proper backups, apply retention to data, consolidate, review,  improve performances... you name it.  So I had to find a way to speed up this DML execution thing, in order to save my time for something more important and especially more interesting and challenging.

In the next days, I started working on a bash script that was meant to be called via the CGI interface, from a web browser. At the beginning it was a very basic script which accepted user and password for the database and allowed to execute the DML that you pasted in the web form.  The script would parse the DML and, before executing it, would compute SQL statements that, if executed, would roll back the changes applied by the DML, saving these statements on file. In case a rollback was needed, I could just open this file and execute the code contained herein to revert the change.

Week after week, the small script improved, mainly because every time I received a DML that was not properly handled by the script (read: the rollback code was  not computed properly), I worked to improve it by having it handle the new DML. 

After few months, the small script had become a large script, and was handling almost any DML that was sent to me just fine. Still it was a bit on the slow side: parsing and building SQL using a bash script was not be the best option, of course there are dozens of other languages that would be more suited for this task, but please bear with me, as I am not a dev, I am just a DBA and although my development skills are there, they are rather old school. Anyways, I was receiving tickets with  up to 2000 update statements, or 1500 inserts, or 3000 deletes. Crazy, I know, but still I had to do something in order to handle them without having to wait 20 minutes each time.

I decided to rewrite some of the functions of the script using C, in order to boost execution speed, which was especially important when there was a large number of statements like in the examples above. So I introduced a poor man's profiler in the script, which allowed me to see where the script was spending most of its time.  After a week, I looked at the profiling data and rewrote some of the most frequently used and/or time consuming routines in C,  so that invocation of such routines in the script were replaced by a call to an external executable, which contained the replacement routine. 

The results were amazing, as now the script was able to execute even a large amount of statements in short time.  Over the next months some bugs popped out here and there, most of them being edge cases, and more features were added, for example proper handling of variables.

Fast forward to present day, the small then large script has now become our everyday tool. I named it BakaSQL (from a notorious Indian demon who is well known for its insatiable appetite) and it is not only used by the DBA Team, but also by some developers which can modify their config tables independently and in a quick and safe manner without the need to raise a ticket to our team.

Welcome, BakaSQL!


So how does BakaSQL work? 

It connects to a MySQL instance using the supplied credentials, and executes the supplied DML statement(s), which should be delimited by the usual semicolon.

BakaSQL has a dry run feature built in, which should be used to check the DML statements before executing them and, especially, to verify that the generated rollback code is proper. You should first run your DML with the tick on the dry run checkbox, then if everything looks good, you remove the tick from the dry run checkbox and execute again. When running in dry-run mode, the statements are effectively sent to the MySQL server, however, a ROLLBACK statement is issued at the end, instead of a COMMIT; this ensures a proper validation of the entire DML, including duplicate keys, FK constraint violations, etcetera.

One word of caution: the DML statements are executed within a single transaction. This is to enforce an all-or-nothing modification of the data. If you have a large number of statements in your DML, e.g. 2000 updates, please keep in mind that all the affected rows will be locked for the entire duration of the execution, until the commit (or rollback), and this may have side effects if some other process wants to modify one of these rows in the meantime. This also applies when running in dry-run mode.

So, here is what BakaSQL can offer:
  • automatic generation of rollback scripts that will revert the changes applied by your DML when executed
  • confirmation that every query uses a proper index (avoid locking table scans!!)
  • dry run capability (check DML statements and compute rollback code without modifying the data)
  • logging of everything executed against the database, including reference ticket # if supplied
  • transaction consistency: your DML code is run in a single transaction so an all-or-nothing modification is enforced
Rollback code is generated and shown to you during dry run. Please see the screenshot below for reference. When you execute, the rollback code is saved to a file named after the username that is executing, the ticket information, and the BakaSQL instance pid number.

BakaSQL will only execute DMLs. No DDLs are supported (better use pt-osc for those!) and also you cannot run SELECTs through this tool, with the exception of variable initialization (see below).

BakaSQL enforces some restrictions on what you can pass as DML; some are intended to avoid situations that could cause harm to the database, while others are to avoid unnecessary complications and slowdowns in the script:
  • backticks (backquotes) are not allowed as these create troubles within bash, but there is a checkbox that you can use to wipe them out (check it when runnng in dry run mode)
  • USE statement is not allowed, table names should be qualified if more than one schema is in use in the DML
  • SET NAMES is not needed and should not be included, as the tool will run the transaction with SET NAMES utf8 by default
  • SHOW WARNINGS, COMMIT, ROLLBACK are not supported, we already run our own transaction anyways
  • comments (double dash at start of line) should be removed from the code 
  • strings should be enclosed in single quotes; double quotes are not supported as string delimitator
  • INSERT IGNORE is not supported so ask the requestor to make up his mind up and decide whether he really wants that row inserted or not
  • auto_increment columns cannot be loaded with specific values, these are meant to be  populated by MySQL
  • WHERE conditions that have both an equality and an IN(), or multiple IN()s, are not supported (e.g.: where a = 1 and b in (2, 3) ) - these should be converted to all equalities
  • statements that use NOT in the WHERE condition are not accepted as index is rarely used in that case, these need to be converted  to avoid the NOT operator
  • statements that use OR in the WHERE condition are not accepted for the same reason as above; these need to be converted to multiple statements instead
  • subqueries are not supported, so if there is need to update some rows with the contents of some other table, CONCAT() can be used to generate updates by primary key that can be submitted to the tool
  • setting values in UPDATE statements using functions is not supported
There are also a few important limitations that you should be aware of when using the tool, some are listed above already but I want to stress about their importance so...
  •  BakaSQL only recognizes the single quote as a string delimiter, this is to make parsing simpler and faster; therefore, you need to properly escape single quotes in  your strings if there are any
  • BakaSQL does not like backticks (backquotes), these create problems within the bash script, however  there is a checkbox that you can use to quickly remove all of them from the DML before execution. If the damn developer likes to use MySQL reserved words as column names, you can check the backtick removal checkbox, run a dry run to wipe them all, then re-add the backticks only around the reserved names used as column, that will work and allow you to proceed
  • sometimes, when you receive DML through a ticket, there is some weird character in it. I have verified that Microsoft Word, and a few other editors, like to allow the user to specify UTF-8 characters that look exactly the same as their ASCII counterparts, but actually they are different. Specifiically, the double quote, the single quote and even the blank (space) character may have been replaced by some UTF-8 bizarre counterpart that BakaSQL will not digest. If you see weird behaviour when executing BakaSQL, check your DML code for those weird substitutes by copying the DML code into a 7 bit editor first
  • if the last statement in the DML does not get executed (or if the only statement in a single statement DML does not get executed) it is likely that the semicolon that should delimit the query (or the queries) is not there; please check that out!
Also, it is worth mentioning that there are a couple of edge cases where BakaSQL is not able to properly create rollback code:
  • the tool will not support changing (part of) a primary key with an update statement, as rollback code will be incorrect in this case; you will get an error if your DML tries to do that
  • if, in an update statement, the WHERE condition contains one of the columns touched by the update, rollback may be incorrect; BakaSQL will try to fix simple cases for you, but remember to always check the rollback code.... if this situation is detected, you will see a warning right before the rollback code pertaining to the offending statement
Although these limitations may seem too much of a burden at first, I can assure you that once your dev team gets acquainted with them, everything will work quite smoothly.


Variables


BakaSQL supports variables in the DML that is passed to it. Variable substitution happens in a recursive way, so it is also possible to have variables that reference other variables.  

Syntax for variable initialization is similar to standard MySQL, but only a specific syntax is supported by BakaSQL.  See following examples:

set @myvar1 =  (select val from rick.t1 where val=12121 and id = 1021);
set @myvar2 =  (select val from rick.t1 where id = 1385);
set @myvar3 =  (select 666);
insert into t1 (val) values (@myvar1);
insert into t1 (val) values (@last_insert_id), (@myvar2), (@myvar3);

Basically, the only way to initialize a variable will be :

set @variable = (select statement that returns one row);

but it is possible to assign one variable to another, using same syntax, as well to use a variable name in any place in a query that is part of your DML, as long as the variable has been initialized beforehand. Please note that you cannot reuse (reassign) variables, so you have to use a new variable name each time that you initialize a new one.

A special variable exists and is called @last_insert_id.  This variable is automatically initialized when you insert a row in a table which has an auto increment primary key, and you can use it as any other place in your DML. For this reason, LAST_INSERT_ID() function is not supported by the tool, as it is built in and automatically executed on your behalf if the tool detects that there is an auto increment column in the table that you are inserting into.

Installation


At this time, there is no packaging for BakaSQL (I think I already told you I'm a lazy guy), but its installation should be straightforward for anyone with basic Linux/UNIX skills. Basically, here is what you need to get it running:
  • a Linux/UNIX server (don't ask me about Windoze please)
  • Apache httpd (or nginx, or your preferred web server that supports CGI specs)
  • gcc compiler, to compile the BakaSQL utility
Installation requires that you know where your cgi-bin directory is. Place the bakasql.cgi script into it, make it executable by web server, and edit it to tailor the following variables that you find on top of the script:
  • HOSTFILE should point to a file that contains the list of the hostnames of your masters, one per line
  • BASE should point to a folder where BakaSQL will save logs, rollback code and other information; it should be writable by the user that the webserver runs under
  • MIN_REQ_CARDINALITY is the minimum cardinality, from 0 to 100, that you require for an index to be considered by BakaSQL good;  sometimes, especially for small tables, you will get a false negative about the cardinality of an index, please see Ninja Mode below for a workaround
  • BAKAUTILS is the path to the bakautils C program, see below
There are a few other variables that refer to the MySQL profiling instance, you can safely ignore these and comment out the profiling code that you will find in the profile_out() routine (just comment out the insert that you'll find there).

You also need to compile the bakautils.c program that contains the rewritten routines. You will need gcc for that purpose, just use the following command to compile the program:

      gcc -O3 -o bakautils bakautils.c

Once compiled, place the executable in the $BAKAUTILS path and ensure it is executable by the user your web server is running under.

Each user that wants to use BakaSQL needs MySQL privileges from the IP address of the web server that BakaSQL runs under.  The classic "INSERT,DELETE,UPDATE,SELECT" privileges are enough to use BakaSQL, no need to assign SUPER or other fancy privileges.

Make sure you create the $BASE/log and $BASE/rollback folders as I don't think that BakaSQL will create these for you :-)  These should be writable by the user the web server runs under. The former will contain log files (normal log and mysql debug log) and the latter will contain the rollback files for each DML successfully executed by BakaSQL. You will need to apply your own retention on these, as they are not deleted by BakaSQL.

When you are unsure what's going on (eg: BakaSQL seems to be hung during execution of a DML) the mysql debug log is  a good place to look into. It will contains a copy of anything that is sent by BakaSQL to your MySQL server, including server responses. If it seems stuck, check your MySQL server for locks, using SHOW ENGINE INNODB STATUS.

Ninja Mode


Sometimes you will see that BakaSQL  will refuse to execute your DML code and will say that there is no valid index, but an option called Ninja Mode will appear, allowing you to select its checkbox to override the limitation. This can be used in those situations where the cardinality of the index is low because the table is really small. Just ensure that the table is actually a small one, then click on the Ninja Mode checkbox and run execution again to override the cardinality check.


Where to get BakaSQL

You can always fetch the latest version of BakaSQL from my GitHub page here. Be sure to give me your feedback if you try it out!

Caveats


Although this tool has become an everyday tool for us, and is therefore stable and well tested, standard disclaimer applies. Your mileage may vary. Bugs may be there. Be sure to always check the rollback code using dry run before executing the changes. If you find bugs and/or have questions, I will be glad to help, just contact me using the form on the right side. Also, your comments are welcome!  I really hope this tool will help you as much as it helped me.

Enjoy!!





No comments:

Post a Comment