Escaping from the statement mess
One of the issues Web Developers face is making their application robust to prevent SQL injection attacks. Different approaches exist which help. Sometimes people use large abstraction layers (which, sometimes, don't make anything safe ...) and sometimes people use prepared statements as a way to secure queries. Now prepared statements were a nice invention some 30 years ago abut they weren't meant for making things secure and so they do have some shortcomings: One issue is that preparing and executing a query adds a round-trip to the server where it then requires resources. In a classic application this is no issue. The users starts the application up early in the morning and processes data multiple times so the prepared statement handle is re-used quite some time. The system benefits from early optimisations. In a typical PHP Web application this isn't the case. A request and therefore a database connection with its associated statement handles lives way less than a second before being thrown away. The PDO MySQL driver, by default, tries to improve that by emulating the prepared statement on the client side. This emulation faces issues as it is lacking the knowledge of what's valid SQL which can lead to strange behaviour (The simple example is $pdo->prepare("SELECT * FROM t LIMIT ?")->execute(array($_GET['count'])); which will emit an SQL syntax error) and inherits limitations from prepared statements. A second issue with prepared statements is that queries are being built dynamically. A common case which is hard to do with prepared statements is the IN() clause with a dynamic amount of values. With prepared statements you first have to build the list of place holders (the exact amount of place holders (?) separated by a comma, without trailing comma) and then bind the values and mind the offsets when having other values - this typically becomes ugly code.
So why not take a step back. - Let's not try to emulate prepared statements but try to make it simpler to construct queries while escaping data?
An API for doing this might follow the sprintf() semantics and look like this;
$sql = mysqli_format_query($mysqli, "SELECT * FROM t WHERE f1 = %s AND f2 = %i", "foobar", 23);
which would return a string
SELECT * FROM t WHERE fi = 'foobar' AND f2 = 23
which can safely be send to the database. As said the IN clause should work. as we're in PHP we might simply extend it to do this:
$sql = mysqli_format_query($mysqli, "SELECT * FROM t WHERE f1 IN (%s)", array("foobar", 23));
SELECT * FROM t WHERE f1 IN ('foobar', '23')
Well doesn't look fancy? - But there's more: By not pretending to emulate prepared statements we can easily work with more dynamic queries. Something along the lines of
$sql = mysqli_format_query($mysqli, "SELECT * FROM t WHERE uid = %i", $_SESSION['uid']);
if (isset($option['option1']) {
$sql .= mysqli_format_query($mysqli, "AND option1 = %s", $option['option1']);
}
if (isset($option['option2']) {
$sql .= mysqli_format_query($mysqli, "OR option2 = %s", $option['option2']);
}
Doing such a thing using prepared statements or in some classic way becomes way harder to maintain. For playing with this approach I quickly cooked up a simple implementation of that logic which should work well with PHP 5.3 and mysqli:
function mysqli_format_query(mysqli $conn, $query /*, ... */) {
$args = func_get_args();
array_shift($args);
array_shift($args);
return mysqli_format_query_v($conn, $query, $args);
}
function mysqli_format_query_v(mysqli $conn, $query, $args) {
$modify_funcs = array(
's' => function($v) use ($conn) { return "'".$conn->real_escape_string($v)."'"; },
'i' => function($v) { return (int)$v; },
'f' => function($v) { return (float)$v; }
);
return preg_replace_callback(
'/%(['.preg_quote(implode(array_keys($modify_funcs))).'%])/',
function ($matches) use ($conn, &$args, $modify_funcs) {
if ($matches[1] == '%') {
return '%';
}
if (!count($args)) {
throw new Exception("Missing values!");
}
$arg = array_shift($args);
if ($arg instanceof Traversable) {
$arg = iterator_to_array($arg);
$arg = array_map($modify_funcs[$matches[1]], $arg);
return implode(', ', $arg);
} elseif (is_array($arg)) {
$arg = array_map($modify_funcs[$matches[1]], $arg);
return implode(', ', $arg);
} else {
$func = $modify_funcs[$matches[1]];
return $func($arg);
}
},
$query
);
}
Which leads to questions like these:
- Is this useful at all?
- Is it useful to have in the driver? (aka. in C as part of the PHP distribution)
- What are the exact requirements?
- This might have to much magic in there (and older version required %as for binding and array of strings, now just %s and passing an array is enough)
- Should it have special handling for NULL?


Wednesday, May 18. 2011 at 15:18 (Link) (Reply)
Wednesday, May 18. 2011 at 15:25 (Reply)
Wednesday, May 18. 2011 at 15:28 (Reply)
When looking at Zend_DB and Doctrine2 documentation and their explanations why bind-APIs are so much superior to plain vanilla sprintf() + real_escape_string() because it is soo much less code with the bind-API, this format_query() idea came to my mind and I felt like we should add it to the mysqli extension immediately. mysqli_format_query() gives you short and readable code.
Sunday, May 22. 2011 at 16:49 (Link) (Reply)
Wednesday, May 18. 2011 at 16:08 (Reply)
Wednesday, May 18. 2011 at 16:24 (Reply)
Wednesday, May 18. 2011 at 17:08 (Link) (Reply)
argument swapping or named substitution support would be an even better solution for me.
and use one arguments array not variable number of arguments (ala vsprintf instead of sprintf)
Tyrael
Wednesday, May 18. 2011 at 17:13 (Reply)
The "v" version is implemented in the example.
Wednesday, May 18. 2011 at 17:16 (Link) (Reply)
As far as the performance factor, you're talking one additional TCP round trip. That's adding maybe 5ms on average to a query. Not to mention that hardware is cheap, so err on the side of security and use true prepared statements (MySQLi) and be better off on all fronts. Otherwise it seems to me to be nothing more than a micro-optimization (which is evil in most cases)...
Thursday, May 19. 2011 at 10:31 (Reply)
There's a server behind which has to do the prepare. It is not said that prepare will be a cheap and fast operation in MySQL forever.
And, the prepare occupies server resources.
Wednesday, May 18. 2011 at 22:02 (Link) (Reply)
http://pooteeweet.org/blog/1201
Notice also the comment from Brian pointing to his thoughts:
http://krow.livejournal.com/599921.html
I think one thing is key with whatever we do, it shouldnt be an island solution. We should work with all OSS RDBMS and other scripting languages to work this out. Heck might even be relevent for some NoSQL DB's. Its certainly also relevant for JCR with their SQL2 syntax.
Wednesday, May 18. 2011 at 22:11 (Link) (Reply)
https://github.com/doctrine/dbal/blob/master/lib/Doctrine/DBAL/Connection.php#L689
https://github.com/doctrine/dbal/blob/master/lib/Doctrine/DBAL/SQLParserUtils.php#L86
Its actually much more complex than your solution because it still creates a prepared SQL and binds the parameters using type constants.
I think your API looks really neat, but its missing two features:
1. %s should be allowed in literals without quoting.
2. You should be able to extend the range of possible parameters. %d = DateTime, %g = MyGeoCoords class that can be put into a binary value and such.
Yes, please implement this on the C level, preferrable in a thin layer that every driver can use to emulate prepared statements (PDO2? I talked to it with Ulf). It would make my life much easier
Friday, May 20. 2011 at 13:51 (Link) (Reply)
I thought about date, date always has the issue of handling the timezone properly. And a single date format is probably not enough - do you want the timestamp or date or date and time? __toString is called - might help sometimes (not always, I know)
Thursday, May 19. 2011 at 09:18 (Reply)
- parameters are not passed to the escape function using unnamed parameters and accessing them via func_get_args. Instead, all parameters are passed as an array. This has proven to be slightly faster than using func_get_args() all the time.
- in our implementation, the parameter type matters, e.g. all strings are automatically put inside quotes, and integers are not. That means the replace function has a switch/case to handle the different PHP datatypes. Arrays are handled too by recursively invoking the replace function for the elements.
- in our implementation, the types are not distinguished in the query string (e.g. %s vs. %i), but the types are determined by the actual runtime data types of the parameters.
- however, parameters for use in a LIKE statement have to be escaped in a slightly differently way than other parameters, e.g. in a like clause, the % and _ characters themselves must be escaped, but in other parts of the query, they must not be escaped. This cannot be handled by just using one %s wildcard but you might need two different wildcards for that (i.e. one special wildcard for parameters to be used in a LIKE clause.
- simply type casting using sprintf has a few issues, e.g.: php -r "var_dump(sprintf('%f',1.0e998));"
string(3) "INF"
- one might or might not want to handle NULL values. Making the escape function capable of handling NULL values might require a change to the SQL query (e.g. transforming a WHERE clause "=NULL" should rather be an "IS NULL" clause instead) and this shouldn't be the task of an escape function. Instead, the escape function should probably fail hard when one of the parameters to be injected is a PHP null value.
Friday, May 20. 2011 at 13:46 (Link) (Reply)
I don't like the type auto-magic too much. This leads to issues like the LIMIT case where a numeric literal is expected, but you might accidentally pass a numeric string. Additionally it implicitly documents the logic. But well, that's of course matter of taste. I'm interested in more opinions
The LIKE comment is interesting. didn't think about that, yet.
The INF case is related to all error handling. Passing an not expected type in my code above you might have some "bad" behaviour. As there are no real checksright now. Didn't think through all cases, yet.
For NULL I was (and am) hoping for a great idea, that's why I mentioned it.
Thursday, May 19. 2011 at 10:47 (Link) (Reply)
Friday, May 20. 2011 at 13:38 (Link) (Reply)
The connection handle is needed for using the proper charset.
Friday, May 20. 2011 at 00:39 (Reply)
Friday, May 20. 2011 at 01:43 (Reply)
Btw, Perl's DBI has had support for caching prepared statements for long time. Of course they weren't really server-side for Mysql until recently.
Ajit
Friday, May 20. 2011 at 10:27 (Reply)
Sunday, May 22. 2011 at 18:59 (Reply)
Wednesday, June 22. 2011 at 10:07 (Reply)
Do you think its likely that a native query formatting function (similar to the above) will be integrated into an upcoming PHP version any soon?
I'd still consider some function like the above useful because all home-brewn PHP-based query formatting solutions could be replaced by a (hopefully better) standard solution then.