Aug 22: HashTables

In case you ever heard me talking about PHP internals I certainly mentioned something along the lines of "Everything in PHP is a HashTable" that's not true, but next to a zval the HashTable is one of the most important structures in PHP. While preparing my "PHP Under The Hood" talk for the Dutch PHP Conference there was a question on IRC about extension_loaded() being faster than function_exists(), which might be strange as both of them are simple hash lookups and a hash lookup is said to be O(1). I started to write some slides for it but figured out that I won't have the time to go through it during that presentation, so I'm doing this now, here:
You all should know PHP arrays. They allow you to create a list of elements where every element may be identified by a key. This key may either be an integer or a string. Now we need a way to store this association in an effective way in memory. An efficient way to store a collection of data in memory is a "real" array, an array of elements indexed by a sequence of numbers, starting at 0, up to n. As memory is essentially a sequence of numbered storage blocks (this obviously is simplified, there might be segments and offsets, there might be virtual pages, etc.) we can efficiently access an element by knowing the address of the first element, the size of the elements (we assume all have the same size, which is the case here), and the index: The address of the n-th element is start_address + (n * size_of_element). That's basically what C arrays are.
Now we're not dealing with C and C arrays but also want to use string offsets so we need a function to calculate a numeric value, a hash, for each key. An hash function you most likely know is MD5, MD5 is creating a 128 bit numeric value which is often represented using 32 hexadecimal characters. For our purpose 128 bit is a bit much and MD5 is too slow so the PHP developers have chosen the "DJBX33A (Daniel J. Bernstein, Times 33 with Addition)" algorithm. This hash function is relatively fast and gives us an integer of the value, the trouble with this algorithm is that it is more likely to produce conflicts, that means to string values which create the same numeric value.
Now back to our C array: For being able to safely read any element, to see whether it is used, we need to pre-allocate the entire array with space for all possible elements. Given our hash function returning a system dependent (32 bit or 64 bit) integer this is quite a lot (size of an element multiplied wit the max int value), so PHP does another trick: It throws some digits away. For this a table mask is calculated. The a table mask is a number which is a power of two and then one subtracted and ideally higher than the number elements in the hash table. If one looks at this in binary representation this is a number where all bits are set. Doing a binary AND operation of our hash value and the table this gives us a number which is smaller than our table mask. Let's look at an example: The hash value of "foobar" equals, in decimal digits, to 3127933054. We assume a table mask of 2047 (2¹¹-1).
3127933054 10111010011100000111100001111110 & 2047 00000000000000000000011111111111 = 126 00000000000000000000000001111110
Wonderful - we have an array Index, 126, for this string and can set the value in memory!
If life were that easy. Remember: We used a hashing function which is by far not collision free and then dropped almost two thirds of the binary digits by using the table mask. This makes it rather likely that some collisions appear. These collisions are handled by storing values with the same hash in a linked list. So for accessing the an element one has to
- Calculate the hash
- Apply the table mask
- locate the memory offset
- check whether this is the element we need, traverse the linked list.
Now the question initially asked was why extension_loaded() might be faster than function_exists() and we can tell: For many random reasons and since you have chosen a value which probably conflicts in the first, not in the second. So now the question is how often such collisions happen for this I did a quick analysis: Given the function table of a random PHP build of mine with 1106 functions listed I have 634 unique hash values and 210 hash values calculated from different functions. Worst is the value of 471 which represents 6 functions.
Full results are online but please mind: These results are very specific to my environment. Also mind that the code actually works on a copy of my function table so the table mask might be different, which changes results. Also note that the given PHP code won't work for you as I added special functions exporting the table mask and hash function to mz version of PHP. And then yet another note: doing performance optimizations on this level is the by far worst place as to many unknown factors go in. And you don't have any measurable performance win. Mind readability. Clear code is worth way more than the 2 CPU cycles you probably gain! But you may have learned how hash tables work.
Aug 20: References and foreach

References in PHP are bad, as I mentioned before, and you certainly should avoid using them. Now there is one use case which leads to an, at first, unexpected behavior which I didn't see as a real live issue when I stumbled over it at first, but then there were a few bug reports about it and recently a friend asked me about it ... so here it goes:
What is the output of this code:
<?php $a = array('a', 'b', 'c', 'd'); foreach ($a as &$v) { } foreach ($a as $v) { } print_r($a); ?>
We are iterating two times over an array without doing anything. So the result should be no change. Right? - Wrong! The actual result looks like this:
Array ( [0] => a [1] => b [2] => c [3] => c )
For understanding why this happens let's take a step back and look at the way PHP variables are implemented and what references are:
A PHP variable basically consists out of two things: A "label" and a "container." The label is the entry in a hash table (there are a few optimizations in the engine so it is not always in a hashtable but well) which may represents a symbol table of a function, and array or an object's property table. So we have a name and a pointer to the container. The container, internally called "zval", stores the value and some meta information, this container can also be a new hashtable with a set of labels pointing to other containers if we now create an reference this will cause a second label to point to the same container as another label. Both label from then on have the same powers of the container.
Now let's look at the situation from above. In a picture it looks like this:
So we have six containers (the global symbol table on the top, a container holding the array called $a on the left and one container for each element on the right) Now we start the first iteration. So the global symbol gets a new entry for $v and v is made a reference to the container of the first array element.
So an change to either $a[0] or $v goes to the same container and therefore has an effect to the other. When the iteration continues the reference is broken and $v is made a reference to the different elements. So after the iteration ends $v is a reference to the last element.
Remember: $v being a reference means that any change to $v effects the other references, in this situation $a[3]. Up till now nothing special happened. but now the second iteration begins. This one assigns the value of the current element to $v for each step. Now $v is a reference to the same element as $a[3] so by assigning a value to $v $a[3] is changed, too:
This continues for he next steps, too.
And now we can easily guess what will happen at the last step: $v is being assigned the value of the last element, $a[3], and as $a[3] is a reference to $v it therefore assignees itself to itself so effectively nothing happens.
And this is the result we saw above.
So to make this story full of pictures short: Be careful about references! They can have really strange effects.
Aug 19: MySQL at FrOSCon

Oh time is flying! - This weekend it is already time for FrOSCon, the Free and Open source Conference in St. Augustin close to Western Germany's former capitol Bonn. The conference consists out of a main track and different side tracks, like the PHP developer room and the OpenSQL sub-conference.
In the PHP developer room I will give an overview over things that happened at MySQL, especially in regards to PHP in recent times. My colleague Ulf Wendel will then go and talk about plugins to mysqlnd - the MySQL native driver for PHP - in detail.
In the OpenSQL Camp track you can find other interesting MySQL related talks which will, unfortunately, not leave you with enough time to watch all the interesting talks of the other tracks. And that all for an entrance fee of just 5€! So if you have a chance: Go there and say hi!
Aug 7: Scalar type hints in PHP trunk

So in my blog series I try to cover all additions to PHP trunk so I have to mention scalar type hints.
<?php function print_float(float $f) { echo $f."\n"; } for ($i = 1; $i < 5; $i++) { print_float( $i / 3 ); } ?>0.33333333333333
0.66666666666667
Catchable fatal error: Argument 1 passed to print_float() must be of the type double, integer given, called in typehints.php on line 7 and defined in typehints.php on line 2
Is expected behavior in PHP's trunk. If you want such a thing to work please use the numeric type hint.
In case that wasn't enought fun: There's more!
<?php function handle_result(int $i) { echo $i."\n"; } $pdo = new PDO("mysql:host=localhost;dbname=test", "user", "pass"); $pdo->setAttribute(PDO::MYSQL_ATTR_DIRECT_QUERY, false); $result = $pdo->query("SELECT 42 AS id"); $row = $result->fetch(); handle_result($row['id']); $pdo = new PDO("mysql:host=localhost;dbname=test", "user", "pass"); $pdo->setAttribute(PDO::MYSQL_ATTR_DIRECT_QUERY, true); $result = $pdo->query("SELECT 42 AS id"); $row = $result->fetch(); handle_result($row['id']); ?>42
Catchable fatal error: Argument 1 passed to handle_result() must be of the type integer, string given, called in typehints.php on line 16 and defined in typehints.php on line 2
So what happens here? - Depending on the PDO::MYSQL_ATTR_DIRECT_QUERY option PDO will either emulate prepared statements (which would be irrelevant here, but that's another story) or use native prepared statements. When using prepared statements MySQL switches over to its "binary" protocol which returns the native types, else it always returns strings. When using PDO_mysql linked against mysqlnd the native type is returned to PHP. If I would use libmysql both times a string would be returned. In other words: The behavior is system dependent. The default behavior for other drivers isn't defined either. As PHP is a dynamically typed language this shouldn't matter, so depending on your driver the results vary. Great. Oh and did I mention that the types aren't specified, so a later version of a driver might change it. And PDO is just a simple example for this ...
The key point of this all is that as soon as you mix strong typing - by using type hints - with PHP's weak type system you open a can of worms.
And to be overly clear: I can understand the need for strict type systems and see where such a thing helps. But adding a strong type system, stricter than most other languages (hey, you can't pass an integer where a float is expected!), makes little sense to me. But that's just me.
As in my previous posts in the series of blog posts about features in PHP trunk. Please try the snapshots. Mind that these features might (not) end up in the next release of PHP, feedback is welcome.