This post is archived and probably outdated.

More on references

2016-02-23 14:02:00

In a few different places I saw comments on my last blog post about references and performance in which commenters noted that my example was pointless. That is of course true and, to some degree, the point.

I read a lot of PHP code, and from time to time I see people with a non-PHP background (or otherwise influenced) putting references everywhere they pass arrays or similar data in order to prevent copies. I knew this was bad practice in PHP 5 and wanted to verify it in PHP 7. Readers with a stronger PHP background wouldn't think of doing this, so their comments are along the lines of "what if I want to modify the data?", which might lead to something like this:

function modify(&$data) {
    $data["foo"] = "bar";
}
$data = [ /* huuuuuge array */ ];
modify($data);

In this code, from a performance perspective, the reference likely works out and this is "fast." My primary criticism would be that references aren't idiomatic in PHP, so most people reading this code wouldn't expect $data to be changed by this function call. Luckily the name of the function gives this away, to some degree. The more idiomatic way might be along these lines:

function modify($data) {
    $data["foo"] = "bar";
    return $data;
}
$data = [ /* huuuuuge array */ ];
$data = modify($data);

I consider this more readable and clearer, although it will create a (temporary) copy, leading to more CPU load and higher peak memory usage. Now we have to decide how much clarity we are willing to give up as a compromise for a performance gain. Suppose we have made that decision and gone for the approach with references. Later we fix an issue or add a new feature, make a slight change, and suddenly lose what we gained before. Maybe we do something like this:

function modify(&$data) {
    if (!in_array("bar", $data)) { // A copy happens here
        $data["foo1"] = "bar";
    }
    if (!in_array("baz", $data)) { // Yet another copy here
        $data["foo2"] = "baz";
    }
}
$data = [ /* huuuuuge array */ ];
$data2 = $data;
modify($data); // A copy happens here, to split $data and $data2

So the performance gain we once carefully produced has backfired massively, and we even got three copies. In this short example this is quite obvious, but in a larger application with real-life changes, tracking this is really hard.

If we had written this in the (in my opinion) more idiomatic way, it would look like this:

function modify($data) {
    if (!in_array("bar", $data)) {
        $data["foo1"] = "bar"; // Maybe a copy here
    }
    if (!in_array("baz", $data)) {
        $data["foo2"] = "baz"; // Maybe a copy here, but only if not copied above already
    }
    return $data;
}
$data = [ /* huuuuuge array */ ];
$data2 = $data;
$data = modify($data);

So depending on the conditions we end up with either no copy or at most one, compared to the three copies above. Of course this example is contrived, but the point is: if you use references for performance, you have to be extremely careful, know exactly what you're doing, and think through every future modification.
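If you want to see for yourself where a copy actually happens, one rough way is to watch memory_get_usage() around an assignment and around the first write. The following is only a sketch under my own assumptions: range(1, 100000) stands in for the huge array, and the exact byte counts depend on the PHP version and build, so none are shown here. Passing an array by value to a function behaves like the assignment in this respect.

```php
<?php
// Stand-in for the "huuuuuge array" from the examples above (assumption).
$data = range(1, 100000);

$before = memory_get_usage();
$data2 = $data;                    // no copy yet: both variables share one array
$afterAssign = memory_get_usage();
$data2[] = 42;                     // first write: PHP separates the two arrays here
$afterWrite = memory_get_usage();

$assignCost = $afterAssign - $before;      // memory taken by the assignment
$writeCost = $afterWrite - $afterAssign;   // memory taken by the first write

printf("assignment: %d bytes, first write: %d bytes\n", $assignCost, $writeCost);
```

The assignment should cost next to nothing, while the first write pays for the whole array. That is the copy-on-write behavior the by-value version of modify() relies on.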

Now let's take a step back and think a bit more about this code. Isn't there yet another way? - We have data and we have functions operating on them. Wasn't there another construct which we might use? - Yes, we could go object-oriented!

class DataManager {
    private $data;

    public function __construct() {
        $this->data = [ /* huuuuuge array */ ];
    }
    public function modify() {
        if (!in_array("bar", $this->data)) {
            $this->data["foo1"] = "bar";
        }
        if (!in_array("baz", $this->data)) {
            $this->data["foo2"] = "baz";
        }
    }
}
$dm = new DataManager();
$dm2 = $dm;
$dm->modify();

Suddenly we have a higher degree of abstraction, encapsulation and all those other OO benefits, and no copy of the data at all. OK, yes, I cheated: I no longer accounted for the purpose of the $dm2 = $dm assignment. Since objects are assigned by handle, $dm2 and $dm now refer to the same object, so maybe we need to clone there and create an explicit copy. (Then again, the $data property would probably benefit from copy-on-write, making even the cloning quite cheap.)
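To make the difference between the $dm2 = $dm assignment and an explicit clone concrete, here is a minimal sketch. DataHolder and its set()/has() methods are hypothetical names I made up for this illustration; they are not part of the DataManager class above.

```php
<?php
// DataHolder is a hypothetical stand-in for the DataManager class above.
class DataHolder {
    private $data = ["foo" => "bar"];

    public function set(string $key, string $value) {
        $this->data[$key] = $value;
    }

    public function has(string $key): bool {
        return array_key_exists($key, $this->data);
    }
}

$dm = new DataHolder();
$alias = $dm;        // copies only the object handle: still the same object
$copy = clone $dm;   // a new object; its $data array is shared until one side writes

$dm->set("foo1", "baz");

var_dump($alias === $dm);       // bool(true): same instance
var_dump($alias->has("foo1"));  // bool(true): the alias sees the change
var_dump($copy->has("foo1"));   // bool(false): the clone does not
```

So if the second variable is meant to be an independent snapshot, clone is the explicit way to say so, and the array inside is only duplicated once one of the two objects modifies it.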

In summary: yes, when careful you can be slightly more performant in both CPU and memory usage, but in real life that gain is often lost again and eventually backfires in maintenance cost and performance loss.

Now, aren't there cases where references might be a good thing? The only one I have found recently (apart from an extremely carefully crafted tree structure I've seen, for which I'd usually suggest an OO approach) is around anonymous functions/closures. Take this example:

$data = [ /* ... */ ];
$oldsum = 0;
$doubled = array_map(function ($element) use (&$oldsum) {
    $oldsum += $element;
    return $element * 2;
}, $data);

Again, the example in itself might be bad, but in a context like this, where we provide a closure as a callback and want to keep some "trivial" state, references are an acceptable approach. If the state we want to keep becomes more complex than a counter, however, it might be worthwhile to think about using an object to hold it, or to find some other code structure.
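As a sketch of what "using an object" could look like once the state grows beyond a single counter: the Stats class below is a hypothetical helper invented for this example, and the three-element array is just sample data. Because the closure captures the object handle, no & is needed in the use clause.

```php
<?php
// Hypothetical state holder, introduced only for this illustration.
class Stats {
    public $sum = 0;
    public $max = null;

    public function record($value) {
        $this->sum += $value;
        if ($this->max === null || $value > $this->max) {
            $this->max = $value;
        }
    }
}

$data = [3, 1, 4];   // sample data (assumption)
$stats = new Stats();

// The closure captures the object handle by value; both handles
// point at the same Stats instance, so no reference is required.
$doubled = array_map(function ($element) use ($stats) {
    $stats->record($element);
    return $element * 2;
}, $data);

var_dump($doubled);      // [6, 2, 8]
var_dump($stats->sum);   // int(8)
var_dump($stats->max);   // int(4)
```

The callback stays as short as the reference version, but the state now has a name and a place to grow without adding more use (&...) variables.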