More on references

In a few places I saw comments on my last blog post, about references and performance, in which commenters noted that my example was pointless. That is of course true and, to some degree, the point.
I read a lot of PHP code, and from time to time I see people with a non-PHP background (or otherwise influenced) putting references everywhere they pass arrays or similar values, in order to prevent copies. I knew this was bad practice in PHP 5 and wanted to verify that it still is in PHP 7. Readers with a stronger PHP background wouldn't think of doing this, so their comments are along the lines of "what if I want to modify the data?", which might lead to something like this:
    function modify(&$data) {
        $data["foo"] = "bar";
    }

    $data = [ /* huuuuuge array */ ];
    modify($data);
In this code, from a performance perspective, the reference likely pays off and this is "fast." My primary criticism would be that references aren't idiomatic in PHP, so most people reading this code wouldn't expect $data to be changed by this function call. Luckily the name of the function gives this away, to some degree. The more idiomatic way might be along these lines:
    function modify($data) {
        $data["foo"] = "bar";
        return $data;
    }

    $data = [ /* huuuuuge array */ ];
    $data = modify($data);
I consider this more readable and clearer, although it creates a (temporary) copy, which costs more CPU time and peak memory. Now we have to decide how much clarity we want to give up in exchange for a performance gain. Suppose we have made that decision and gone for the approach with references. Later we fix an issue or add a new feature, make a slight change, and suddenly lose what we had gained before. Maybe we do something like this:
    function modify(&$data) {
        if (!in_array("bar", $data)) { // A copy happens here
            $data["foo1"] = "bar";
        }
        if (!in_array("baz", $data)) { // Yet another copy here
            $data["foo2"] = "baz";
        }
    }

    $data = [ /* huuuuuge array */ ];
    $data2 = $data;
    modify($data); // A copy happens here, to split $data and $data2
So the performance gain we once carefully produced backfired massively, and we even got three copies. In this short example this is quite obvious, but in a larger application with real-life changes, tracking this is really hard.
If we had written this in the (in my opinion) more idiomatic way this would look like this:
    function modify($data) {
        if (!in_array("bar", $data)) {
            $data["foo1"] = "bar"; // Maybe a copy here
        }
        if (!in_array("baz", $data)) {
            $data["foo2"] = "baz"; // Maybe copy here, but only if not copied above already
        }
        return $data;
    }

    $data = [ /* huuuuuge array */ ];
    $data2 = $data;
    $data = modify($data);
So depending on the conditions we might end up with either no or at most one copy, compared to the three copies from above. Of course this example is constructed but the point is: If you use references for performance you have to be extremely careful and know exactly what you're doing and think about each coming modification.
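To make the "no or at most one copy" behaviour visible, memory usage can be sampled around an assignment and the writes that follow it. This is only a rough sketch, assuming PHP 7 or later; the variable names and the array size are made up for illustration, and the exact byte counts vary between PHP versions and builds.

```php
<?php
// Illustrative sketch: observe copy-on-write with memory_get_usage().
$data = range(1, 100000);        // stand-in for the "huuuuuge array"

$m0 = memory_get_usage();
$data2 = $data;                  // both variables share one array, no copy yet
$m1 = memory_get_usage();
$data2["foo1"] = "bar";          // first write separates $data2: one copy
$m2 = memory_get_usage();
$data2["foo2"] = "baz";          // $data2 already owns its array: no further copy
$m3 = memory_get_usage();

$afterAssign      = $m1 - $m0;   // expected to be (close to) zero
$afterFirstWrite  = $m2 - $m1;   // expected to be roughly the size of the array
$afterSecondWrite = $m3 - $m2;   // expected to be (close to) zero again

printf("assignment:   %d bytes\n", $afterAssign);
printf("first write:  %d bytes\n", $afterFirstWrite);
printf("second write: %d bytes\n", $afterSecondWrite);
```

Only the first write pays for a copy; the assignment itself and any later writes to the already separated array are cheap.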
Now let's take a step back and think a bit more about this code. Isn't there yet another way? - We have data and we have functions operating on them. Wasn't there another construct which we might use? - Yes, we could go object-oriented!
    class DataManager {
        private $data;

        public function __construct() {
            $this->data = [ /* huuuuuge array */ ];
        }

        public function modify() {
            if (!in_array("bar", $this->data)) {
                $this->data["foo1"] = "bar";
            }
            if (!in_array("baz", $this->data)) {
                $this->data["foo2"] = "baz";
            }
        }
    }

    $dm = new DataManager();
    $dm2 = $dm;
    $dm->modify();
Suddenly we have a higher degree of abstraction, encapsulation and all those other OO benefits and no copy of the data at all. Ok, yes I cheated: I didn't remember the purpose of the $dm2 = $dm assignment any more. So maybe we need to clone there and create an explicit copy. (While then again - for the $data property we'd probably benefit from copy-on-write making even the cloning quite cheap)
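The difference between assigning an object and cloning it can be sketched as follows. The `Box` class and its methods are hypothetical stand-ins for the DataManager above, kept small so the effect is easy to see; this assumes PHP 5 or later object semantics.

```php
<?php
// Hypothetical stand-in for DataManager, holding a small array.
class Box
{
    private $data = [];

    public function set($key, $value)
    {
        $this->data[$key] = $value;
    }

    public function has($key)
    {
        return array_key_exists($key, $this->data);
    }
}

$a = new Box();
$alias = $a;       // plain assignment: both variables refer to the same object
$copy = clone $a;  // explicit copy: a separate object with its own $data property

$a->set("foo1", "bar");

var_dump($alias->has("foo1")); // bool(true): $alias sees the change
var_dump($copy->has("foo1"));  // bool(false): the clone is unaffected
```

Until one of the two objects writes to its array, the clone's property can share the underlying array with the original through copy-on-write, which is why cloning is usually cheap.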
In summary: Yes, when careful you can be slightly more performant in both CPU and memory usage, but in real life that gain is often lost again and eventually backfires in maintenance cost and performance loss.
Now aren't there cases where references might be a good thing? - The only good use I have found in recent times (apart from an extremely carefully crafted tree structure I've seen, for which I'd usually suggest an OO approach) is with anonymous functions/closures. Take this example:
    $data = [ /* ... */ ];
    $oldsum = 0;
    $doubled = array_map(function ($element) use (&$oldsum) {
        $oldsum += $element;
        return $element * 2;
    }, $data);
Again, the example in itself might be bad, but in a context like this, where we provide a closure as a callback and want to keep some "trivial" state, references are an acceptable approach. If the state we want to keep becomes more complex than a counter, however, it might be worthwhile to use an object to hold it or to find some other code structure.
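As a sketch of what "using an object to keep it" could look like, the callback can be an object with an `__invoke()` method that carries the state itself. The class name `DoublingCollector` and its accessors are invented for this illustration, not taken from any library.

```php
<?php
// Hypothetical callable object: doubles each element and tracks some state.
class DoublingCollector
{
    private $sum = 0;
    private $count = 0;

    // Called by array_map() once per element.
    public function __invoke($element)
    {
        $this->sum += $element;
        $this->count++;
        return $element * 2;
    }

    public function getSum()
    {
        return $this->sum;
    }

    public function getCount()
    {
        return $this->count;
    }
}

$data = [1, 2, 3];
$collector = new DoublingCollector();
$doubled = array_map($collector, $data);

echo $collector->getSum(), "\n";   // 6
echo $collector->getCount(), "\n"; // 3
```

Because objects are passed by handle, array_map() works on the same `$collector` instance, so no `use (&...)` reference is needed to get the state back out.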
Thursday, February 25. 2016 at 08:21
All you seem to focus on is performance, and that doesn't make much sense. For example, in the third example, you may never modify the array or create copies at all; it depends on what's already in the array. In your "corrected" example, you may end up copying huge arrays to add a single value.
Bottom line, don't use references as a means of affecting performance. But that doesn't mean you shouldn't use references at all, it just means you should have a practical reason to do so, meaning, not performance related. For example, multiple or optional return values in functions are a legitimate reason to use references. Granted, it's very rare, but there are other reasons to use references, and you haven't explored those at all.
The key takeaway from these two articles should be, don't choose references for performance reasons - the performance results are unpredictable with changing code... That's a fairly simple way to sum it up, and it doesn't mean references are evil - they're just misunderstood.
Thursday, February 25. 2016 at 15:08
And no, I'm not only thinking about performance. But with "stylistic" choices it becomes subjective. For readability I'd prefer the 2nd over the 1st example, even if it might cost some performance, while the "OO" approach from the 5th example is even nicer (although that depends on the full picture of the application).
For multiple return values my first concern would be that the function does too many things and some things should be split. If that's not possible returning an array (or object) might be better. But that has to be discussed with a precise example and has subjectivity.
Friday, February 26. 2016 at 08:17
If that's true, that's ridiculous. It shouldn't have to copy anything unless it makes a change - which in_array() does not. Do you know this for a fact?
Either way, yes, keeping your arrays privately inside an object is usually a better approach at large, since, for one, this lets you regulate what is allowed in the array.
I think the main problem with references is not the feature itself, but the fact that people choose it for misguided performance reasons. Actually this is a general problem in PHP; I think it stems from the fact that PHP historically had terrible performance, and it used to be somewhat necessary to garble your code with micro-optimizations. This is much less the case today - especially with PHP 7. We should think much more about making code that is simple and comprehensible, rather than thinking about performance details now. But unlearning takes a while for a large community.
Friday, February 26. 2016 at 13:22