Using References to Improve Performance in PHP

miqrogroove
2012-02-27T02:31:09+00:00

Robert Chapin

Michigan

Draft 5/20/2011.  This paper has not been peer reviewed.  Please do not copy without author’s permission.

Abstract

Software coding patterns were analyzed to determine if the PHP scripting engine would demonstrate optimal performance when explicit variable references were used.  Several misconceptions about variable referencing in PHP were disproved.  A performance gain of 8668% was realized in the “copy-on-write” case when an explicit reference was used rather than allowing the implicit copy operation to occur.  In the “copy-on-read” case, all but one circumstance resulted in a performance penalty of less than 5% when an explicit reference was used rather than allowing the implicit reference to be used.  When processing large strings on the order of one megabyte, the “copy-on-read” performance penalty became a significant concern at 1298%.  More significant was a 10434% performance penalty discovered in function call chaining.  The PHP engine did not optimize function chains, with or without any combination of explicit references.

Introduction

If the topic of variable references makes your eyes cross, you are not alone.  I have more than 10 years of experience writing website applications in PHP, and I still haven’t found a proper explanation of the variable referencing capability in this language.

I now intend to put to rest many of the mysteries and misconceptions about the performance implications of PHP variable references.

My personal preconceptions about the reference symbol (&) come from the C and C++ languages, as well as some experience using Intel assembly in MASM.  Without a doubt, string variables are not handled elegantly by any computer processor (CPU) because of their size relative to the bus width.  Strings are, and always should be, thought of as memory constructs like arrays of individual characters.  For this reason, strings should never be copied unnecessarily.  To achieve optimal performance, the programmer must be conscious of all syntactical nuances that might lead to such unnecessary work being done by the CPU.

Contrary to that logic, various PHP authorities have stressed the idea that referencing a variable to prevent it being copied actually makes PHP run slower due to super smart inner workings of the script engine.  From the PHP Manual itself, “Note that passing by reference doesn’t speed up your php script. PHP is smart enough not to simply copy data every time the language requires it.”  Currently, the most authoritative guidance I’m aware of on this topic is Schlüter (2010), who literally says, “References in PHP are bad. Do not use them.”

As a result of this culture that believes references are bad in PHP, it seems most programmers now avoid opportunities to use references in any and all situations.  Over at WordPress, the developers have been busy removing some of the object variable references, now considered obsolete, but seem to give no thought to how their strings are being passed between functions.  I had semi-formally hinted at this problem, but as far as I know, nobody had the spare time to test the various hypotheses.  Even before that, at XMB, I had littered the code base with new variable references on the assumption that they would be beneficial in the same ways as in the C language.  Since then, I have stopped adding new references due to the uncertainty of whether they are beneficial or not.

Code Patterns

Empirical testing requires formal definitions of performance gain or detriment, and explicit division of circumstances that may involve the same variable references but are syntactically unique.  I have no reason to assume that all references are created equal in the PHP engine.  The purpose of testing these various circumstances is not to infer or determine the inner workings of the PHP engine, but to determine if the same end result can be obtained in less time by using a particular “optimal” pattern.

Pattern Group 1: “On Write”

Much speculation pivots on the distinction between reading and modifying a single string variable, so I will repeat each pattern with and without a string value change.

Pattern 1: Pass by value, return a modified value to the same variable.  This seems to be the preferred function call pattern in PHP.

function myfunction($input);
$mystring = myfunction($mystring);

Pattern 2: Pass by reference, return a modified value to the same variable.  In this case, the function signature references the first parameter.  The function does not modify the $input parameter.  An explicit duplicate is created.

function myfunction(&$input);
$mystring = myfunction($mystring);

Pattern 3: Pass by reference, modify the value of the reference.  The start and end values of $mystring will be the same as in Patterns 1 and 2.

function myfunction(&$input);
myfunction($mystring);

Pattern 4: Pass by reference, modify the value of the reference, and also return that variable.  The utility of both modifying and returning a referenced parameter may be rare to none, but this pattern can be used in comparison with Patterns 19 and 20 to determine the effect in function chaining.  While this pattern should guarantee $newstring becomes a duplicate of $mystring, I do not know if such a duplicate is generated for a subsequent “on read” function chained with this “on write” function.

function myfunction(&$input);
$newstring = myfunction($mystring);
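
The signatures in this group leave out the function bodies.  As a minimal sketch, assuming the modification is the single-character append described later in the Discussion, Patterns 1 and 3 might be written as follows.  The names appendByValue and appendByReference are hypothetical and were chosen only so that both versions can live in one script.

```php
<?php
// Pattern 1 style (assumed body): pass by value, return the modified copy.
function appendByValue($input) {
    $input .= 'a';
    return $input;
}

// Pattern 3 style (assumed body): pass by reference, modify the caller's variable.
function appendByReference(&$input) {
    $input .= 'a';
}

$first = 'test';
$first = appendByValue($first);   // Pattern 1 call style

$second = 'test';
appendByReference($second);       // Pattern 3 call style

// Both call styles leave the same value in the caller's variable.
var_dump($first === $second);
```

The end state is identical in both cases; only the route taken by the string differs, which is what the benchmark measures.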

Pattern Group 2: “On Read”

These patterns are essentially the same as before, but with no changes being made to the string variable.

Pattern 5:  Pass by value, test the value, and return the test result to a different variable.

function myfunction($input);
$result = myfunction($mystring);

Pattern 6: Pass by reference, test the value, and return the test result to a different variable.

function myfunction(&$input);
$result = myfunction($mystring);

Pattern 7: Pass by value, test the value, do nothing.  This pattern will establish baseline performance of a “simple” function call.

function myfunction($input);
myfunction($mystring);

Pattern 8: Pass by reference, test the value, do nothing.

function myfunction(&$input);
myfunction($mystring);
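
The kind of test performed by the “on read” functions is not specified above.  The sketch below assumes a simple non-empty check, with hypothetical function names, to show that Patterns 5 and 6 differ only in the parameter declaration.

```php
<?php
// Pattern 5 style: the parameter is received by value and only read.
function isNonEmptyByValue($input) {
    return strlen($input) > 0;   // assumed test, not taken from the original script
}

// Pattern 6 style: same body, but the parameter is declared by reference.
function isNonEmptyByReference(&$input) {
    return strlen($input) > 0;
}

$mystring = 'example';
$resultByValue = isNonEmptyByValue($mystring);
$resultByReference = isNonEmptyByReference($mystring);

// Neither call changes $mystring, and both return the same test result.
var_dump($resultByValue, $resultByReference, $mystring);
```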

Pattern Group 3: “Return by Reference”

One of the hypotheses to be stated below is that there is no benefit from returning references.  Therefore, patterns containing return-by-reference syntax are also tested.

Pattern 9: Pass by value, return a modified value, by reference, to the same variable.

function &myfunction($input);
$mystring = &myfunction($mystring);

Pattern 10: Pass by reference, return a modified value, by reference, to the same variable.  The function does not modify the $input parameter.  An explicit duplicate is created.

function &myfunction(&$input);
$mystring = &myfunction($mystring);

Pattern 11:  Pass by value, test the value, and return the test result, by reference, to a different variable.

function &myfunction($input);
$result = &myfunction($mystring);

Pattern 12: Pass by reference, test the value, and return the test result, by reference, to a different variable.

function &myfunction(&$input);
$result = &myfunction($mystring);
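
Return-by-reference needs the ampersand in two places: on the function declaration and on the assignment that receives the result.  A minimal sketch of the Pattern 9 style, assuming the same single-character append and a hypothetical function name:

```php
<?php
// Pattern 9 style (assumed body): pass by value, return by reference.
function &appendAndReturnRef($input) {
    $input .= 'a';
    return $input;   // a reference can only be returned for a variable
}

$mystring = 'test';
// The & on the assignment binds $mystring to the returned variable.
$mystring = &appendAndReturnRef($mystring);

echo $mystring, "\n";
```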

Pattern Group 4: “Implicit References by Function Chaining”

PHP does not allow explicit parameter referencing in certain situations.  It is inevitable that these situations will have some performance implications relative to the patterns above.  The increasing complexity of these patterns makes them more relevant to practical application.  All of the patterns in this group will attempt to modify a string, first by one function, and then by passing that function’s return value to a second function that will test its value.  This will be done using syntactical variations of essentially the same algorithm, with the proviso that the signature of the second function cannot be modified because of a design requirement to be able to accept input from another function.  If significant performance variations are found among these patterns, it would imply that some of the language constructs in PHP are inadequate and should be avoided as a best practice.

Pattern 13: Pass by value, return a modified value to the same variable, pass by value again.

function myfunction($input);
function myroutine($input);
$mystring = myfunction($mystring);
myroutine($mystring);

Pattern 14: Pass by value, return a modified value to a second function.

function myfunction($input);
function myroutine($input);
myroutine(myfunction($mystring));

Pattern 15: Pass by reference, return a modified value to the same variable, pass by value.  The function does not modify the $input parameter.  An explicit duplicate is created.

function myfunction(&$input);
function myroutine($input);
$mystring = myfunction($mystring);
myroutine($mystring);

Pattern 16: Pass by reference, return a modified value to a second function.  The function does not modify the $input parameter.  An explicit duplicate is created.

function myfunction(&$input);
function myroutine($input);
myroutine(myfunction($mystring));

Pattern 17: Pass by value, return a modified value to a different variable, pass by value again.

function myfunction($input);
function myroutine($input);
$newstring = myfunction($mystring);
myroutine($newstring);

Pattern 18: Pass by reference, return a modified value to a different variable, pass by value.  The function does not modify the $input parameter.  An explicit duplicate is created.

function myfunction(&$input);
function myroutine($input);
$newstring = myfunction($mystring);
myroutine($newstring);

Pattern 19: Pass by reference, return a modified value to a different variable, pass by value.  This function does modify the $input parameter, and also returns that variable.  This is done to distinguish this pattern from Patterns 3, 18, and 20.  See also Pattern 4 for a performance baseline.

function myfunction(&$input);
function myroutine($input);
$newstring = myfunction($mystring);
myroutine($newstring);

Pattern 20: Pass by reference, return a modified value to a second function.  The first function does modify the $input parameter, and also returns that variable. See also Pattern 4 for a performance baseline.

function myfunction(&$input);
function myroutine($input);
myroutine(myfunction($mystring));
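
To make the chained and unchained call styles concrete, here is a minimal sketch of Patterns 13 and 14 side by side.  The function bodies are assumptions (a one-character append and a length test), and the names are hypothetical.

```php
<?php
// First function: modifies its input and returns the new value.
function appendChar($input) {
    $input .= 'a';
    return $input;
}

// Second function: only tests the value it receives (assumed test).
function isLongerThanFour($input) {
    return strlen($input) > 4;
}

// Pattern 13 style: store the intermediate value, then pass it on.
$mystring = 'test';
$mystring = appendChar($mystring);
$unchained = isLongerThanFour($mystring);

// Pattern 14 style: pass the return value straight into the second function.
$original = 'test';
$chained = isLongerThanFour(appendChar($original));

var_dump($unchained === $chained);
```

Both styles compute the same answer, so any timing difference between them comes from how the engine moves the string around.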

Hypotheses

In the hope of making my test results as impartial as possible, I brainstormed the possible implications of any one code pattern performing better than another.

  1. Passing a large string variable into a user-defined function will cause execution time to more than double, depending on argument referencing.
  2. Performance changes are negligible or opposite for small string variables, creating a “dilemma of scale”.
  3. String variables are implicitly referenced in functions that do not modify the variable, so parameter referencing does nothing.
  4. String variables are implicitly duplicated in functions that modify the variable, so referencing is always more efficient when the function’s sole purpose is to modify a variable.
  5. Variable referencing is always detrimental to performance.
  6. Returning a variable by reference never improves performance.  (This is clearly stated in the PHP Manual).
  7. Function “chaining” obviates the need for parameter referencing, the two of which are not allowed together in PHP.

Methods

A simple procedural script was developed to test each code pattern, one after another, and then report the time required to execute each of them.  To account for other variables that could affect the results, each pattern was run 1000 times in a simple loop.  To achieve an acceptable error of no more than 3%, several trials of the test script were run and the minimum times needed for each pattern were collected.  The total execution times of each trial and the average and minimum times for patterns across trials were monitored to ensure the results were consistent.

The procedure was run on a single x86 PHP server, version 5.3.1.

To determine the implications of operating on strings of different sizes, the procedure was executed using four different input strings.  The first string was equivalent to a plain-text term paper, 28 kB long.  The second string was equivalent to a large image file, 1161 kB long.  Due to the time needed to copy such large strings, the patterns were each looped 100 times instead of 1000 times.  The third string was equivalent to an e-mail message or forum post, 621 bytes long.  The fourth string was equivalent to a text box input or username, 12 bytes long.  These two shorter strings were looped 5000 times instead of 1000 times to achieve a low error.  Results presented below were adjusted for the number of loops so that they were all based on the same unit of measurement.
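
The test script itself is not reproduced here.  The following is only a sketch of the procedure as described: loop a pattern many times, repeat for several trials, keep the minimum, and scale to a common unit.  The helper names, the use of a closure to hold the pattern, and the trial count of 5 are all assumptions; the closure call also adds overhead that a hand-written loop would not have.

```php
<?php
// Hypothetical helper: run $pattern $loops times per trial and
// return the minimum elapsed time, in seconds, across $trials trials.
function benchmarkPattern($pattern, $input, $loops, $trials) {
    $best = null;
    for ($trial = 0; $trial < $trials; $trial++) {
        $mystring = $input;              // fresh working variable for each trial
        $start = microtime(true);
        for ($i = 0; $i < $loops; $i++) {
            $pattern($mystring);
        }
        $elapsed = microtime(true) - $start;
        if ($best === null || $elapsed < $best) {
            $best = $elapsed;
        }
    }
    return $best;
}

// Scale a measured time to seconds per 100,000 loops.
function normalizeTo100k($seconds, $loops) {
    return $seconds * 100000 / $loops;
}

// Pattern 3 style body (assumed): append one character in place.
$pattern3 = function (&$input) {
    $input .= 'a';
};

// A 621-byte string looped 5000 times, as for the forum-post case.
$seconds = benchmarkPattern($pattern3, str_repeat('x', 621), 5000, 5);
printf("%.3f seconds per 100,000 loops\n", normalizeTo100k($seconds, 5000));
```

Taking the minimum rather than the average discards trials that were slowed down by unrelated activity on the server.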

Results

All of the benchmark values are presented in seconds per 100,000 execution loops, with a precision of three decimal places, and an error of +/-3%.  Smaller numbers are better, indicating less time was needed.

Pattern #   Short String   Forum Post   Term Paper   Image File
1           0.261          0.286        1.006        344.703
2           0.240          0.263        0.948        351.516
3           0.096          0.095        0.100        3.976
4           0.250          0.278        1.232        466.436
5           0.138          0.137        0.138        0.349
6           0.140          0.141        0.145        4.530
7           0.122          0.121        0.122        0.143
8           0.122          0.125        0.125        4.433
9           0.269          0.289        1.264        468.826
10          0.244          0.260        1.281        467.267
11          0.138          0.140        0.139        0.168
12          0.137          0.140        0.144        4.457
13          0.379          0.404        1.513        461.929
14          0.226          0.280        1.326        451.810
15          0.357          0.383        1.369        465.825
16          0.225          0.289        1.282        473.746
17          0.235          0.289        1.843        452.194
18          0.237          0.298        1.453        466.920
19          0.352          0.446        1.236        466.219
20          0.347          0.365        1.156        466.093

Discussion

Pattern 1, the baseline, uses a very plain function to compute a new value of a given variable by adding the character “a” to the end of it, then replaces that variable’s value with the new value.  If PHP’s sole purpose in life was to carry out that one computation, it would be useful to know if Pattern 1 was the fastest way to accomplish that purpose.  Is it?  Absolutely not.  Hypothesis #1 is therefore true, and hypothesis #5 is false.

In particular, Pattern 1 computed the large image file in a time of 345 seconds per 100,000 loops.  The time difference when using Pattern 2 was within the margin of error.  Therefore, simply adding a parameter reference to the function signature has no significant impact on processing large files.  This makes perfect sense because both Pattern 1 and Pattern 2 force PHP to duplicate the string variable “on write.”

However, when the duplicate string variable and the return value are eliminated, as in Pattern 3, the computation time decreases to 3.98 seconds, or 87 times faster than Pattern 1.  The performance boost varies with string sizes, but the results show that Pattern 3 is always significantly faster than Pattern 1.  Therefore, hypothesis #4 is true.  Code having this pattern …

$mystring = myfunction($mystring);

… should always be replaced with a function that can be called like …

myfunction($mystring);  // Passed by reference
function myfunction(&$input) {
    $input .= 'a';
}
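
The speed-up of Pattern 3 over Pattern 1 can be recomputed for every string size directly from the Results table.  This is plain arithmetic on the published figures; only the variable names are mine.

```php
<?php
// Seconds per 100,000 loops, copied from the Results table.
$pattern1 = array(
    'Short String' => 0.261,
    'Forum Post'   => 0.286,
    'Term Paper'   => 1.006,
    'Image File'   => 344.703,
);
$pattern3 = array(
    'Short String' => 0.096,
    'Forum Post'   => 0.095,
    'Term Paper'   => 0.100,
    'Image File'   => 3.976,
);

// How many times faster Pattern 3 is than Pattern 1 at each size.
$speedup = array();
foreach ($pattern1 as $size => $seconds) {
    $speedup[$size] = $seconds / $pattern3[$size];
    printf("%-12s %.1f times faster\n", $size, $speedup[$size]);
}
```

The ratios come out at roughly 2.7, 3.0, 10.1, and 86.7, so the gain grows with string size but never drops below a factor of two.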

Can the same performance boost be arrived at by returning a reference?  Technically yes, but in order to get that boost it is still necessary to reference the function’s parameter.   So in practical terms, no, returning references does not improve performance.  Only returning references that were passed by reference improves performance.  The results show that Patterns 9 and 10 were both 35% slower than Pattern 1, making Pattern 3 up to 118 times faster than a return by reference pattern.  Hypothesis #6 is therefore true.  See below for an exception to this conclusion.

Can the same performance boost be arrived at through function chaining?  Again no, but here the situation is significantly worse.  Let’s say instead of simply adding a character to the end of a variable, PHP’s sole purpose in life was to add the character and then pass the new value into a second function.  If the second function is not being called from inside the first function, then the updated string value has to be passed from function to function at the global (calling) scope.  One way this could be accomplished is to execute Pattern 3 followed by Pattern 5.  By adding the results of those two patterns, it can be determined that 4.33 seconds per 100,000 loops would be required to carry out those two computations.

The comparable function chain, Pattern 14, is 104 times slower than calling the two functions individually.  It seems that function chaining is the real bogeyman of the PHP language, whereas variable references can be hugely beneficial.  Hypothesis #7 is false.  Therefore, all other things being equal, code having this pattern …

myroutine(myfunction($mystring));

… should always be replaced with a pass-by-reference function and a separate subroutine call like …

myfunction($mystring);  // Passed by reference
myroutine($mystring);
function myfunction(&$input) {
    $input .= 'a';
}
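
The chaining comparison rests on two figures for the large image file: the sum of Patterns 3 and 5, and the ratio of Pattern 14 to that sum.  Both can be checked against the Results table with a few lines of arithmetic (variable names are mine).

```php
<?php
// Image File column, seconds per 100,000 loops, from the Results table.
$pattern3  = 3.976;    // modify in place, by reference
$pattern5  = 0.349;    // read by value
$pattern14 = 451.810;  // chained call

// Cost of running the two functions as separate calls.
$separate = $pattern3 + $pattern5;

// How many times slower the chained form is.
$ratio = $pattern14 / $separate;

printf("separate calls: %.3f s, chained: %.3f s, ratio: %.0f\n", $separate, $pattern14, $ratio);
```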

Do these findings hold true for a situation where the original string value and the modified string value need to be maintained in separate variables?  Yes and no.  There seems to be no penalty for doing it this way, but there is no significant difference between chained and unchained functions using two separate variables.  The conclusion is that the time cost of duplicating a string is the same whether it is done explicitly or due to PHP’s failure to optimize function chains.  A barely significant penalty of 3.3% for using a reference shows up between Patterns 17 and 18.  This seems relatively inconsequential compared to the 8668% penalty that was incurred by not using a reference in Pattern 1.  The important factor of multiple function calls is that they should never be chained if a string duplication can be prevented by explicitly referencing the variable.  Of course, if the argument being passed to the function cannot be permitted to change, then there is no way to avoid creating a new (duplicate) variable for the function’s return value.

What about hypothesis #3?  Does parameter referencing do anything in the “on read” case?  The conclusion is: sometimes yes, sometimes no.  Ignoring the largest string size being tested, the large image file, all other cases showed that there was little or no difference between reading a referenced or non-referenced parameter.  For the short string of 12 bytes, time differences did not reach significance.  The forum-post-size string showed a 3.3% penalty for using references on read, and the term-paper-size string showed a 4.8% penalty.  Again, these cases seem inconsequential.  With a large-image-size string, parameter reading may become a concern.  In this one case, the pass by value Pattern 5 performed 13 times better than the pass by reference Pattern 6.

I overlooked the concept of string concatenation when developing the 20 code patterns.  This is okay because the timing details are not terribly interesting, and string concatenation performance could be a topic in and of itself.  However, while analyzing the results of this research I did come across one crucial corner case that pertained to references in PHP.  It turns out that any instance of concatenation involving the return value of a function …

$test = 'This is a test' . myfunction($mystring);  // Returned by value

… should always be replaced either by a single variable reference …

myfunction($mystring);  // Passed by reference
$test = 'This is a test' . $mystring;
function myfunction(&$input) {
    $input .= 'a';
}

… or by returning a reference from the function itself …

$test = 'This is a test' . myfunction($mystring);  // Passed by reference, Returned by reference
function &myfunction(&$input) {
    $input .= 'a';
    return $input;  // Without this, there is no value to concatenate
}

While both of those improvements offer equal performance, there is an obvious advantage to the former.  The pass-by-reference function signature is reusable, as in all of the previous examples.  On the other hand, the return-by-reference function signature has no other common usage that would justify creating a separate function.  Note this improvement applies to use of the dot (.) operator as well as the dot assignment (.=) operator.  Also note the return by reference boost was observed without using the reference operator in the function call, as shown above.
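
As a quick check that the two replacements are interchangeable, the sketch below runs both and compares the results.  The function names are hypothetical, and the return-by-reference version is assumed to return its parameter explicitly.

```php
<?php
// Alternative 1: modify in place, then concatenate the variable.
function appendInPlace(&$input) {
    $input .= 'a';
}

// Alternative 2: modify in place and return the same variable by reference.
function &appendAndReturn(&$input) {
    $input .= 'a';
    return $input;
}

$first = 'x';
appendInPlace($first);
$testOne = 'This is a test' . $first;

$second = 'x';
$testTwo = 'This is a test' . appendAndReturn($second);

// Both approaches build the same string.
var_dump($testOne === $testTwo);
```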

Another concept I had overlooked in developing the patterns was static arguments.  It is unlikely anyone would use a very large string as a static argument, so I did some additional tests using only the short string and the term-paper-size string.  In both cases, when the static argument was passed by value, the performance was identical to assigning the string to a new variable and then passing that variable by value.  The static argument pattern is slower than every other pattern tested in this study.  Therefore, any instance of code in the form …

$mystring = myfunction('This is only a test');

… should always be replaced by a variable and passed by reference …

$mystring = 'This is only a test';
myfunction($mystring);  // Passed by reference
function myfunction(&$input) {
    $input .= 'a';
}

Hypothesis #2 is false.  Although the benefit from references is less with smaller strings, it is always more than a two-fold improvement in speed.  One interesting aspect of the results is that the short string and the forum-post-size string had nearly the same processing times, even though the latter was more than 50 times larger in size.  The term-paper-size string, which was more than two thousand times larger than the short string, needed only about four times as much processing.  In terms of seconds per gigabyte processed, the term paper seems to fall into an efficient sweet spot.  The image file, which was only about 40 times larger than the term paper, took hundreds of times longer to process.
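
The size and time comparisons in the preceding paragraph can be reproduced from the string sizes given in Methods and the Pattern 1 row of the Results table.  The sketch assumes that one kB means 1000 bytes, which the paper does not state; using 1024 would shift the size ratios slightly without changing the conclusion.

```php
<?php
// String sizes in bytes (kB assumed to mean 1000 bytes).
$bytes = array('short' => 12, 'forum' => 621, 'term' => 28 * 1000, 'image' => 1161 * 1000);

// Pattern 1 times, seconds per 100,000 loops, from the Results table.
$seconds = array('short' => 0.261, 'forum' => 0.286, 'term' => 1.006, 'image' => 344.703);

// Each comparison is array(larger string, smaller string).
$comparisons = array(
    array('forum', 'short'),
    array('term', 'short'),
    array('image', 'term'),
);

$sizeRatio = array();
$timeRatio = array();
foreach ($comparisons as $pair) {
    list($big, $small) = $pair;
    $key = "$big/$small";
    $sizeRatio[$key] = $bytes[$big] / $bytes[$small];
    $timeRatio[$key] = $seconds[$big] / $seconds[$small];
    printf("%-12s size x%.1f, time x%.1f\n", $key, $sizeRatio[$key], $timeRatio[$key]);
}
```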

Remember, PHP is not an omniscient scripting engine, and it is not capable of reading your mind.  Variable references are good because they specify the intention to use the same value in more than one place without making a copy of it.  I have established that the benefits of references are greater, in the extreme, compared to the nearly insignificant drawbacks.  Best practices for PHP programming should include the use of references in new code, as well as reviewing mature code for the four patterns that cause unnecessary duplication of values.

References

  1. Function arguments. (n.d.). In PHP Manual. Retrieved from http://php.net/manual/en/functions.arguments.php
  2. Returning references. (n.d.). In PHP Manual. Retrieved from http://www.php.net/manual/en/language.references.return.php
  3. Schlüter, J. (2010, January 10). Do not use PHP references [Web log message]. Retrieved from http://schlueters.de/blog/archives/125-Do-not-use-PHP-references.html

7 Comments

  • Xing says:

    Thank you for this benchmark. After reading this I will revisit my code and benchmark my own modified hotspots.

    Non-Native PHP function calls have huge, and in my opinion, ridiculous penalties, and maybe you just proved why.

  • Paul Dragoonis says:

    Keeping this short;

    References are yes marginally quicker than by-value approaches but they should be used very cautiously. They’re bug-prone and going to catch you out sooner or later.

    In PHP, you don’t need to worry too much about little details of speed like this, it’s already a slow language being interpreted and all.

    Using lots of references is going to do a lot of harm to your PHP project, way beyond the point where you don’t even care about performance and instead trying to clean up your source of all the references.

    Thanks for the blog post, but I think advocating the use of references isn’t a positive thing for PHP.

    • miqrogroove says:

      I feel I should thank you for commenting because this so clearly illustrates the original need for the research.

      Anyone who writes PHP code worth paying for will eventually have to work with files and large strings. Being ignorant of the variable referencing syntax will doom that code to using twice as much memory and hundreds of times the CPU delay. I could not accept a program that only handles up to a 16 MB file using a 32 MB memory limit, for example. That is sloppy work.

      For small variables and situations resigned to using “a slow language,” fine let it be slow and hog memory. That’s not what this paper was about at all. I don’t mind that the PHP community discourages the use of references, but the lack of documentation about the feature does bother me.

  • nikic says:

    @Robert: No, no, no and no again. I think you completely misunderstood why it is a bad idea to use references as performance optimizations. The main problem is that you can optimize only for a very particular setup. If you change your script just a single bit the result can be inverted.

    By the way I don’t quite understand what you mean by “lack of documentation”. How refcount and is_ref on zvals works is well documented in many places. The mere fact that you did not use any of these terms clearly shows that you don’t really understand what you are writing about here. For example you largely fail to consider zvals with refcount>1 and is_ref=0 before passing by ref. In this case the zval will be separated regardless of whether or not it was actually changed. So the simple act of writing $foo = $bar; before passing $bar to a by ref function can trigger a zval separation. Or considering the other way around, passing an is_ref=1 zval to a non-ref function would trigger a separation. And that’s the main problem: A trivial change in the logic can break all your “optimizations”.

    Oh, and apart from the technical misconceptions there is also a much simpler reason why this is irrelevant: I seriously doubt that copying a string (unless done 100k times like in your benchmarks) will have any significance for execution time. A more realistic benchmark to do would be to read the image file and let it be copied once. In that case you would see that File IO takes up much more time than the simple string copy.

    • miqrogroove says:

      As already mentioned, I have no interest in the internal workings of PHP. I mean to address results only. This zval/refcount stuff is overly complex for the sake of understanding how a simple function call is supposed to behave. So, if you can demonstrate additional code patterns that run faster in variable writing situations without references, I would enjoy knowing about them. Until then, the heart of the matter is that certain code patterns do force PHP to make a copy of a variable, which might not be the programmer’s desired outcome.

      We are in agreement that the string copying, unless done many times repeatedly, requires only a fraction of a second. While it would also be possible to test concerns relating to the doubling of operational memory requirements, such tests would be relevant to large strings only. That is a more focused topic that could be the subject of yet another paper.

  • Jan Mark Salarda says:

    Thanks for this article. It helps me with my project which need to copy data strings from a large xml file to the database.
    This is helpful to me since I will be copying/passing string data to function at least 5000 times per file.
    Before I read this article I always got memory allocation errors. Even I used native MYSQL functions to insert strings.

    Lot of thanks to this article…
    Thanks for the research.

  • CanadaToo says:

    Great article… not everyone using PHP is a script kiddie.

    Having been in the work world for a long time I can’t fathom why people make so many assumptions about what languages are used, why they are used, and what actually constitutes value for effort.
