You may have heard of "The One Billion Row Challenge" (1brc), and in case you haven't: go check out Gunnar Morling's 1brc repo.
I got sucked in because two of my colleagues entered the competition and are on the leaderboard.
PHP is not known for its speed, but as I am working on the PHP profiler, I thought I'd give it a shot and see how fast it can get.
A first naive approach
I cloned the repository and created the billion row dataset in measurements.txt. After that I started building my first naive implementation of something that could solve the challenge:
<?php

$stations = [];
$fp = fopen('measurements.txt', 'r');
while ($data = fgetcsv($fp, null, ';')) {
    if (!isset($stations[$data[0]])) {
        $stations[$data[0]] = [
            $data[1],
            $data[1],
            $data[1],
            1
        ];
    } else {
        $stations[$data[0]][3]++;
        $stations[$data[0]][2] += $data[1];
        if ($data[1] < $stations[$data[0]][0]) {
            $stations[$data[0]][0] = $data[1];
        }
        if ($data[1] > $stations[$data[0]][1]) {
            $stations[$data[0]][1] = $data[1];
        }
    }
}

ksort($stations);

echo '{';
foreach ($stations as $k => &$station) {
    $station[2] = $station[2] / $station[3];
    echo $k, '=', $station[0], '/', $station[2], '/', $station[1], ', ';
}
echo '}';
There is nothing wild in here: just open the file and use fgetcsv() to read the data. If the station is not found yet, create it; otherwise increment the counter, sum up the temperature, check whether the current temperature is lower than the minimum or higher than the maximum, and update accordingly.
Once I have everything together, I use ksort() to bring the $stations array into order and then echo out the list, calculating the average temperature while doing so (sum / count).
Running this simple code on my laptop takes 25 minutes 🤯
Time to optimise and have a look at the profiler:
The timeline visualisation helps me see that this is clearly CPU bound: file compilation at the beginning of the script is negligible and there are no garbage collection events.
The flame graph view is helpful as well, showing that I am spending 46% of CPU time in fgetcsv().
fgets() instead of fgetcsv()
The first optimisation was to use fgets() to get a line and split on the ; character manually instead of relying on fgetcsv(), because fgetcsv() does a whole lot more than what I need.
// ...
while ($data = fgets($fp, 999)) {
    $pos = strpos($data, ';');
    $city = substr($data, 0, $pos);
    $temp = substr($data, $pos + 1, -1);
    // ...
Additionally, I refactored $data[0] to $city and $data[1] to $temp everywhere.
Running the script again with just this one change already brought the runtime down to 19m 49s. In absolute numbers this is still a lot, but also: it's 21% down!
The flame graph reflects the change; switching to showing the CPU time by line also reveals what is going on in the root frame:
I am spending ~38% CPU time in lines 18 and 23, which are:
18 | $stations[$city][3] ++;
| // ...
23 | if ($temp > $stations[$city][1]) {
Line 18 is the first access to the $stations array in the loop (otherwise it is only an increment), and line 23 is a comparison. Nothing that seems expensive at first glance, but let's do a few more optimisations and you'll see what is taking time here.
Using a reference where possible
$station = &$stations[$city];
$station[3]++;
$station[2] += $temp;

// instead of

$stations[$city][3]++;
$stations[$city][2] += $temp;
This should help PHP to not search for the key in the $stations array on every array access; think of it as a cache for accessing the "current" station in the array.
And it actually helps: running this only takes 17m 48s, another 10% down!
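To make the effect of the reference visible in isolation, here is a minimal sketch (the station name and values are made up): writes through $station land in the original array without looking up the key again.

```php
<?php
// Minimal sketch: $station aliases the array slot for 'Berlin',
// so the updates below don't repeat the hash lookup for the key.
$stations = ['Berlin' => [5.0, 5.0, 5.0, 1]]; // [min, max, sum, count]

$station = &$stations['Berlin'];
$station[3]++;        // count
$station[2] += 7.2;   // sum

var_dump($stations['Berlin'][3]); // int(2) — the original array sees the update
```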
Only one comparison
While looking at the code, I stumbled over this piece:
if ($temp < $station[0]) {
    $station[0] = $temp;
}
if ($temp > $station[1]) {
    $station[1] = $temp;
}
In case the temperature is lower than the minimum, it cannot be higher than the maximum anymore, so I made this an elseif to maybe spare some CPU cycles.
BTW: I don't know anything about the order of temperatures in measurements.txt, but depending on that order it could make a difference whether I check the one or the other first.
The new version takes 17m 30s, which is another ~2% down. Better than just jitter, but not by a lot.
Adding type casts
PHP is known as a dynamic language, and that is something I valued a lot when I first got started writing software: one less problem to care about. On the other hand, knowing the types helps the engine make better decisions when running your code.
$temp = (float)substr($data, $pos+1, -1);
Guess what? This simple cast makes the script run in just 13m 32s which is a whopping performance increase of 21%!
18 | $station = &$stations[$city];
| // ...
23 | } elseif ($temp > $station[1]) {
Line 18 still shows up with 11% of CPU time spent, which is the access to the array (finding the key in the hash map, which is the underlying data structure used for associative arrays in PHP).
Line 23's CPU time dropped from ~32% to ~15%. This is due to PHP not doing type juggling anymore. Before the type cast, $temp / $station[0] / $station[1] were strings, so PHP had to convert them to float on every single comparison.
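The juggling can be seen in a tiny sketch (the values are made up): comparing numeric strings forces PHP to convert both sides to float first, while the explicit cast pays that cost only once.

```php
<?php
// Both comparisons yield the same result, but the string version
// converts '12.3' and '4.5' to float behind the scenes on every
// comparison, while the cast below converts once up front.
$asString = '12.3';             // what substr() hands back
var_dump($asString < '4.5');    // bool(false) — juggled to float each time

$asFloat = (float)$asString;    // explicit cast, done once
var_dump($asFloat < 4.5);       // bool(false) — plain float comparison
```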
What about JIT?
OPcache is disabled by default in CLI and needs the opcache.enable_cli setting to be set to on. JIT (as part of OPcache) is enabled by default, but effectively disabled because the buffer size is set to 0, so I set opcache.jit_buffer_size to something; I just went with 10M. After these changes were applied, I re-ran the script with JIT and saw it finish in:
7m 19s 🚀
which is 45.9% less time spent!!
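For reference, the same settings can also be passed ad hoc on the command line instead of editing php.ini (solve.php is a placeholder for whatever the script file is called):

```shell
# Enable OPcache in CLI and give the JIT a 10M buffer for a one-off run
php -d opcache.enable_cli=1 \
    -d opcache.jit_buffer_size=10M \
    solve.php
```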
What more?
I already brought the runtime down from 25 minutes in the beginning to just ~7 minutes. One thing that I found absolutely astonishing is that fgets() allocates ~56 GiB/m of RAM for reading a 13 GB file. Something seemed off, so I checked the implementation of fgets() and it looks like I can spare a lot of these allocations by just omitting the len argument to fgets():
while ($data = fgets($fp)) {
// instead of
while ($data = fgets($fp, 999)) {
Comparing the profile before and after the change gives this:
You might think that this gives a big performance improvement, but it is just ~1%. This is because these are small allocations which the ZendMM can handle in bins, and those are blazingly fast.
Can we make it even faster?
Yes, we can! So far my approach was single threaded, which is the nature of most PHP software, but PHP does support threads in user land through the parallel extension.
Reading the data in PHP is a bottleneck, as the profiler clearly shows. Switching from fgetcsv() to fgets() with a manual split helps, but this still takes a lot of time, so let's use threads to parallelise reading and processing the data, and afterwards combine the intermediate results from the worker threads.
<?php

$file = 'measurements.txt';

$threads_cnt = 16;

/**
 * Get the chunks that each thread needs to process with start and end position.
 * These positions are aligned to \n chars because we use `fgets()` to read
 * which itself reads till a \n character.
 *
 * @return array<int, array{0: int, 1: int}>
 */
function get_file_chunks(string $file, int $cpu_count): array {
    $size = filesize($file);

    if ($cpu_count == 1) {
        $chunk_size = $size;
    } else {
        $chunk_size = (int) ($size / $cpu_count);
    }

    $fp = fopen($file, 'rb');

    $chunks = [];
    $chunk_start = 0;
    while ($chunk_start < $size) {
        $chunk_end = min($size, $chunk_start + $chunk_size);

        if ($chunk_end < $size) {
            fseek($fp, $chunk_end);
            fgets($fp); // moves fp to next \n char
            $chunk_end = ftell($fp);
        }

        $chunks[] = [
            $chunk_start,
            $chunk_end
        ];

        $chunk_start = $chunk_end;
    }

    fclose($fp);

    return $chunks;
}
/**
 * This function will open the file passed in `$file` and read and process the
 * data from `$chunk_start` to `$chunk_end`.
 *
 * The returned array has the name of the city as the key and an array as the
 * value, containing the min temp in key 0, the max temp in key 1, the sum of
 * all temperatures in key 2 and the count of temperatures in key 3.
 *
 * @return array<string, array{0: float, 1: float, 2: float, 3: int}>
 */
$process_chunk = function (string $file, int $chunk_start, int $chunk_end): array {
    $stations = [];
    $fp = fopen($file, 'rb');
    fseek($fp, $chunk_start);
    while ($data = fgets($fp)) {
        $chunk_start += strlen($data);
        if ($chunk_start > $chunk_end) {
            break;
        }
        $pos2 = strpos($data, ';');
        $city = substr($data, 0, $pos2);
        $temp = (float)substr($data, $pos2 + 1, -1);
        if (isset($stations[$city])) {
            $station = &$stations[$city];
            $station[3]++;
            $station[2] += $temp;
            if ($temp < $station[0]) {
                $station[0] = $temp;
            } elseif ($temp > $station[1]) {
                $station[1] = $temp;
            }
        } else {
            $stations[$city] = [
                $temp,
                $temp,
                $temp,
                1
            ];
        }
    }
    return $stations;
};
$chunks = get_file_chunks($file, $threads_cnt);

$futures = [];

for ($i = 0; $i < $threads_cnt; $i++) {
    $runtime = new \parallel\Runtime();
    $futures[$i] = $runtime->run(
        $process_chunk,
        [
            $file,
            $chunks[$i][0],
            $chunks[$i][1]
        ]
    );
}

$results = [];

for ($i = 0; $i < $threads_cnt; $i++) {
    // `value()` blocks until a result is available, so the main thread waits
    // for the worker thread to finish
    $chunk_result = $futures[$i]->value();
    foreach ($chunk_result as $city => $measurement) {
        if (isset($results[$city])) {
            $result = &$results[$city];
            $result[2] += $measurement[2];
            $result[3] += $measurement[3];
            if ($measurement[0] < $result[0]) {
                $result[0] = $measurement[0];
            }
            if ($measurement[1] > $result[1]) {
                $result[1] = $measurement[1];
            }
        } else {
            $results[$city] = $measurement;
        }
    }
}

ksort($results);

echo '{', PHP_EOL;
foreach ($results as $k => &$station) {
    echo "\t", $k, '=', $station[0], '/', ($station[2] / $station[3]), '/', $station[1], ',', PHP_EOL;
}
echo '}', PHP_EOL;
This code does a few things. First I scan the file and split it into chunks that are \n aligned (since I am using fgets() later). When the chunks are ready, I start $threads_cnt worker threads that all open the same file, seek to their assigned chunk start, and read and process the data till the chunk end, returning an intermediate result that afterwards gets combined, sorted and printed out in the main thread.
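To see the \n alignment concretely, here is the same chunking logic applied to a throwaway four-line file (file contents and chunk count are made up for the demo): every chunk boundary lands right after a newline, so no thread ever starts mid-line.

```php
<?php
// Same chunking approach as get_file_chunks() above, demonstrated on a
// tiny temp file: each chunk boundary must land right after a \n char.
function chunk_file(string $file, int $cpu_count): array {
    $size = filesize($file);
    $chunk_size = (int)($size / $cpu_count);
    $fp = fopen($file, 'rb');
    $chunks = [];
    $chunk_start = 0;
    while ($chunk_start < $size) {
        $chunk_end = min($size, $chunk_start + $chunk_size);
        if ($chunk_end < $size) {
            fseek($fp, $chunk_end);
            fgets($fp); // advance to just past the next \n
            $chunk_end = ftell($fp);
        }
        $chunks[] = [$chunk_start, $chunk_end];
        $chunk_start = $chunk_end;
    }
    fclose($fp);
    return $chunks;
}

$tmp = tempnam(sys_get_temp_dir(), '1brc');
file_put_contents($tmp, "Hamburg;12.0\nBulawayo;8.9\nPalembang;38.8\nReykjavik;15.2\n");
$data = file_get_contents($tmp);
foreach (chunk_file($tmp, 2) as [$start, $end]) {
    // each chunk ends directly after a newline (or at EOF)
    var_dump($data[$end - 1] === "\n");
}
unlink($tmp);
```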
This multithreaded approach finishes in just:
1m 35s 🚀
Is this the end?
Nope, certainly not. There are at least two more things to this solution:
- I am running this code on macOS on Apple Silicon hardware, where PHP crashes when using the JIT in a ZTS build, so the 1m 35s result is without JIT. It might be even faster if I could use it.
- I realised that I was running on a PHP version that was compiled with CFLAGS="-g -O0 ..." due to my needs in my day-to-day work 🤦
I should have checked this in the beginning, so I re-compiled PHP 8.3 with CFLAGS="-Os ..." and my final number (with 16 threads) is:
🚀 27.7 seconds 🚀
This number is by no means comparable to what you can see on the leaderboard of the original challenge, simply because I ran this code on totally different hardware.
This is a timeline view with 10 threads:
The thread at the bottom is the main thread, waiting for the results from the worker threads. Once those workers have returned their intermediate results, the main thread can be seen combining and sorting everything. We can also clearly see that the main thread is by no means the bottleneck. In case you want to optimise this even further, concentrate on the worker threads.
What did I learn on the way?
Every abstraction layer trades usability/integration for CPU cycles or memory. fgetcsv() is super easy to use and hides a lot of stuff, but this comes at a cost. Even fgets() hides some stuff from us, yet makes it super convenient to read the data.
Adding types to your code helps the language either optimise execution or stop type juggling, which is something you do not see but still pay for in CPU cycles.
JIT is awesome, especially when dealing with CPU bound problems!
Multithreading is definitely not the nature of most PHP software, but thanks to parallelisation (using ext-parallel) we could push the numbers down significantly.
The END?
I hope you had as much fun reading this article as I had writing it. In case you want to optimise the code further, feel free to grab it and leave a comment on how far you got.
[Edit to add the following on March, 18 2024]
There is MOAR!! performance to gain
After this blog post was released, some ideas on how to make it faster came up in the comments, including three suggestions from @xjakub.
Remove the isset()
We might not need to check with isset(); we can "just" create the reference to the station, which will be NULL when the station does not exist. This means one array access spared in case the city already exists (which is the majority of cases).
// before
if (isset($stations[$city])) {
    $station = &$stations[$city];
    // ..

// after
$station = &$stations[$city];
if ($station !== NULL) {
    // ..
This shaves off around 2.5% of wall time!
Don't check the fgets() return value
The main loop that reads the file currently looks like this:
while ($data = fgets($fp)) {
    $chunk_start += strlen($data);
    if ($chunk_start > $chunk_end) {
        break;
    }
    // ..
The added check for $chunk_start > $chunk_end came with the move to parallelisation, as every thread only works on a part of the file. Now @xjakub pointed out that there is no need to check the fgets() return value anymore, as it will always return a string as long as we are between $chunk_start and $chunk_end, meaning we can make that range check the loop condition and read unchecked.
while ($chunk_start < $chunk_end) {
    $data = fgets($fp);
    $chunk_start += strlen($data);
    // ..
This change removes a branch from the loop and results in another ~2.7% drop in wall time!
fgets() vs. stream_get_line()
The actual reading and storing into $city and $temp looks as follows:
$data = fgets($fp);
$chunk_start += strlen($data);
$pos2 = strpos($data, ';');
$city = substr($data, 0, $pos2);
$temp = (float)substr($data, $pos2+1, -1);
I had never heard of stream_get_line(), which behaves nearly identically to fgets() except that it allows you to specify the end-of-line delimiter!
$city = stream_get_line($fp, 99, ';');
$temp = stream_get_line($fp, 99, "\n");
$chunk_start += strlen($city) + strlen($temp) + 2;
$temp = (float)$temp;
This change shaved off another ~15% of wall time!
Why is that? The implementations of fgets() and stream_get_line() are really close; both use the PHP stream layer. The major difference is that we no longer need to split a string (from fgets()) into substrings using substr(). The additional strlen() call is negligible, as variables in PHP are zvals under the hood, which already hold the length of the string.
So where are we with the wall time for this PHP script?
@codemunky showed up in the comments and ran the benchmark on the same AX161 from Hetzner that the Java folks used to run their implementations on.
The final result (so far): 12.76 seconds 🎉
The END again?
I don't know, maybe there's still something to optimise here, but after managing to spend ~83% of the wall time in the stream_get_line() function, it looks like we've reached what the PHP stream layer allows.
Either we find another function that bypasses the PHP stream layer and gives more direct access to the filesystem, or we try to optimise the layer itself.
Happy hacking!