TL;DR
- Make as many specs as possible transactional (this can even be done for cucumbers!), especially for files that use shared examples, where there are usually many examples.
- For those specs that actually write to the DB for some reason, try switching from `:truncation` to `:deletion`. In our case, running `postgres`, truncation oftentimes randomly stalled for 2 minutes; deletion sidesteps this. (A config sketch follows below.)
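Here is a minimal sketch of what that can look like, assuming DatabaseCleaner with ActiveRecord; the file path, hook order, and the `js: true` tag are illustrative, not our exact setup:

```ruby
# spec/support/database_cleaner.rb -- illustrative, not our exact config
RSpec.configure do |config|
  config.before(:suite) do
    # One full clean before the suite starts.
    DatabaseCleaner.clean_with(:deletion)
  end

  config.before(:each) do
    # Default: wrap each example in a transaction and roll it back.
    DatabaseCleaner.strategy = :transaction
  end

  config.before(:each, js: true) do
    # Browser-driven specs commit from another DB connection, so they
    # can't be rolled back -- delete rows instead of truncating tables.
    DatabaseCleaner.strategy = :deletion
  end

  config.before(:each) { DatabaseCleaner.start }
  config.after(:each)  { DatabaseCleaner.clean }
end
```

Cucumber can typically be wired up the same way via hooks in `features/support/env.rb`.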
The Story
It somehow came to my attention that there's a huge variance in how long parallel RSpec runners take on CI, sometimes spiking from the 12min average to 20min and more, and routinely exceeding the average by several minutes.
This seemed extremely suspicious because we use Knapsack, which should ensure near-equal finishing times for all runners.
Luckily, Knapsack stores run data, so I was able to identify common offenders and spot the common thread: the spiking specs were writing to the database and then being cleaned up by DatabaseCleaner. The project had a complex DB setup, so I reached for the lowest-hanging fruit: I tried the `:deletion` cleanup strategy instead of `:truncation`, and it worked.
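For anyone wanting to do similar digging: Knapsack's timing report is, by default, a plain JSON hash of spec file path to seconds (commonly `knapsack_rspec_report.json`), so a throwaway script like this hypothetical one can surface the worst offenders:

```ruby
# find_slow_specs.rb -- assumes Knapsack's default report format:
# a JSON hash of { "spec/models/user_spec.rb" => seconds }.
require "json"

report = JSON.parse(File.read("knapsack_rspec_report.json"))

report
  .sort_by { |_file, seconds| -seconds }
  .first(20)
  .each { |file, seconds| puts format("%8.1fs  %s", seconds, file) }
```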
In the image you can see CI runs. Each vertical cluster of dots is the spread of how long each parallel runner took. Ideally we'd like to see very little spread, with the dots as close to 0s as possible.
The Magenta line marks when I merged a rework of some long-running specs from DB-writing to transactional. With the spread still there it's hard to see, but it took at least 60s off the CI run average.
The Green line, however, is why you are here: this is where I merged the change from truncation to deletion. No more random spikes to 20+min runtimes.
Monitoring Is Key
Needless to say, without data on how long runs take, I wouldn't have noticed that something was amiss (beyond occasionally having to wait on CI much longer than usual). Access to quality data and trends can help catch problems before they even arise.