High-Performance .NET CRON Jobs

James Hickey - Jul 16 - Dev Community

CRON jobs are a staple for many software systems:

  • Nightly reporting jobs
  • Processing a backlog of queued long-running work
  • Kicking off database clean-up scripts
  • Periodic calculations of customer subscription renewals and payments
  • etc.

See a recent article by Slack, for example, where they talk about their own CRON setup:

Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While generally these cron scripts executed as expected, over time the reliability of their execution has occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome.
https://slack.engineering/executing-cron-scripts-reliably-at-scale/

In the .NET ecosystem, there are a few great libraries for scheduling or queuing background work. I created Coravel as an easy way to build .NET applications with more advanced web application features. But it's mostly known as a background job scheduling library.

I thought it would be fun to play around with the idea of building a basic CRON job system and progressively building it into a more high-performance CRON job processing system.

We'll start by learning how to use Coravel in a simple scenario. Then, we'll further configure and leverage Coravel's features to squeeze more performance out of a single .NET process. Finally, you'll learn a few advanced techniques to build a high-performance background job processing system.

Note: I'm using a copy of the Wide World Importers database, running in a SQL Server docker container, as the sample data for this exercise.

Code/Repository

You can see the sample code repository on GitHub.

Building A Basic CRON Job Process

First, I've installed some packages like Coravel, Dapper and the usual stuff to get a basic .NET console application up and running.

Here's Program.cs from my project named Basic:

using Basic;
using Coravel;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddScheduler();
builder.Services.AddTransient<string>(p => "Server=127.0.0.1,1433;Database=WideWorldImporters-Standard;User Id=sa;Password=P@assword;TrustServerCertificate=True;");
builder.Services.AddTransient<ProcessAllOrdersInvocable>();

var host = builder.Build();

host.Services.UseScheduler(scheduler =>
{
    scheduler.Schedule<ProcessAllOrdersInvocable>()
        .EverySeconds(5)
        .PreventOverlapping(nameof(ProcessAllOrdersInvocable));
});

await host.RunAsync();

We're scheduling one job, ProcessAllOrdersInvocable, to run every 5 seconds. If another instance of this job is running at a given due time, we skip that run. This way it doesn't step all over its own feet 😅.

Here's the Coravel invocable ProcessAllOrdersInvocable:

using System.Diagnostics;
using Coravel.Invocable;
using Dapper;
using Microsoft.Data.SqlClient;

namespace Basic;

public class ProcessAllOrdersInvocable : IInvocable
{
    private string connectionString;

    public ProcessAllOrdersInvocable(string connectionString)
    {
        this.connectionString = connectionString;
    }

    public async Task Invoke()
    {
        var lastIdProcessed = 0;
        var watch = new Stopwatch();
        watch.Start();

        await using var connection = new SqlConnection(this.connectionString);

        while (true)
        {          
            var items = (await connection.QueryAsync<(int OrderId, string OrderDate)>
                (SQL, new { LastIdProcessed = lastIdProcessed })).AsList();

            if (!items.Any())
            {
                break;
            }

            var tasks = new List<Task>(items.Count);
            foreach (var item in items)
            {
                tasks.Add(SimulateProcessOrderAsync(item));
            }

            await Task.WhenAll(tasks);

            lastIdProcessed = items.Last().OrderId;
        }

        watch.Stop();
        Console.WriteLine($"### {nameof(ProcessAllOrdersInvocable)} took {watch.ElapsedMilliseconds} ms");
    }

    private static async Task SimulateProcessOrderAsync(object order)
    {
        await Task.Delay(10);
    }

    private const string SQL = @"
SELECT TOP 100
    *
FROM Sales.Orders 
WHERE 
    OrderID > @LastIdProcessed
ORDER BY OrderID";
}

About this job:

  1. Loads all orders from the database into memory in chunks of 100
  2. For each item in a chunk/batch, it does some fake processing that takes 10 milliseconds
  3. The code stores each Task from the processing method and awaits them all at the end for increased performance

Note: You might have noticed that the SQL query I'm executing is grabbing data from all the columns. This is to perform a more realistic query, even though our code doesn't use any column except OrderId.

Executing this job once on my machine (Ryzen 7 4800) takes about 12 seconds when profiling.

Profiling the memory usage, we see about 32.9 MB:

[Image: memory usage profile]

Increasing Batch Sizes

12 seconds is too long! We need this to be within our SLA of 5 seconds to process all pending orders (we're a busy business).

The next step we can take is to increase our batch size from 100 to, let's say, 5000 👀. This should reduce the number of database calls we need to make.
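
For example, here's a minimal sketch of that change, assuming we parameterize the batch size instead of hard-coding TOP 100 (the sample repository may simply swap the literal value):

// Hypothetical tweak: make the batch size a constant we can tune.
private const int BatchSize = 5000;

private const string SQL = @"
SELECT TOP (@BatchSize)
    *
FROM Sales.Orders
WHERE
    OrderID > @LastIdProcessed
ORDER BY OrderID";

// ...and pass it to Dapper along with the cursor:
// var items = (await connection.QueryAsync<(int OrderId, string OrderDate)>(
//     SQL, new { BatchSize, LastIdProcessed = lastIdProcessed })).AsList();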

Doing this brings our processing time down to about 2 seconds.

In the real world, if we're sending 5000 emails at the same time over the network, a few "bad" things can occur:

  • We start getting an increase in overall latency in our network
  • Our email provider starts rate-limiting our process

However, our (fake) email provider is pretty super so we don't get any issues here.

The results:

  • Processing time is between 500 and 1000 ms
  • RAM usage is up to about 45MB

Adding More CRON Jobs

Over time more CRON jobs have been added to the application.

host.Services.UseScheduler(scheduler =>
{
    scheduler.Schedule<ProcessAllOrdersInvocable>()
        .EverySeconds(5)
        .PreventOverlapping(nameof(ProcessAllOrdersInvocable));

    scheduler.Schedule<ProcessAllCitiesInvocable>()
        .EverySeconds(5)
        .PreventOverlapping(nameof(ProcessAllCitiesInvocable));

    scheduler.Schedule<ProcessAllInvoicesInvocable>()
        .EverySeconds(5)
        .PreventOverlapping(nameof(ProcessAllInvoicesInvocable));

    scheduler.Schedule<ProcessAllStockItemTransactionsInvocable>()
        .EverySeconds(5)
        .PreventOverlapping(nameof(ProcessAllStockItemTransactionsInvocable));
});
Enter fullscreen mode Exit fullscreen mode

Some of these are hitting bigger database tables now:

[Image: the bigger database tables]

Let's try running all the CRON jobs, each with a batch size of 5000 items, assuming each record takes 10 ms to "process" (whether that's sending emails or whatever).


Not bad. But, I'm running on a powerful laptop. What if we ran this in a docker container with limited resources?

Real-World Scenario: Running In Limited Docker Containers

I created a basic Dockerfile:

FROM mcr.microsoft.com/dotnet/runtime:6.0 AS base
WORKDIR /app

FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY ["Basic/Basic.csproj", "Basic/"]
RUN dotnet restore "Basic/Basic.csproj"
COPY . .
WORKDIR "/src/Basic"
RUN dotnet build "Basic.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "Basic.csproj" -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "Basic.dll"]

I've configured my application to run with run options --memory="200m" --cpus="1". This is probably a more realistic scenario.

To enforce this, I created a docker-compose.yaml file to set resource limits:

version: "3.3"
services:
  dotnet:
    build:
      context: ..
      dockerfile: ./Basic/Dockerfile
    deploy:
      resources:
        limits:
          cpus: "1"
          memory: 200M

Note: My connection string also needs to reference "host.docker.internal" instead of "localhost" now.
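
For example (assuming the same credentials as the connection string we registered earlier):

builder.Services.AddTransient<string>(p => "Server=host.docker.internal,1433;Database=WideWorldImporters-Standard;User Id=sa;Password=P@assword;TrustServerCertificate=True;");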

So... what are the results?

In Coravel's default configuration, jobs are executed one after the other. So the total time it takes Coravel to process these 4 jobs is the sum of all the logged values: over 6 seconds.

This isn't meeting our SLA!

Coravel Schedule Workers

The reason why Coravel runs CRON jobs one by one is so that it doesn't hog extra threads. In Web applications, this is a really good thing!

Coravel has a feature to dedicate more threads in order to isolate and parallelize jobs. This is called Schedule Workers:

scheduler.Schedule<ProcessAllOrdersInvocable>()
    .EverySeconds(5);

scheduler.Schedule<ProcessAllCitiesInvocable>()
    .EverySeconds(5);

// Dedicating a separate thread for this job.
scheduler.OnWorker(nameof(ProcessAllStockItemTransactionsInvocable));
scheduler.Schedule<ProcessAllStockItemTransactionsInvocable>()
    .EverySeconds(5);

// Dedicating a separate thread for this job too.
scheduler.OnWorker(nameof(ProcessAllInvoicesInvocable));
scheduler.Schedule<ProcessAllInvoicesInvocable>()
    .EverySeconds(5);

Typically, schedule workers are useful for isolating longer-running tasks so they don't hold up other shorter-running tasks. But does it help here, where we've given some of these jobs their own dedicated threads?

Keeping in mind that most of these jobs are running in parallel, the total time it took to run the 4 jobs is determined by the job that took the longest: ~3.7 seconds.

However, we're seeing the time it takes individual jobs actually go up a bit. It might be that the overhead of managing multiple threads is being throttled by our container's measly 1 CPU allocation.

More Power!

Bumping the configuration to 2 CPUs gives us:

The first 3 jobs are definitely running faster. However, that last job isn't. It might be limited by database resourcing, or it still might not have enough CPU power.

To verify, let's bump the CPU limit to 16 (my machine's total logical CPU count):

No material change here.

If we weren't okay with these results, the next steps might be to play around with the code of ProcessAllInvoicesInvocable: changing how large the batches of data it fetches from the database are, verifying and optimizing database indexing, etc.

Distributed Processing

We need more processing power. But, our organization only allows us to configure up to 1 docker CPU per running container! (Pesky governance!)

What if we could split the processing of scheduled jobs across multiple running processes though?

There's a great library for using distributed locks called DistributedLock. Go figure.

There's a specific package, DistributedLock.SqlServer, that supports locking with SQL Server; we'll start with that.

Our invocables now look like this:

// using Medallion.Threading.SqlServer; (from the DistributedLock.SqlServer package)

public async Task Invoke()
{
    // The lock name scopes the lock to this job, across every running instance.
    var @lock = new SqlDistributedLock(nameof(ProcessAllCitiesInvocable), this.connectionString);

    // TryAcquireAsync returns null if another instance already holds the lock.
    await using var handle = await @lock.TryAcquireAsync();
    if (handle != null)
    {
        // do stuff
    }
}

I also changed docker-compose.yaml to run 3 instances of this project. Here are the results, where you can see that jobs aren't overlapping and are handled by different instances:

The benefit of this approach, much like using Coravel's schedule workers, is that if one job is taking too long to process it won't cause other jobs to wait for it. Another instance/process will pick up those other jobs.

It also looks like we have a bottleneck with ProcessAllInvoices.

This bottleneck on my machine could be due to disk speed, memory speed, CPU, or something else. This might just be another reminder that the database is such an important piece of your performance profile!

A Potential Issue

There's one issue to consider though. If you absolutely need to ensure that a specific job only runs once every 5 seconds, this approach won't work. Look at the screenshot above closely to see this - some jobs are executed multiple times within a span of 5 seconds even though different instances are running the job.

  1. Let's say your job X takes 500 ms to run. On Instance 1 it runs at 12:00:00 p.m.
  2. Instance 2 also sees that job X is due at the same time. So instance 2 starts to run a batch of all the jobs that are due.
  3. Instance 2 runs job Y first, which takes about 600 ms.
  4. Then, at 12:00:00.600 p.m. (600 ms after job X started on instance 1), it tries to get a lock on job X.
  5. It gets the lock, since job X on instance 1 finished running 100 ms ago.

If you don't need this kind of exact timing guarantee within a small time span of work, then this distributed lock approach might work.

Decoupling Distributed CRON Jobs

The way that we've architected our CRON jobs might work for most teams. But, one of the other downsides is that we've coupled our scheduling logic with our job logic. Sometimes that's okay.

It also has the concurrency issue mentioned in the section above.

If we wanted to scale this solution so that other teams in our organization could use it, it wouldn't work. Other teams shouldn't be able to add their own invocable/job logic in our code. Our code would become a dumping ground!

Separate Scheduling Logic From Job Logic

The next step is to separate the scheduling logic from the CRON job logic. If we were using Coravel, then we'd make Coravel push an asynchronous message to a message broker.

We'd have to determine what the lowest interval or "tick" is. Every second? Every 5 seconds? Every minute (like normal CRON in an OS)?

Coravel would send a message every interval X with the exact time that it ran.

Every team/consumer would get a message each interval and decide, based on the time in that message, whether to execute their job logic. But each team now owns that code.

This approach could also help with the issue we saw where different instances executed the same job multiple times. Now, "ticks" are only triggered once across all the distributed systems, and we could leverage the messaging technology to send each tick to only one consumer in a round-robin fashion (or some other load-balancing technique).
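
As a rough sketch of the idea (this isn't from the sample repository): the only thing Coravel schedules is a tiny invocable that publishes a tick, where IMessagePublisher is a hypothetical stand-in for whatever broker client you'd actually use (RabbitMQ, Azure Service Bus, etc.):

using Coravel.Invocable;

// Hypothetical broker abstraction - swap in your real messaging client.
public interface IMessagePublisher
{
    Task PublishAsync(string topic, object message);
}

public class PublishScheduleTickInvocable : IInvocable
{
    private readonly IMessagePublisher publisher;

    public PublishScheduleTickInvocable(IMessagePublisher publisher)
        => this.publisher = publisher;

    // Publish the time of this tick; each consuming team decides which of
    // its own jobs are due based on that timestamp.
    public Task Invoke() =>
        this.publisher.PublishAsync("schedule.tick", new { OccurredAtUtc = DateTimeOffset.UtcNow });
}

Scheduling it would look just like the earlier examples, e.g. scheduler.Schedule<PublishScheduleTickInvocable>().EverySeconds(5);.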

An Alternate Way To Measure Performance

Let's go back to the beginning. Our assumption was that our entire batch of jobs had to run within 5 seconds.

Another way to measure the performance of the 3 approaches we looked at (serial, multi-threaded, and multi-instance) is "database records processed per second".

What if our SLA wasn't "process 1 batch in 5 seconds" but "process X records every Y seconds"?

So, I made the changes needed. For example, here's the extra job to output the value:

scheduler.Schedule(() =>
    {
        Console.WriteLine("### Total records processed: " + TotalRecordsProcessed.Value);
    })
    .EverySecond();

Each job adds the number of records it processed to a shared counter using Interlocked.
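
A minimal sketch of what that counter could look like (the repository's actual implementation may differ):

using System.Threading;

public static class TotalRecordsProcessed
{
    private static long total;

    // Read via Interlocked so we always observe the latest value across threads.
    public static long Value => Interlocked.Read(ref total);

    // Each job calls this with the size of the batch it just finished processing.
    public static void Add(int recordsProcessed) => Interlocked.Add(ref total, recordsProcessed);
}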

I ran these using the same docker-compose files created across each project with a CPU limit of "2" for each instance.

Basic Scheduler Results

### Total records processed: 10000
### Total records processed: 30000
### Total records processed: 50000
### Total records processed: 85510
### Total records processed: 415772
### Total records processed: 438712
### Total records processed: 453712
### Total records processed: 478712
### Total records processed: 574222
### Total records processed: 847424

We have an average of 84,742 records per second (847,424 records / 10 seconds).

Schedule Workers Results

### Total records processed: 80000
### Total records processed: 273595
### Total records processed: 281535
### Total records processed: 393202
### Total records processed: 413202
### Total records processed: 637307
### Total records processed: 685247
### Total records processed: 801914
### Total records processed: 821914
### Total records processed: 837424

We have an average of 83,742 records per second (837,424 records / 10 seconds).

Distributed Instances

For these results, we have to sum 10 seconds of activity across all 3 nodes.

distributed-dotnet3-1  | ### Total records processed: 15000
distributed-dotnet1-1  | ### Total records processed: 20000
distributed-dotnet2-1  | ### Total records processed: 15000
distributed-dotnet3-1  | ### Total records processed: 92940
distributed-dotnet1-1  | ### Total records processed: 93595
distributed-dotnet1-1  | ### Total records processed: 111535
distributed-dotnet2-1  | ### Total records processed: 30000
distributed-dotnet1-1  | ### Total records processed: 111535
distributed-dotnet3-1  | ### Total records processed: 227940
distributed-dotnet2-1  | ### Total records processed: 50000
distributed-dotnet3-1  | ### Total records processed: 334607
distributed-dotnet1-1  | ### Total records processed: 111535
distributed-dotnet3-1  | ### Total records processed: 348202
distributed-dotnet2-1  | ### Total records processed: 95510
distributed-dotnet2-1  | ### Total records processed: 108450
distributed-dotnet2-1  | ### Total records processed: 108450
distributed-dotnet3-1  | ### Total records processed: 448202
distributed-dotnet1-1  | ### Total records processed: 195130
distributed-dotnet1-1  | ### Total records processed: 223070
distributed-dotnet2-1  | ### Total records processed: 108450
distributed-dotnet1-1  | ### Total records processed: 228070
distributed-dotnet3-1  | ### Total records processed: 548202
distributed-dotnet2-1  | ### Total records processed: 108450
distributed-dotnet1-1  | ### Total records processed: 243070
distributed-dotnet3-1  | ### Total records processed: 654869
distributed-dotnet3-1  | ### Total records processed: 658464
distributed-dotnet2-1  | ### Total records processed: 108450
distributed-dotnet3-1  | ### Total records processed: 658464
distributed-dotnet1-1  | ### Total records processed: 263070
distributed-dotnet2-1  | ### Total records processed: 146390

We have an average of 106,810 records per second.

Cool! But is that because we are actually running more tasks than we should? (Remember the locking issue? The answer turns out to be "yes" 😅)

I tried again by changing the code so that each instance explicitly runs only 1 or 2 jobs (i.e. no sharing of jobs across instances):

distributed-dotnet1-1  | ### Total records processed: 0
distributed-dotnet2-1  | ### Total records processed: 45000
distributed-dotnet3-1  | ### Total records processed: 5000
distributed-dotnet1-1  | ### Total records processed: 20000
distributed-dotnet2-1  | ### Total records processed: 155000
distributed-dotnet3-1  | ### Total records processed: 15000
distributed-dotnet1-1  | ### Total records processed: 87940
distributed-dotnet2-1  | ### Total records processed: 236667
distributed-dotnet3-1  | ### Total records processed: 30000
distributed-dotnet1-1  | ### Total records processed: 111535
distributed-dotnet2-1  | ### Total records processed: 236667
distributed-dotnet3-1  | ### Total records processed: 45000
distributed-dotnet1-1  | ### Total records processed: 111535
distributed-dotnet2-1  | ### Total records processed: 236667
distributed-dotnet3-1  | ### Total records processed: 60000
distributed-dotnet1-1  | ### Total records processed: 111535
distributed-dotnet2-1  | ### Total records processed: 351667
distributed-dotnet3-1  | ### Total records processed: 70510
distributed-dotnet1-1  | ### Total records processed: 214475
distributed-dotnet2-1  | ### Total records processed: 473334
distributed-dotnet3-1  | ### Total records processed: 70510
distributed-dotnet1-1  | ### Total records processed: 223070
distributed-dotnet2-1  | ### Total records processed: 473334
distributed-dotnet3-1  | ### Total records processed: 70510
distributed-dotnet1-1  | ### Total records processed: 223070
distributed-dotnet2-1  | ### Total records processed: 473334
distributed-dotnet3-1  | ### Total records processed: 70510
distributed-dotnet1-1  | ### Total records processed: 223070
distributed-dotnet2-1  | ### Total records processed: 473334
distributed-dotnet3-1  | ### Total records processed: 70510

That's roughly 76,690 records per second.

While not super scientific, a single .NET instance using Coravel performed just about as well as multiple instances.

Keep in mind that this measures overall throughput across all of our processing, versus the earlier approach of measuring whether an entire batch runs within 5 seconds.

Conclusion

The conclusions of our not-super-scientific-but-fun experiment are:

  • Changing the size of batches that you fetch from the database can dramatically improve the performance of an individual CRON job 💰
  • Even though we didn't 100% verify (in my experience this is often true): database performance is often a bottleneck 🍾
  • 1 process using Coravel to run your CRON jobs is pretty efficient compared to running multiple processes 🚀
  • Only distribute work if you know it's going to help as it can introduce unseen issues. Make sure you understand what's going on. Distributed systems are hard 🙄
  • For high-performance CRON processing across teams, you should decouple scheduling logic from job logic 💻

Check out the repository with code that you can play around with. Give Coravel a try in your own projects if you haven't yet!
