Yet another post about the Battlesnake project I've been working on while diving into Crystal lang.
This is a quick continuation of yesterday's post about storing data in a DB. In it I mentioned why I took a performance hit, and the solution I implemented was to persist the data to the DB from a background job. The idea is that enqueuing jobs to Redis should perform better than waiting for a DB write.
Again, FYI the code is on GitHub.
Mosquito
The Crystal Sidekiq port was a tempting option for a background task runner, but I went with mosquito-cr/mosquito. My initializer looks like this:
require "mosquito"
Mosquito.configure do |settings|
settings.redis_url = (ENV["REDIS_URL"]? || "redis://localhost:6379")
end
Since initializers are required by src/app.cr (more about this in a previous post), I can now enqueue the job where I once persisted directly to the DB. Snippets of all of this below:
# Replaces `Turn.create(...)`
PersistTurnJob.new(
  path: env.request.path,
  context_json: env.params.json.to_json
).enqueue
# src/jobs/persist_turn_job.cr
require "mosquito"

class PersistTurnJob < ApplicationJob
  params(path : String, context_json : String)

  def trace_perform
    context = BattleSnake::Context.from_json(context_json)
    # The snake is dead if it's no longer on the board
    dead = context.board.snakes.find { |s| s.id == context.you.id }.nil?

    Turn.create(
      game_id: context.game.id,
      snake_id: context.you.id,
      context: context_json,
      path: path,
      dead: dead
    )
  end
end
In order to have OpenTelemetry tracing of the jobs, they inherit from my ApplicationJob.
# src/jobs/application_job.cr
require "mosquito"

# Base class for jobs in the app. Overrides `perform` so that jobs can
# implement `trace_perform` instead. This allows for OpenTelemetry tracing
# when available; otherwise the job runs just as it would if it overrode
# `perform` (the mosquito standard). If a job does override `perform`,
# there is no behavior change either, other than tracing not taking place.
abstract class ApplicationJob < Mosquito::QueuedJob
  def perform
    if ENV["HONEYCOMB_API_KEY"]?.presence
      OpenTelemetry.trace self.class.to_s do |span|
        span.kind = :internal
        trace_perform
      end
    else
      trace_perform
    end
  end
end
Worker
Until now the app produced only one executable, the API server. The worker that runs these background jobs is the src/worker.cr file and looks like this:
require "./battle_snake/**"
require "./strategy/**"
require "dotenv"
Dotenv.load if File.exists?(".env")
require "./initializers/**"
require "./models/**"
require "./jobs/**"
Mosquito::Runner.start
It requires all dependencies and starts the mosquito runner, but how do I work on it locally?
I have a sam.cr task in place that helps me with local development. I tweaked make sam dev to spin up two Sentry runners, and it continues to work seamlessly with livereload (livecompile?) for both development executables.
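I won't reproduce the full task here, but a minimal sketch of a Sam task spinning up two Sentry watchers could look like the following. The sentry binary path and flags are assumptions for illustration (they depend on how Sentry is set up in the repo), not the exact task I use:

require "sam"

desc "Run the API server and the worker with live recompilation (sketch)"
task "dev" do
  channel = Channel(Nil).new

  # Hypothetical sentry invocations; adjust the binary path and flags to your setup
  commands = [
    %(./sentry -b "crystal build ./src/app.cr" -r "./app"),
    %(./sentry -b "crystal build ./src/worker.cr" -r "./worker")
  ]

  commands.each do |command|
    spawn do
      system command
      channel.send(nil)
    end
  end

  # Block until both watchers exit
  commands.size.times { channel.receive }
end

Sam.help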
Dockerfile
I use a Docker deployment on DigitalOcean's App Platform (previous post about this), so I need to change the Dockerfile so that both executables are compiled and included there. This allows executing either one from the same image by passing in a command to override the default ENTRYPOINT (reference).
It now looks like this:
# Build image
FROM crystallang/crystal:1.7.2-alpine as builder
WORKDIR /opt
# Cache dependencies
COPY ./shard.yml ./shard.lock /opt/
RUN shards install -v
# Build a binary
COPY . /opt/
RUN crystal build --static --release ./src/app.cr
RUN crystal build --static --release ./src/worker.cr
RUN crystal build --static --release ./src/money_hack.cr
# ===============
# Result image with one layer
FROM alpine:latest
WORKDIR /
COPY --from=builder /opt/app .
COPY --from=builder /opt/worker .
COPY --from=builder /opt/money_hack .
ENTRYPOINT ["./money_hack"]
It's now compiling src/app.cr, src/worker.cr, and... src/money_hack.cr? Well, I already mentioned I'm not trying to spend more money than needed on this project.
My solution was to run both the server and the worker on the same node/droplet. There are many disadvantages to this, so I don't recommend it as a rule of thumb (hence the name src/money_hack.cr). Managing this with Kemal/Mosquito configs, or scaling each one horizontally as needed, are likely better solutions.
#! /usr/bin/env crystal
#
# Runs both the server and worker executables in separate fibers to avoid
# independent deployments. Motivation is saving costs.
#
channel = Channel(Nil).new

spawns = ["./app -p 8080", "./worker"].map do |command|
  spawn do
    res = system command
    channel.send(nil) unless res
  end
end

channel.receive
Resource consumption & performance
First of all, there are now two executables running on a droplet that used to run only the API server. These are the insights of the small instance running 512 MB RAM | 1 vCPU x 1 (last 7 days).
I marked in red the (approximate) regions to explain them better.
Region | Memory average | CPU average | Notes |
---|---|---|---|
1 | 13% | 0%-1% | Pre-data persistence (no DB) |
2 | variable | variable | Development (debug/errors/deployments) |
3 | 13% | 1%-2% | Sync DB persistence (prev post) |
4 | variable | variable | Development (debug/errors/deployments) |
5 | 16% | 1%-2% | Background job persistence |
In region 1 there's ~13% memory usage, and that remains unchanged during region 3. With the worker processing jobs, region 5 did consume ~16% memory, which is very little difference IMO. CPU usage in region 1 was 0%, with 1% at times (5 min granularity). With both persistence implementations, regions 3 and 5 bumped up to 1%-2%.
All of this tells me that we could (in theory) refactor money_hack.cr to run many workers in parallel before getting close to maxing out the hardware capabilities. Definitely not necessary for now, especially since the bottleneck is the DB/Redis hardware. Just a funny situation to think about.
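Purely as an illustration of that thought, a variation of money_hack.cr could read a hypothetical WORKER_COUNT environment variable (not something the app reads today) and spawn that many worker processes alongside the server:

#! /usr/bin/env crystal
#
# Hypothetical variation of money_hack.cr: run the server plus N workers
# on the same node. WORKER_COUNT is an assumed env var for this sketch.
channel = Channel(Nil).new

worker_count = (ENV["WORKER_COUNT"]? || "1").to_i
commands = ["./app -p 8080"] + Array.new(worker_count) { "./worker" }

commands.each do |command|
  spawn do
    res = system command
    channel.send(nil) unless res
  end
end

# Exit as soon as any of the processes fails
channel.receive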
Telemetry comparison
The COUNT, P50, P95, and P99 of traces across the last few days are below.
BEFORE DB PERSISTENCE
SYNC DB PERSISTENCE
ASYNC WORKER DB PERSISTENCE
That's great! ~2x better in P50, ~2.5x better in P95, and ~5x better in P99. Nothing compares to raw Crystal code execution with sub-millisecond response times, but this isn't bad at all considering we have everything integrated in the project.
Below is the telemetry on the job execution as well. An alternative would be aggregating/querying logs, I guess, but with the existing integration this is an easy way to measure the jobs too.
It makes sense to see the jobs perform similarly to the sync DB persistence telemetry data. It reassures the assumption (a widely known/used approach) that DB requests are slower than Redis in this scenario.
Conclusions
The app is starting to look sturdy to me, as in there are a few moving pieces now and it performs nicely (fast and reliable). It's also a joy to work with Crystal code.
I have one more post in mind for the series, at least for a while, so I'm excited about sharing that sometime soon.
Pura vida.