Timing out synchronous functions with regex

SnykSec - Jun 21 '23 - - Dev Community

How hard can it be to support custom container image tags? Turns out… quite! I know this because my team has been busy at work on our new custom base image support for Snyk Container, and we were tasked with the following problem: Given a tag, parse its parts to be able to compare it to other similar tags. It was a fun problem to solve, and we'd love to share how we got to our final solution!

Regular expressions to the regex-scue?

Our idea was to use regex, specifically the named grouping feature. For example, if we want to match this:

1.82.5_2022_x64_final
Enter fullscreen mode Exit fullscreen mode

We could use this regex (apologies in advance to anyone that has the occasional regex nightmare):

/(?\d+)\.(?1d+)\.(?1d+)\_(?\d{4})\_(?.\*)(?.\*)/
Enter fullscreen mode Exit fullscreen mode

At this point some red flags 🚩 may have been raised by you. We all know regexes might be dangerous, and letting the user decide both the input string and the regex is just asking to get attacked!

A user could easily create a base image called:

"docker-exploit:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!"
Enter fullscreen mode Exit fullscreen mode

And give us the string:

/(a+)+$/
Enter fullscreen mode Exit fullscreen mode

This would have caused our service to completely stop processing any other requests! No way any AppSec team would ever let someone do this.

Does that mean we need to go back to the drawing board? Thankfully not! There happens to be a library (regex parsing engine) specifically designed to prevent such attacks. The library is RE2, and is developed by Google. It's important to note that some regex features aren’t supported but they wouldn’t be used anyway. 

Now the fun begins!

We would rather not have our only safeguard be an external dependency, prone to bugs out of our control, and possible software supply chain attacks. Due to this, we wanted a way to time out regexes that take too long.

Since Node is single-threaded, it’s not easy to time out code that can’t run concurrently. Here’s the solution we came up with:

Threads 🧵!

This is probably the first thing that popped into your head — it was ours as well. In addition to being able to kill a thread, we get the added advantage of not blocking the main event loop!

Alright, let’s write some code:

Main thread

// Main thread
let worker = new Worker("./dist/regex-parsing/regex-worker.js");

async function runRegexSafely(tag: string, regex: RE2) {
 const threadedRegexMatch = new Promise((resolve) => {
   worker.once("message", resolve);
   worker.postMessage({ tag, regex }); // Send the string and regex to the worker
 });
 try {
   return await Promise.race([threadedRegexMatch, timeoutReject(timeoutMs)]);
 } catch (e) {
   await worker.terminate();
   worker = new Worker("./dist/regex-parsing/regex-worker.js");
   throw e;
 }
}

Enter fullscreen mode Exit fullscreen mode

Worker script

// Worker script
parentPort.on("message", ({ tag, regex }) => {
 parentPort?.postMessage(tag.match(regex));
});

Enter fullscreen mode Exit fullscreen mode

Pretty simple right? How about I make sure the code runs…


Hmmm, I’ve never seen this error before. Let’s see what MDN says:


And: may not contain native (C++ backed) objects

Turns out objects aren’t passed by reference, but cloned. And since RE2 is a function (class), which is just a binding to a C++ backed object, we can’t even pass it as an argument.

We could just create it inside the thread, and pass the regex as a string. This is another instance where a red flag 🚩 appeared. There might be a non-negligent effect on performance if we need to reconstruct the RE2 object for every match call.

new RE2(/(a+)+$/)
Enter fullscreen mode Exit fullscreen mode

Let’s not speculate, we can actually test it out. We used benchmark.js.

Benchmarking performance

⚠️ It’s not easy to run micro benchmarks like these. V8 is very smart with optimizations, so take these tests with a grain of salt. Here’s a great YouTube video about the topic of microbenchmarking

As a baseline, RE2.match can run 500k times/sec.

Since RegExp objects are clone-able, we can send it to the thread as is instead of RE2 and see what the performance is. Turns out running RegExp.match inside a worker thread (awaiting the result) can run 50k times/sec. A 90% decrease! 🤩

Creating the RE2 object in the thread results in an even worse 17k times/sec. A 97% performance penalty. 😭

Maybe threads aren’t the most optimal solution, and thankfully, there’s another way!

The vm module

Node has a module called vm that allows you to run scripts in a separate context. The advantage of this is that it comes with a built in timeout option.

Here’s some example code:

const vm = require('node:vm');

const context = {
  regex: '(?.\*)\\.(?.\*)',
 tag: 1.2
};

vm.createContext(context);
const script = new vm.Script("regex.exec(tag)");

return script.runInContext(context, {timeout: 1000});

Enter fullscreen mode Exit fullscreen mode

We don’t need to create the script and context every time we want to run a match, so we can just create aclosure and return it.

Final code

function getClosure() {
    const context = {
      regex: '(?.\*)\\.(?.\*)',
 tag: 1.2
 };

 vm.createContext(context);
 const script = new vm.Script("regex.exec(tag)");

 return function(regex, tag, timeout=0) {
 context.regex = regex;
 context.tag = tag;

 return script.runInContext(context, {timeout});
 }
}

const finalMatchFunction = getClosure();

Enter fullscreen mode Exit fullscreen mode

Let's benchmark again

Let’s run that in the benchmark and see the performance. Running the following inside the vm gave a much more acceptable 200k/sec. 👍

finalMatchFunction(RE2("(?.\*)"), "1")
Enter fullscreen mode Exit fullscreen mode

Oh wait! We forgot to add a timeout, so let’s just run the benchmark again with a timeout of 100 milliseconds:

finalMatchFunction(RE2("(?.\*)"), "1", 100)
Enter fullscreen mode Exit fullscreen mode

Shouldn’t have much of an impact since we never actually time out…

finalMatchFunction#withtimeout × 10,798 ops/sec ±2.07% (86 runs sampled)

Process finished with exit code 0
Enter fullscreen mode Exit fullscreen mode

Hmmmmmmmmmmmmmm😭😭😭

Why did adding a timeout that we never actually hit result in WORSE performance than the threaded approach!?

Some Google searches lead to a GitHub issue that lead to a PR that lead to a note, hidden at the bottom of the documentation page stating:

⚠️ Using the timeout or breakOnSigint options will result in new event loops and corresponding threads being started, which have a non-zero performance overhead.

Argghh!

Conclusions

Due to the fact that a worker thread is more complicated (harder to read, needs more resources, requires all functions that use it to change their signatures to async) than the vm solution, and they have similar performance, the vm solution was chosen. Optimizations can be made if it becomes a bottleneck in the future.

Now that I've explained how we went about implementing this functionality, we'd love to have you give our new custom base image recommendations a try. If you like it, let us know @snyksec on Twitter.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .