Efficient Dart: optimizing CPU-bound load in Flutter without missing a frame

Maxim Saplin - Sep 26 '23 - - Dev Community

Last week, I created a Flutter app to run the Julia set generator on a gRPC Python server. This sample project discusses Flutter and Python integration. I had a prejudice towards Python being super slow. Surprisingly, by using Numba and making few alterations (@njit(parallel=True) and prange()) to the original Python code, I achieved a ~350x performance boost measured in frames-per-second (FPS) in the UI. All happening over gRPC with serialization/deserialization overhead between Flutter and Python processes.

Python, Numba, Flutter performance diffs, Julia set generation

What surprised me even more was implementing a Dart analogue to the Python code. I expected the state-of-art Dart and Flutter to be fast, but I found it to be three times slower in terms of FPS than Python, running natively in-process, with noticeable UI stutters.


Can we optimize Dart's performance for this CPU-bound task and beat Python/Numba? Let's find out!

About the Task

The Julia set is a fractal closely related to the Mandelbrot set. Calculating Julia/Mandelbrot sets can stress a CPU and are used in many benchmarks, such as Mojo. I chose the Julia set to create animated fractals by changing one of the formula parameters.

Julia set example

I used the escape algorithm to calculate the set for a given region of points. For each (x, y) coordinate representing a point on a complex plane, we iterate in a loop making up to a threshold number of iterations (default is 100), calculating if the point belongs to the set or not (diverges). For each point, we can store the number of iterations it took to verify it, and then use it to color and render the pixel. Essentially, we create a method that can generate a widthxheight pixel map with each pixel described by the number of loop iterations it took to validate the (x, y) number.

App Overview

You can find the Git repo here. The Flutter project is located in the app/ folder. The rest of the folders can be ignored (part of Python/gRPC example):

app/
|-- grpc_generated/
|-- julia_native/
|-- julia_service.dart
|-- main.dart
Enter fullscreen mode Exit fullscreen mode

1) main.dart contains the UI and rendering logic, using the RawImage widget to display the Julia set received from a stream provided by...
2) julia_service.dart, that offers an abstract interface to different generators.
3) julia_native/ contains Dart implementation variants of Julia set generator reviewed below.

Now let's examine the six different Dart implementation variants of the Julia set generator and evaluate their performance improvements based on provided test results.


V1: A Direct Python-to-Dart Translation

Variant 1 is a straightforward conversion of this Python code. It runs in main UI thread and serves as our baseline for performance measurements. The average FPS achieved by this variant is 37.4.

First, we start with implementing some of the missing parts needed for the calculation (recreating Complex number and linespace from NumPy):

// julia_native/v1.dart

class _Complex {
  double real;
  double imag;

  _Complex(this.real, this.imag);

  _Complex operator +(_Complex other) =>
      _Complex(real + other.real, imag + other.imag);
  _Complex operator *(_Complex other) => _Complex(
      real * other.real - imag * other.imag,
      real * other.imag + imag * other.real);

  double abs() => sqrt(real * real + imag * imag);
}

List<double> _linspace(double start, double end, int num) {
  double step = (end - start) / (num - 1);
  return List<double>.generate(num, (i) => start + (step * i));
}
Enter fullscreen mode Exit fullscreen mode

Next we do the calculation:

List<int> _getSetAsHeightMap(
    int widthPoints, int heightPoints, int threshold, double position) {
  List<int> result = List<int>.filled(widthPoints * heightPoints, 0);
  double width = 4, height = 4 * heightPoints / widthPoints;
  double xStart = -width / 2, yStart = -height / 2;

  List<double> re = _linspace(xStart, xStart + width, widthPoints);
  List<double> im = _linspace(yStart, yStart + height, heightPoints);

  double r = 0.7;
  double a = 2 * pi * position;
  double cx = r * cos(a), cy = r * sin(a);

  for (int i = 0; i < heightPoints; i++) {
    for (int j = 0; j < widthPoints; j++) {
      result[i * widthPoints + j] =
          _checkInJuliaSet(re[j], im[i], cx, cy, threshold);
    }
  }

  return result;
}

int _checkInJuliaSet(
    double zx, double zy, double constX, double constY, int threshold) {
  _Complex z = _Complex(zx, zy);
  _Complex c = _Complex(constX, constY);

  for (int i = 0; i < threshold; i++) {
    z = z * z + c;
    if (z.abs() > 4.0) {
      return i;
    }
  }

  return threshold - 1;
}
Enter fullscreen mode Exit fullscreen mode

The _getSetAsHeightMap() returns a flattened widthPointsxheightPoints matrix with number of iterations per each point.

The _checkInJuliaSet() function evaluates a specific (x,y) point. From the above code you can see why this computation can be CPU intensive, there're 3 nested loops. E.g for an area of 1000x1000 pixels and average iteration per point being 50 we can expect 50 million calculations to be conducted per each pixel!

The problem with this implementation is that UI becomes unresponsive. I.e. hovering over buttons or clicking shows significant delays in reaction.

Dart Isolates

I will be using isolates in order to:

  • Fix the UI stuttering by doing as little as possible in the main UI tread and moving the heavy work to isolates.
  • Put to work all of our CPU cores, Julia set generation is a task that can be perfectly split, each individual row and pixel value can be computed independently.

There're no Threads in Dart, parallelism is achieved via isolates. In simple terms they are backed by OS threads and can run in parallel but when used in Dart they do not share memory (i.e. you can't directly access variable in different isolate) and communicate via messages - similar to Web Workers in JS.

Isolates are a middle ground between concurrency without parallelism (as in Python, JS/TS and Dart with one isolate) and parallelism with shared memory and threads (as in C/C++, C#, Java). At a cost of overheads in cross-isolate communication (all objects must be copied by value, in some cases serialized/deserialized) Dart provides a more robust model less prone to tricky bugs (race conditions, dead locks, shared state corruption etc.).

By default Dart provides 2 APIs to work with isolates:

  • Isolate class - a lower level API that requires time and effort to start using isolates, provides basic means, such as ports, streams and leaves it to developer to decide and implement
  • Flutter's compute() utility method that allows to spin up a new isolate, pass in a message and callback and receive the result back.

In this example I'll be using isolate_pool_2 package that is a wrapper around Dart's Isolate API and has a few handy features:

  • It creates a pool of isolates ready to accept requests. No time wasted to start isolate upon each request, and we will have many.
  • It balances workload between the pool isolates. As soon one isolate is done another item from the que of requests is taken for execution
  • It has a notion of PooledJob that allows to define the request to isolate as a class containing all required data and operation that will be executed in isolate and returned back.

V2: Parallelization using isolates

By utilizing 10 isolates, V2 achieves an average FPS of 97.1, which is a significant improvement compared to the baseline performance.

// julia_native/v2.dart

// Skipping imports and _canceled, cancelSetGeneration(), and _position for brevity

late IsolatePool _pool;

Future<List<int>> _getSetAsHeightMapParallel(
    int widthPoints, int heightPoints, IsolatePool pool, int threshold) async {
  // ...
}

class GetBlockJob extends PooledJob<List<int>> {
  // ...

  @override
  Future<List<int>> job() async {
    List<int> result = List<int>.filled(widthPoints * im.length, 0);
    // ...
    for (int i = 0; i < im.length; i++) {
      for (int j = 0; j < widthPoints; j++) {
        result[i * widthPoints + j] =
            _checkInJuliaSet(re[j], im[i], cx, cy, threshold);
      }
    }

    return result;
  }
}
Enter fullscreen mode Exit fullscreen mode

In V2, the Julia set generation is parallelized using the IsolatePool library, and the same logic from V1 is wrapped inside a GetBlockJob class. Since we can independently work on several parts of the image, we can assign different "stripes" to be created by jobs and then merged into one list:

Future<List<int>> _getSetAsHeightMapParallel(
    int widthPoints, int heightPoints, IsolatePool pool, int threshold) async {
  var list = List.filled(widthPoints * heightPoints, 0);

  double width = 4, height = 4 * heightPoints / widthPoints;
  double xStart = -width / 2, yStart = -height / 2;
  List<double> im = _linspace(yStart, yStart + height, heightPoints);  

  int blockSize = (heightPoints / pool.numberOfIsolates).ceil();

  List<Future> futures = [];

  for (var i = 0; i < heightPoints; i += blockSize) {
    futures.add(pool
        .scheduleJob<List<int>>(GetBlockJob(
            widthPoints: widthPoints,
            heightPoints: heightPoints,
            width: width,
            height: height,
            xStart: xStart,
            yStart: yStart,
            im: im.sublist(i, min(i + blockSize, heightPoints)),
            threshold: threshold,
            position: _position))
        .then((v) {
      list.setAll(i * widthPoints, v);
    }));
  }

  await Future.wait(futures);
  return list;
}
Enter fullscreen mode Exit fullscreen mode

This implementation solves UI stutters/freezes of V1 since there's no heavy lifting happening in the UI thread.

V3: Using Uint8List instead of List

Let's carefully look at the kind of data returned by our isolate job:

 @override
  Future<List<int>> job() async {
Enter fullscreen mode Exit fullscreen mode

We return a list of integers which by default are 64 bit/8 bytes.

On the other hand each point in our Julia set is never described by a number larger than 255, there's even an assertion in julia_service.dart that doesn't allow greater threshold values:

  assert(iterationThreshold >= 1 && iterationThreshold <= 255,
      'Threshold must be between 1 and 255');
Enter fullscreen mode Exit fullscreen mode

Though you can't have 1 byte integers in Dart, or can you?

Luckily Dart has a special typed_data library that has a number of intrinsics that help to efficiently move bytes around. Let's swap List<int> for UintList8 everywhere in the file, like this:

// julia_native/v3.dart

Future<Uint8List> _getSetAsHeightMapParallel(
    int widthPoints, int heightPoints, IsolatePool pool, int threshold) async {
  var list = Uint8List(widthPoints * heightPoints);
...
Enter fullscreen mode Exit fullscreen mode

Please note that external clients of our code expect to receive List<int>, and luckily Uint8List can be implicitly casted, so there's no need to change anything in getSetAsHeightMapAsBytesStream() method.

Besides Uint8List being more compact and saves CPU cycles on marshalling note that it can be faster to initialize. E.g. List<int> requires you to initialise it with values when creating a fixed length list, Uint8List constructor doesn't require a default value.

By this minor change we gain almost 400% in performance! 178,5 fps vs baseline 37,4!

V4: Improving Single-Threaded Performance

Let's step back and see if we can make some gain in our algo implementation.

We can get rid of _Complex class and work on individual coordinates, skip sqrt() operation (we can check escape condition with by comparing squared sum to 16, rather than a square root to 4), implement few optimisations to the calculation in the loop (use zx0zy0):

// julia_native/v4.dart

// Skipping imports for brevity

int _checkInJuliaSet(
    double zx, double zy, double constX, double constY, int threshold) {
  var zx0 = zx;
  var zy0 = zy;

  for (int i = 0; i < threshold; i++) {
    final zx0zy0 = zx0 * zy0;
    zx0 = zx0 * zx0 - zy0 * zy0 + constX;
    zy0 = zx0zy0 + zx0zy0 + constY;
    if ((zx0 * zx0 + zy0 * zy0) > 16) {
      return i;
    }
  }

  return threshold - 1;
}

Enter fullscreen mode Exit fullscreen mode

This optimization achieves an average FPS of 47.2 in single-threaded mode.

While it might not be as nice looking as the one using _Complex class, if it achieves better performance in the critical part, why not.

V5: Combining Optimizations from V3 and V4

Variant 5 incorporates the improved calculations from V4 into the parallelized V3 implementation. It achieves 198,1 fps(vs 178,5 fps).

V6: Optimized Work Distribution among Isolates

You could notice that frames with many red pixels take longer to compute. That's because these frames require more loops inside _checkInJuliaSet() since escapes take longer (sometimes reaching the threshold value).

When splitting work between isolates some might get regions with fewer points that require a lot of iterations, some will got more of those. Thus it will take longer from overloaded isolates to return their part which latter will e merged into end results. To overcome this uneven load we can break down the areas given to isolates even further, i.e. instead of 10 areas given to 10 isolates we can give them 40 smaller stripes and in this case it is less likely that one isolate will be working hard on large one computation intensive stripe.

In V6, the primary change from V5 is the new blockSize calculation that incorporates a 4x factor for distributing workload more evenly among isolates.

// julia_native/v6.dart


Future<List<int>> _getSetAsHeightMapParallel(
    int widthPoints, int heightPoints, IsolatePool pool, int threshold) async {
  // ...
  int blockSize = (heightPoints / (pool.numberOfIsolates * 4)).ceil();
  // ...
}

// Rest of the code remains the same as V5
Enter fullscreen mode Exit fullscreen mode

We get further jump from 198.1 to 219.1 fps. By fine tuning the number of isolates and split coefficient we seek for further improvements. I tried 9 isolates and x3 split and it worked batter than 10/x4 - 225.9 fps.

Measurement and Assumptions

I use frames-per-second (FPS) as a measure. The FPS counter is displayed at the top left corner of the window. To start the measurement, double-click the Play button, and the animation will cycle three times, with the average FPS registered.

Numbers are provided for Flutter 3.13.5, macOS 13.6, release build; MacBook Pro with M1 Pro CPU (8 performance and 2 efficiency cores).

FPS shows how many Julia sets (pixel maps flattened as List<int>) are received in a given second in Stream.listen event. The FPS counter includes the calculations of Julia set, widget build times, and async/stream/list creation overhead, among other factors. It does not measure the pure Julia loop execution time but loads the entire Flutter pipeline assuming that most of the CPU time is spent on Julia set generation.

Rendering/rasterization is ignored as a factor since it is assumed to be fast and takes place in a separate rasterization thread. As seen in the Dart Dev Tools Performance page, almost no frames exceed 1ms raster time, indicating that most CPU work occurs in the UI thread (computations, creating widgets, allocating lists, etc.).

Image description

Conclusion

By changing the original implementation we were able to get 6x performance gain and recover UI responsiveness!

V1 vs V6 comparison

We have tried six different Dart implementation variants (10 isolates by default):

Variant Average FPS Description
V1 37.4 Baseline, direct Python-to-Dart translation
V2 97.1 Parallelization using IsolatePool
V3 178.5 Using Uint8List instead of List
V4 47.2 Improved single-threaded calculations
V5 198.1 Combining optimizations from V3 and V4
V6 219.1 Optimized work distribution among isolates
V6 tuned 225.9 9 isolates (vs 10), x3 split (vs x4)

And a graph:

Image description

The test results show that significant performance improvements can be achieved by tinkering with your code and even small gains matter (especially when they are nested in loops).

The principles can be extrapolated to any Flutter/Dart app:

  • Off-load intensive computations to isolates
  • Think carefully about what data is shared between isolates, keep it small to minimise serialization/deserialization overhead
  • Utilise many isolates, balance workload between them
  • Tinker with your algorithm, especially when used in the loop
  • Use DevTools, profile your code

DevTools

In this article I've barely touched Dev Tools (available in VSCode when you have Dart/Flutter extension installed), though I used it to experiment and tinker with the code:

  1. Performance view to quickly evaluate frame rates and understand if there's problem in UI and rasterisation threads
  2. CPU Profiler - to find out where most of CPU time was spent
  3. Memory view - to diagnose degrading performance when running animation and pinpoint memory leak that I fixed.

Update, V7 with TransferableTypedData

I've added another 5fps in V7 (230 vs 225 in V6). The only difference is using TransferableTypedData returned from the isolate:

class GetBlockJob extends PooledJob<TransferableTypedData> {
...
Enter fullscreen mode Exit fullscreen mode
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .