Mojo🔥: Head-to-Head with Python and Numba

Maxim Saplin - Sep 27 '23 - - Dev Community

Earlier this month, Mojo SDK was released for local download. There was a lot of buzz about how it can speed up Python by 35,000x or even 68,000x.

Around the same time, I discovered Numba and was fascinated by how easily it could bring huge performance improvements to Python code. Read this great article to learn more about Numba.

If we put aside employing SIMD extensions (SSE, MMX, etc.), avoid locking into a specific CPU architecture or making changes to existing code, and don't test perfectly parallelizable tasks on a high-end server with 88 hyper-threaded cores... will Mojo be 68,000 times faster?

Getting Acquainted with Mojo🔥

My first takeaway - Mojo is a completely different language. You can't start typing standard Python in a .mojo file and expect it to be compilable by Mojo right away.

Second takeaway - there are no arrays/lists :) I had to create my own implementation:

struct Array[T: AnyType]:
    var data: Pointer[T]
    var size: Int

    fn __init__(inout self, size: Int):
        self.size = size
        self.data = Pointer[T].alloc(self.size)

    fn __getitem__(self, i: Int) -> T:
        return self.data.load(i)

    fn __setitem__(self, i: Int, value: T):
        self.data.store(i, value)

    fn __moveinit__(inout self, owned existing: Self):
      self.data = existing.data
      self.size = existing.size

    fn __del__(owned self):
        self.data.free()

3rd takeaway - there's a Rust-like ownership memory management model (no Garbage Collector). I had to use the ^ operator and implement Array.__moveinit__() to allow the array created in a function to be returned:

fn mandelbrot() -> Array[Int]:
    let output = Array[Int](width*height)

    for h in range(height):
        let cy = min_y + h * scaley
        for w in range(width):
            let cx = min_x + w * scalex
            let i = mandelbrot_0(ComplexFloat64(cx, cy))
            output[width*h+w] = i
    return output^ # transfer ownership

4th - there's out-of-the-box interop with Python, though you need some manual imports before you can touch .py files or use PyPI modules.

5th - it's still in early development and many things are still missing (docs, types, community), and it works only on Intel CPUs with Linux. Yet it is very fast :)

Micro-benchmarking

I took the Mandelbrot example from the Mojo blog and reimplemented it in 4 flavors:

  1. mandelbrot.py - baseline implementation
  2. mandelbrot.🔥 - Python translation into Mojo. It was not as straightforward as expected.
    • No arrays; the Tensor structure can't be used as an alternative unless you want to use SIMD
    • Different type names, i.e., Int is written with an uppercase I
    • Non-standard print()
    • Had issues interoperating with Python's time, used Mojo's alternative
    • The let keyword creates immutable variables
    • Ownership/transferring return value via ^
  3. mandelbrot_numba.py - all I did was clone the Python file, import Numba, and put 2 @njit decorators on the functions
  4. mandelbrot_numba_prange.py - although I didn't intend to add parallelization to this mixture, it was so easy with Numba (way fewer steps than with Mojo) that I couldn't resist. I simply changed the decorators (parallel=True) and replaced 2 range statements with prange
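For reference, the escape-time kernel shared by all four variants can be sketched in pure Python like this (function names and grid bounds here are my own illustration, not the exact code from the benchmark):

```python
# Minimal sketch of a Mandelbrot escape-time benchmark in pure Python.
# The bounds and iteration cap are illustrative assumptions.
def mandelbrot_kernel(c: complex, max_iter: int = 200) -> int:
    """Return the iteration at which |z| escapes 2, or max_iter if it never does."""
    z = complex(0, 0)
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return i
    return max_iter

def mandelbrot_grid(width: int, height: int,
                    min_x: float = -2.0, max_x: float = 0.6,
                    min_y: float = -1.5, max_y: float = 1.5) -> list:
    """Flat row-major list of iteration counts, like the Mojo Array version."""
    scale_x = (max_x - min_x) / width
    scale_y = (max_y - min_y) / height
    output = [0] * (width * height)
    for h in range(height):
        cy = min_y + h * scale_y
        for w in range(width):
            cx = min_x + w * scale_x
            output[width * h + w] = mandelbrot_kernel(complex(cx, cy))
    return output
```

The Numba variant differs only by `@njit` decorators on these two functions; the prange variant additionally sets `parallel=True` and swaps the outer `range` for `numba.prange`.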

I tested these files (python3 'file name' and mojo 'file name') on the same Linux VM:

  • Ubuntu 20.04.3 LTS, 64-bit, Intel Core i5-8257U @ 1.4GHz x 2, VMWare Workstation Player 17.0.1

Results

| Language/variant | Time (seconds) | x Baseline |
| --- | --- | --- |
| Python | 10.8 | x 1.0 |
| Python + Numba | 0.68 | x 15.9 |
| Python + Numba (fastmath)* | 0.64 | x 16.9 |
| Python + Numba (prange) | 0.38 | x 28.4 |
| Mojo* | 0.32 | x 33.8 |

(*) When checking the produced results, I noticed minor discrepancies in the generated set. Apparently, Mojo and Numba with the fastmath flag use different floating-point behavior: for some corner cases, rounding and comparisons differed for the same set of input params.
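Such discrepancies are plausible: fastmath-style flags permit the compiler to reassociate floating-point operations, and floating-point addition is not associative, so the last bits of a result can change - enough to flip a `> 2.0` escape test near the set's boundary. A tiny pure-Python illustration:

```python
# Floating-point addition is not associative, so an optimizer that
# reorders operations (as fastmath-style flags allow) can produce
# slightly different results from the same inputs.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # 0.6000000000000001
reassociated = a + (b + c)    # 0.6

print(left_to_right == reassociated)  # False
```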

Conclusion

I had this misconception about Mojo that it could be a drop-in substitute for Python, i.e., if you had some Python code that you wanted to accelerate, you could copy and paste it into a Mojo project and be covered. That turned out not to be the case.

At the same time, Numba is the right tool for quick performance wins with an existing Python code base.

UPDATE 1: Numba beats Mojo!

As my colleague pointed out, the Python implementation used NumPy, which is a bit of cheating. He suggested using Python without 3rd-party libraries...

This pure version gave 0.29 seconds without utilising Numba's parallelisation and 0.19 seconds with prange(). This is the best result!

UPDATE 2: Custom Complex class in Python

As pointed out by a different person, while NumPy and Mojo used a separate class/struct for handling complex numbers, the variant without NumPy operated on two numbers (the real and imaginary parts) directly. That is clearly an optimisation trick and doesn't help code readability.

Hence the 3rd version of the Python implementation. This time it defines a Complex class with 3 operations and uses its instances to handle complex-number math.
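A minimal sketch of what such a class might look like (my reconstruction; the actual benchmark code may differ) - addition, multiplication, and a squared-magnitude check are all the escape-time loop needs:

```python
class Complex:
    """Hand-rolled complex number with only the operations the
    Mandelbrot escape-time loop needs."""
    def __init__(self, re: float, im: float):
        self.re = re
        self.im = im

    def __add__(self, other: "Complex") -> "Complex":
        return Complex(self.re + other.re, self.im + other.im)

    def __mul__(self, other: "Complex") -> "Complex":
        # (a+bi)(c+di) = (ac - bd) + (ad + bc)i
        return Complex(self.re * other.re - self.im * other.im,
                       self.re * other.im + self.im * other.re)

    def norm_sq(self) -> float:
        # |z|^2; comparing against 4.0 avoids a sqrt per iteration
        return self.re * self.re + self.im * self.im

def escape_time(c: Complex, max_iter: int = 200) -> int:
    z = Complex(0.0, 0.0)
    for i in range(max_iter):
        z = z * z + c
        if z.norm_sq() > 4.0:
            return i
    return max_iter
```

Every iteration now allocates fresh Complex objects, which is exactly why this readable version is so much slower in pure Python than the two-floats variant.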

And here're the numbers:

  • Pure Python - ~27 minutes (1,672 seconds)
  • Numba, @njit() - 8.3 seconds (a ~200x boost)
  • Numba, @njit() and prange() - 4.3 seconds