Motion-controlled Fruit Ninja game using Three.js & Tensorflow.js

Charlie Gerard - May 11 '20 - Dev Community

Over the past few weeks, I've spent some time building a clone of the Fruit Ninja game that you can play with hand movements, using web technologies.

Demo:


Feel free to check the live demo.

In this post, I'm gonna go through the steps I took to build it, the tools I used and the different challenges I encountered.

(If you're more into videos, I made a quick video tutorial)


Step 1: Breaking the problem down

The first thing I always do when I come up with an idea is spend some time figuring out how to break it down into smaller pieces.
This way, I can identify the parts of the project I might already know how to build, the parts where I need to do some extra research, the different tools I need to use based on the features, and finally, get a rough idea of the timeframe needed to build it.

For this particular project, I ended up with the following parts:

1) Get the pose detection working
2) Set up the 3D scene
3) Add 3D objects
4) Map the 2D hands movements to the 3D world
5) Create the hand trail animation
6) Add collision detection between the hands and 3D objects
7) Add the game logic (points, sounds, game over, etc...)
8) Refactor
9) Deploy


Step 2: Picking the tools

Now that the project is broken down into independent chunks, we can start thinking about the tools we need to build it.

Pose detection

I knew I wanted to be able to detect hands and their position.
Having worked with the PoseNet library before, not only did I know that it was a good tool to do this, but I also knew it wouldn't take me too long to implement it again.

3D scene

I also knew I wanted to use 3D in the browser. One of the best libraries for this is the amazing Three.js.
Having used it before as well, I decided to go with it instead of trying something like Babylon.js or p5.js, only because my free time is limited and I want to optimise how I use it.

3D objects

The goal of the game is to slice some fruits and avoid bombs, so I needed to load these 3D objects into the game. Even though I could have gone ahead and designed them myself in software like Blender, it would have taken a lot longer. Instead, I used Poly to search through assets created by other people and available to download.

Hand trails

I wanted to visualise where my hand was in the 3D scene. I could have done it by showing a simple cube but I wanted to try something a little different. I had never tried to create some kind of "trail" effect so I did some research and found a really cool little library called TrailRendererJS that lets you create a nice looking trail effect.

Sounds

I also wanted to add some sounds to the game. Even though I could have done this using the native Web Audio API, I sometimes find it a bit difficult to use. There are a few JavaScript libraries that offer a level of abstraction, but my favourite is Howler.js.

Hosting

I decided to host this experiment on Netlify, not because I work there, but because I find it the simplest and fastest way to deploy stuff at the moment.


Step 3: Building the thing

Without going through the whole code, here are some samples of the main features.

Pose detection

To use PoseNet, you need to start by adding the following scripts to your HTML, if you're not using it as an npm package:



<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/posenet"></script>



Once you have access to the library, you need to load the model:



const loadPoseNet = async () => {
  net = await posenet.load({
    architecture: "MobileNetV1",
    outputStride: 16,
    inputResolution: 513,
    multiplier: 0.75,
  });

  video = await loadVideo();

  detectPoseInRealTime(video);
};



Here, we start by loading the machine learning model, then we initialise the video feed, and once both of these steps have completed, we call the function responsible for detecting the body position in the webcam feed.

The loadVideo function initiates the webcam feed using the built-in getUserMedia API.



const loadVideo = async () => {
  const video = await setupCamera();
  video.play();
  return video;
};

const setupCamera = async () => {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }

  const video = document.getElementById("video");
  video.width = window.innerWidth;
  video.height = window.innerHeight;

  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: window.innerWidth,
      height: window.innerHeight,
    },
  });
  video.srcObject = stream;

  return new Promise(
    (resolve) => (video.onloadedmetadata = () => resolve(video))
  );
};



The detectPoseInRealTime function then runs continuously, estimating the pose on every animation frame.



const detectPoseInRealTime = async (video) => {
  async function poseDetectionFrame() {
    const imageScaleFactor = 0.5;
    const outputStride = 16;
    const flipHorizontal = false; // whether the pose coordinates should be mirrored; set this to match your webcam setup

    let poses = [];

    const pose = await net.estimateSinglePose(
      video,
      imageScaleFactor,
      flipHorizontal,
      outputStride
    );
    poses.push(pose);

    let minPoseConfidence = 0.1;
    let minPartConfidence = 0.5;

    poses.forEach(({ score, keypoints }) => {
      if (score >= minPoseConfidence) {
        const leftWrist = keypoints.find((k) => k.part === "leftWrist");
        const rightWrist = keypoints.find((k) => k.part === "rightWrist");

        console.log(leftWrist.position); // will return an object with shape {x: 320, y: 124};
      }
    });
    requestAnimationFrame(poseDetectionFrame);
  }
  poseDetectionFrame();
};



Setting up a 3D scene

To start using Three.js, you need to load it:



<script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/110/three.min.js"></script>



Then, you can start creating your scene, camera and renderer.



const initScene = () => {
  scene = new THREE.Scene();
  camera = new THREE.PerspectiveCamera(
    75,
    window.innerWidth / window.innerHeight,
    1,
    1000
  );

  camera.position.set(0, 0, 300);
  scene.add(camera);
};




const initRenderer = () => {
  renderer = new THREE.WebGLRenderer({
    alpha: true,
  });
  renderer.setPixelRatio(window.devicePixelRatio);
  renderer.setSize(window.innerWidth, window.innerHeight);
  let rendererContainer = document.getElementsByClassName("game")[0];
  rendererContainer.appendChild(renderer.domElement);
};


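For the scene to actually show up and keep updating, you also need a render loop; a minimal version looks something like this:



// Very simple render loop: re-draw the scene on every animation frame
const animate = () => {
  requestAnimationFrame(animate);
  renderer.render(scene, camera);
};
animate();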

Loading 3D objects

To load 3D models, you need to add two additional loaders, OBJLoader and MTLLoader. These loaders will allow you to load the 3D objects and their materials.
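They're not included in the core Three.js build loaded above, so they need to be pulled in separately; one way to do that (the URLs below are just an example matching the r110 build, any copy of the r110 example loaders works) is with two extra script tags:



<script src="https://cdn.jsdelivr.net/npm/three@0.110.0/examples/js/loaders/MTLLoader.js"></script>
<script src="https://cdn.jsdelivr.net/npm/three@0.110.0/examples/js/loaders/OBJLoader.js"></script>



Once they're available, each model and its material can be loaded like this: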



const fruitsModels = [
  { model: "banana/Banana_01", material: "banana/Banana_01", name: "banana" },
  { model: "apple/Apple_01", material: "apple/Apple_01", name: "apple" },
  {
    model: "bomb/bomb",
    material: "bomb/bomb",
    name: "bomb",
  },
];

const loadFruitsModels = () => {
  return fruitsModels.map((fruit) => {
    var mtlLoader = new THREE.MTLLoader();
    mtlLoader.setPath("../assets/");
    mtlLoader.load(`${fruit.material}.mtl`, function (materials) {
      materials.preload();

      // Once the material is ready, load the matching .obj geometry
      var objLoader = new THREE.OBJLoader();
      objLoader.setMaterials(materials);
      objLoader.setPath("../assets/");
      objLoader.load(`${fruit.model}.obj`, function (object) {
        object.traverse(function (child) {
          if (child instanceof THREE.Mesh) {
            var mesh = new THREE.Mesh(child.geometry, child.material);
            fruitModel = mesh;
            fruitModel.name = fruit.name;
            // Keep the loaded mesh around so it can be cloned later,
            // and add a first instance of it to the scene
            fruits.push(fruitModel);
            generateFruits(1);
          }
        });
      });
    });

    // The loaders are asynchronous, so the fruits array is filled in the callbacks above
    return fruits;
  });
};



In the code sample above, I am separating the step that loads the models from the one that appends them onto the scene. I am doing this because I want to load the models only once, but be able to generate new objects as they appear and disappear from the screen.



const generateFruits = (numFruits) => {
  for (var i = 0; i < numFruits; i++) {
    const randomFruit = fruits[generateRandomPosition(0, 2)]; // pick one of the three loaded models at random
    let newFruit = randomFruit.clone();

    switch (newFruit.name) {
      case "apple":
        newFruit.position.set(0, 0, 100);
        break;
      case "banana":
        newFruit.position.set(0, 0, 0);
        break;
      case "bomb":
        newFruit.position.set(0, 0, 100);
        newFruit.scale.set(20, 20, 20);
        break;
      default:
        break;
    }

    fruitsObjects.push(newFruit);

    scene.add(newFruit);
    renderer.render(scene, camera);
  }
};



To make the code sample above easier to read, I'm setting the position at x: 0, y: 0. However, in the real game, the positions are set randomly as each fruit is created and appended to the scene.
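The generateRandomPosition helper used above isn't shown here; it simply returns a random integer in a given range, so a possible implementation would be something along these lines:



// Possible implementation (a sketch): a random integer between min and max, inclusive
const generateRandomPosition = (min, max) =>
  Math.floor(Math.random() * (max - min + 1)) + min;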

Mapping 2D coordinates to 3D position

This part is one of the trickiest and, to be honest, one that I don't think I can explain properly.

The complexity lies in the fact that the 2D coordinates from PoseNet don't map directly to coordinates in the Three.js scene.

The coordinates PoseNet gives us are the same ones you would get if you were logging the position of the mouse in the browser window, so the value on the x axis goes from 0 up to the width of the window in pixels (1280, for example).

However, coordinates in a 3D scene don't work the same way so you have to convert them.

To do this, we start by creating a vector from our hand coordinates.



const handVector = new THREE.Vector3();
// The x coordinates seem to be flipped so I'm subtracting them from window.innerWidth.
// Both values are normalised to the [-1, 1] range Three.js expects before unprojecting.
handVector.x =
    ((window.innerWidth - hand.coordinates.x) / window.innerWidth) * 2 - 1;
handVector.y = -(hand.coordinates.y / window.innerHeight) * 2 + 1;
handVector.z = 0;



Then, we use the following bit of magic to map the coordinates to a 3D world and apply them to our hand mesh.



handVector.unproject(camera); // from normalised coordinates to a point in world space
const cameraPosition = camera.position;
const dir = handVector.sub(cameraPosition).normalize(); // direction from the camera towards that point
const distance = -cameraPosition.z / dir.z; // how far along that direction the z = 0 plane sits
const newPos = cameraPosition.clone().add(dir.multiplyScalar(distance)); // the point where the ray crosses z = 0

hand.mesh.position.copy(newPos);



Collision detection

This part is the other tricky one.

Only after the 2D coordinates have been mapped to 3D ones can we work on collision detection. From what I know, you cannot do collision detection directly between 2D coordinates and 3D objects.

The way we're doing this is by implementing what is called Raycasting.
Raycasting is the creation of a ray cast from an origin point (our hand mesh) in a certain direction. Using this ray, we can check if any object in our scene intersects it (collision).

The code to do this looks like this:



const handGeometry = hand.mesh.geometry;
const originPoint = hand.mesh.position.clone();

for (
  var vertexIndex = 0;
  vertexIndex < handGeometry.vertices.length;
  vertexIndex++
) {
  const localVertex = handGeometry.vertices[vertexIndex].clone();
  const globalVertex = localVertex.applyMatrix4(hand.mesh.matrix);
  const directionVector = globalVertex.sub(hand.mesh.position);

  const ray = new THREE.Raycaster(
    originPoint,
    directionVector.clone().normalize()
  );

  const collisionResults = ray.intersectObjects(fruitsObjects);

  if (collisionResults.length > 0) {
    if (collisionResults[0].distance < 200) {
      // This distance value is a little bit arbitrary.
      console.log("Collision with a fruit!! 🍉");
    }
  }
}



If you don't understand entirely what it does, that's ok; I find it pretty complicated too.
The main parts you need to understand are that we clone the position of our hand (originPoint), we loop through all the vertices of the hand mesh, we create a Raycaster entity, and we check if the ray intersects with any fruit object. If it does, there's a collision!
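What happens once a collision is detected is part of the game logic (step 7 of the breakdown). As a purely illustrative sketch of what slicing a fruit could look like, combining pieces shown elsewhere in this post (not the actual game code):



// Hypothetical handler, for illustration only
const handleFruitCollision = (fruit) => {
  scene.remove(fruit); // remove the sliced fruit from the scene
  fruitsObjects.splice(fruitsObjects.indexOf(fruit), 1); // stop testing collisions against it
  score += 1; // "score" is an assumed game state variable
  newFruitSound.play(); // sound effect (see the sounds section below)
  generateFruits(1); // spawn a new fruit to replace it
};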


Hand trails

The code to render the hand trail is a bit long, but if you want to have a look, I'd advise you to check the example from TrailRendererJS directly.

I just made some changes to fit the style that I wanted, and removed the bits I didn't need.


Playing sounds

To start using Howler.js, you need to add the following script tag in your HTML:



<script src="https://cdnjs.cloudflare.com/ajax/libs/howler/2.1.3/howler.min.js"></script>



Once it's loaded, you can use it like this:



let newFruitSound = new Howl({ src: ["../assets/fruit.m4a"] });
newFruitSound.play();



Challenges

Here are a couple of challenges I encountered while working on this project.

Positions in 3D

I find positioning objects in 3D quite tricky, especially when I am using OBJ models downloaded from somewhere else.
When I loaded the apple model into my scene, I assumed that, when setting the x, y and z coordinates, I could use the same values for the banana and bomb models, but it turned out not to be the case.

Depending on how the objects were created in a 3D software, they might have a different scale or position within their own bounding box. As a result, I had to manually test different positions and scales for each model loaded, which can take a while.
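One thing that can help with this (an approach I'm only sketching here, not something the game does) is measuring each model's bounding box after loading it, so you know its actual size and centre before picking a scale and position:



// Sketch: inspect a loaded model with THREE.Box3 to see its dimensions and centre
const box = new THREE.Box3().setFromObject(fruitModel);
const size = box.getSize(new THREE.Vector3()); // width / height / depth in scene units
const center = box.getCenter(new THREE.Vector3());
console.log(fruitModel.name, size, center);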

This issue also impacted the collision detection. Sometimes, raycasting didn't seem to work if I modified the scale of an object. I then played with the z axis to bring objects closer to or further from the camera, but as a result, the collision detection doesn't work 100% of the time because of the collisionResults[0].distance < 200 check.

Without this check for distance though, collision seems to be detected even when I don't hit a fruit on the screen so there's obviously something I don't quite understand there.

Performance

When working on this type of side project, I know that the performance is not gonna be the best, because of how much I'm expecting the browser to handle.

However, considering I am doing live pose detection, 3D animation and collision detection in the browser, I don't think the current lag is THAT bad. 😬

We can always improve performance though, so I ran a Lighthouse audit, fixed a few things, and ended up with a pretty good score.

Lighthouse score. Performance: 90, Accessibility: 100, Best practices: 100, SEO: 100

However, this is where web performance can mean different things.

Technically, performance metrics like first paint, time to interactive, etc. were pretty good, but that didn't mean the game experience felt performant.

I don't know yet if there is really anything I could do about that but I think it would be an interesting area to dive into.
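One way to put a number on that runtime feeling, rather than relying on load-time metrics, would be to count the frames actually rendered each second; a rough sketch:



// Rough FPS counter: logs how many frames were rendered during the last second
let frames = 0;
let lastTime = performance.now();

const measureFPS = () => {
  frames++;
  const now = performance.now();
  if (now - lastTime >= 1000) {
    console.log(`${frames} fps`);
    frames = 0;
    lastTime = now;
  }
  requestAnimationFrame(measureFPS);
};
measureFPS();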


Extra resources

Other side projects using similar tech:

  • Beat Pose - Beat Saber using hand movements, in the browser

  • Qua*run*tine - Hiking trails triggered by running


Hope it helps!
