Build a realtime closed-caption system in React, AssemblyAI and Ably

ablydevin - Nov 6 '23 - - Dev Community

Closed-captioning for television systems was first demonstrated in the early 1970s. Realtime closed-captioning was developed later in the early 1980s. Here, stenotype operators who type at speeds of over 225 words per minute provide captions for live television programs, allowing the viewer to see the captions within two to three seconds of the words being spoken was developed later in the early 1980s.

As the internet has exploded the ability for anyone to create content, including live video and audio, has also grown. It remains important to consider making that content as accessible as possible not only allowing those with hearing impairment to consume your content, but also those learning to read, learning to speak a non-native language, or those in an environment where the audio is difficult to hear or is intentionally muted.

Sometimes, consumers of your content may simply prefer to read a transcript.

In this blog post, we will build a realtime browser-based closed-captioning system using React, AssemblyAI’s realtime transcription service, and Ably. If you want to jump ahead and check out the completed code, check out this Github repo, otherwise let's get started!

Prerequisites

Before you start coding, you’ll need to make sure you have the following prerequisites.

First, you’ll need to create an AssemblyAI account. We’ll be using their realtime transcription services as part of our project. Realtime transcription is priced at a per-second rate so while you will need to add a payment method to your account to use realtime transcription you’ll only pay for what you use.

Next, you will need a free Ably account. We’ll use Ably to broadcast the transcription to our application users.

Lastly, we will be building the server-side parts of the application in Node including using Nodes native fetch API which is available in Node 18 or newer.

Once you’ve got those accounts, then head on to the rest of the blog post to start to build!

Application setup

Start your application by using Vite to scaffold a basic React and JavaScript application. Open a terminal and run Vites create command:

npm create vite@latest

Vite will ask you a few questions about your project including the project name, an application framework (select React), and which framework variant you want to use (select JavaScript).

After the scaffolding completes, navigate into the project directory and install the project dependencies using npm. Finally, run the project to verify that the scaffolded project works as expected:

cd react-closed-captioning
npm install
npm run dev
Enter fullscreen mode Exit fullscreen mode

Add application routing

Our closed captioning application will have two modes:

  • A publisher mode that records and transcribes the audio.
  • A subscriber mode that receives the transcription text.

We’ll need to allow a user to navigate between those modes which we do by adding routing using the react-router-dom package.

Start by creating components in the applications src directory for the publisher (Publisher.jsx) and subscriber (Subscriber.jsx) modes:

import { useState } from "react";
import './App.css'

function Publisher() {}

export default Publisher
Enter fullscreen mode Exit fullscreen mode
import { useState } from "react";
import './App.css'

function Subscriber() {}

export default Subscriber
Enter fullscreen mode Exit fullscreen mode

Next, install the react-router-dom package using NPM:

npm install react-router-dom

In the main.jsx file import the BrowserRouter object from react-router-dom and wrap the default App component with it:

import React from 'react'
import { BrowserRouter } from 'react-router-dom'
import ReactDOM from 'react-dom/client'
import App from './App.jsx'
import './index.css'

ReactDOM.createRoot(document.getElementById('root')).render(
  <React.StrictMode>
    <BrowserRouter>
     <App />
    </BrowserRouter>
  </React.StrictMode>,
)
Enter fullscreen mode Exit fullscreen mode

In App.jsx, add import statements for the Routes, Route, and Link objects as well as the Publisher and Subscriber components you created earlier:

import { Routes, Route, Link } from 'react-router-dom';
import Publisher from './Publisher'
import Subscriber from './Subscriber'
Enter fullscreen mode Exit fullscreen mode

Finally, add the route links inside of the App components Empty tag:

<div style={{ textAlign: "center", marginBottom: 10, }}>
  <div>
    <Link to="/Publisher">Publisher</Link>
  </div>
  <div>
    <Link to="/Subscriber">Subscriber</Link>
  </div>
</div>
<div style={{ textAlign: "center" }}>
  <Routes>
    <Route path="/publisher" element={<Publisher />} />
    <Route path="/subscriber" element={<Subscriber />} />
  </Routes>
</div>
Enter fullscreen mode Exit fullscreen mode

Restart the app if needed by running npm run dev or refresh the page you should see links to the Publisher and Subscriber views.

assemblyai-ably-image

Generate service authentication tokens

The two services we use in the application, Ably and AssemblyAI, both require the application to authenticate itself with their APIs. To do that we need to create API tokens using server-side code. This lets us keep our service keys safe on the server and instead create temporary authentication tokens our client code can use to authenticate with the services.

To create the server code for the application we’ll use a Vite plugin called vite-plugin-api. This plugin makes it easy to create API endpoints as part of our application.

Use NPM to install the plugin package, express web server package, and the Ably client library:

npm install vite-plugin-api express ably
Enter fullscreen mode Exit fullscreen mode

Add the plugin to your Vite configuration file:

import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'
import { pluginAPI } from "vite-plugin-api";

// https://vitejs.dev/config/
export default defineConfig({
  plugins: [react(),
    pluginAPI({
      // Configuration options go here
    }),
  ],
})
Enter fullscreen mode Exit fullscreen mode

The application will store those sensitive Ably and AssemblyAI service keys as environment variables. Vite makes it easy to access environment variables stored in a local file named .env.local. Create this file in the root of your project directory and then add the following keys:

VITE_ABLY_API_KEY=[YOUR_ABLY_API_KEY]
VITE_ASSEMBLYAI_API_KEY=[YOUR_ASSEMBLYAI_API_KEY]
Enter fullscreen mode Exit fullscreen mode

Make sure you replace the placeholder values with your actual Ably and AssemblyAI keys.

Next, inside of your project’s src directory create a new directory named api/tokens and add two files to it: realtime.js and transcription.js.

Inside of realtime.js add the following code which uses the Ably client library to generate and return a new Ably authentication token:

import Ably from "ably/promises";

export const GET = async (req, res) => {
  const client = new Ably.Rest(import.meta.env.VITE_ABLY_API_KEY);
  const tokenRequestData = await client.auth.createTokenRequest({ clientId: 'ably-demo' });
  console.log(`Request: ${JSON.stringify(tokenRequestData)}`)
  return res.json(tokenRequestData);
}
Enter fullscreen mode Exit fullscreen mode

Inside of transcription.js add the following code which uses Nodes fetch function to fetch and return a new AssemblyAI authentication token:

export const GET = async (req, res) => {
  const response = await fetch("https://api.assemblyai.com/v2/realtime/token", {
    method: "POST",
    body: JSON.stringify({ expires_in: 3600 }),
    headers: {
      Authorization: import.meta.env.VITE_ASSEMBLYAI_API_KEY,
   "Content-Type": "application/json",
    },
  });
  const token = await response.json();
  console.log(token);
  return res.json(token);
};
Enter fullscreen mode Exit fullscreen mode

Test your new API endpoints by loading the realtime (http://localhost:5173/api/token/realtime/) and transcription (http://localhost:5173/api/token/realtime/) routes in your browser. You should see responses similar to those seen in the images below:

Testing your API

API Token Responses

Create the publisher component

Now that we can generate tokens, we’ll implement the publisher component of the application. Remember that the publisher's job is to acquire realtime audio, send it to AssemblyAI for transcription, and then broadcast the transcription results to subscribers.

Start by adding some markup to the Publisher component:

function Publisher() {
  return (
    <div>
      <p id="real-time-title">Click start to begin recording!</p>
      <button id="start">Start</button>
      <button id="stop">Stop</button>
      <p id="message"></p>
    </div>
  );
}
Enter fullscreen mode Exit fullscreen mode

This markup adds two buttons to start and stop publishing and an area to display the transcription.

Two buttons to start and stop publishing

Next, add a variable to store an audio recorder object (which we will create in a future step) and button click handler functions for the start and stop buttons.

const texts = {};
let recorder = null;

async function start(e) {}
async function stop(e) {}
Enter fullscreen mode Exit fullscreen mode

Connect those handler functions to the buttons using their onClick event:

<button id="button" onClick={start}>Start</button>
<button id="button" onClick={stop}>Stop</button>
Enter fullscreen mode Exit fullscreen mode

Open an AssemblyAI WebSocket connection

AssemblyAI uses WebSockets as the transport for receiving raw audio and returning its transcription results. To use the AssemblyAI WebSocket endpoint in our React application we’ll use the react-use-websocket library. This library gives us a React-idiomatic way of using WebSockets.

Install react-use-websocket using NPM:

npm install react-use-websocket
Enter fullscreen mode Exit fullscreen mode

Once installed, add another import statement into Publisher.jsx to import the useWebSocket object from the library:

import useWebSocket from "react-use-websocket";
Enter fullscreen mode Exit fullscreen mode

Add two React state properties, one to store the WebSocket URL and a second to toggle opening and closing the WebSocket connection:

const [socketUrl, setSocketUrl] = useState('');
const [shouldConnect, setShouldConnect] = useState(false);
Enter fullscreen mode Exit fullscreen mode

Notice that the default for the shouldConnect property is false. By default, the react-use-websockets library will immediately try to open a WebSocket connection. In our application, however, we only want to open that connection when the Start button is clicked.

Next, use the useWebSocket hook to configure the WebSocket connection.

const { sendJsonMessage, lastMessage, readyState } = useWebSocket(socketUrl,
{
  onOpen: () => {
    console.log('AssemblyAI WebSocket opened');
  }
}, shouldConnect);
Enter fullscreen mode Exit fullscreen mode

The useWebSocket hook accepts three parameters.

  • url - The WebSocket URL we want to connect to.
  • options - An object containing WebSocket configuration options.
  • connect - A boolean value allowing us to control the library's automatic connection behavior.

Notice that our options object is defining a function for the onOpen event. We’ll use this event in just a moment to start recording realtime audio once the WebSocket connection opens.

In the start click handler add code that fetches a new AssemblyAI authentication token, appends that token to the WebSocket URL, and then toggles the shouldConnect value. This last step will trigger the useWebSocket hook to try to open a connection to the WebSocket URL.

async function start(e) {
  const response = await fetch("/api/tokens/transcription");
  const { token } = await response.json();
  setSocketUrl(`wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token=${token}`);
  setShouldConnect(true);
}
Enter fullscreen mode Exit fullscreen mode

Capturing audio in the browser

If the WebSocket connection opens successfully we can start to capture realtime audio and send it to AssemblyAI for transcription. To capture audio we’ll use the browsers built in getUserMedia function. The getUserMedia function is part of the MediaDevices set of APIs that browsers implement in order to provide developers with safe and standardized ways to access local media hardware devices like microphones, cameras, and speakers.

Calling getUserMedia, in our case, asking only to access audio, returns a MediaStream with tracks containing the requested types of media. Once we have the MediaStream we’ll use the RecordRTC library to transform the raw audio track into a format that we can pass to AssemblyAI. RecordRTC is an abstraction over the native MediaRecoder APIs and provides a number of capabilities that simplify recording audio in the browser.

Install recordrtc using NPM:

npm install recordrtc
Enter fullscreen mode Exit fullscreen mode

Once installed add an import statement importing the RecordRTC and StereoAudioRecorder objects from the library:

import RecordRTC, { StereoAudioRecorder } from "recordrtc";
Enter fullscreen mode Exit fullscreen mode

Inside the onOpen event handler call getUserMedia to access an audio MediaStream. Pass that stream to a new instance of a RecordRTC object. Based on the audio requirements of AssemblyAI’s realtime streaming API, configure the instance of the RecordRTC object to encode 250ms of audio as a single channel track using the pcm codec at a sample rate of 16000Hz.

onOpen: () => {
  console.log("AssemblyAI WebSocket opened");
  navigator.mediaDevices
      .getUserMedia({ audio: true })
      .then((stream) => {
        recorder = new RecordRTC(stream, {
        type: "audio",
        mimeType: "audio/webm;codecs=pcm", // endpoint requires 16bit PCM audio
        recorderType: StereoAudioRecorder,
        timeSlice: 250, // set 250 ms intervals of data that sends to AAI
        desiredSampRate: 16000,
        numberOfAudioChannels: 1, // realtime requires only one channel
        bufferSize: 16384,
        audioBitsPerSecond: 128000,
        ondataavailable: (blob) => {
          const reader = new FileReader();
          reader.onload = () => {
            const base64data = reader.result;
       // audio data must be sent as a base64 encoded string
           sendJsonMessage({
             audio_data: base64data.split("base64,")[1],
          });
        };
        reader.readAsDataURL(blob);
      },
    })
    recorder.startRecording();
  })
}
Enter fullscreen mode Exit fullscreen mode

Each time RecordRTC records 250ms of data it calls the ondatavailable event passing the recorded audio as a Blob. Use this event to base64 encode the audio and use the sendJsonMessage function of useWebSockets to send the encoded data to AssemblyAI.

Finally, after creating the new RecordRTC object, call the startRecording function which triggers the recording of audio and the process of converting and sending it to AssemblyAI.

Pause the recorder

Starting the recording and transcription process is great, but at some point, you’ll probably want to stop it as well. To do that use the stop button click handler to toggle the shouldConnect value to false, closing the WebSocket connection and pause the RecordRTC recorder.

async function stop(e) {
  setShouldConnect(false);
  if (recorder)
    recorder.pauseRecording()
}
Enter fullscreen mode Exit fullscreen mode

Let’s make sure the application is working as expected. Go ahead and try pressing the Start button. You may be prompted to give your application permission to access the microphone. If you are, go ahead and give permission.

Transcription using AssemblyAI

At this point our application is now capable of sending audio to AssemblyAI for transcription. AssemblyAI will perform that transcription and begin sending it back via the same WebSocket connection transcription data.

Start by adding a new state property to store transcribed text:

const [transcription, setTranscription] = useState('');
Enter fullscreen mode Exit fullscreen mode

And use that state property to display the text on our page:

<p id="message">{transcription}</p>
Enter fullscreen mode Exit fullscreen mode

As AssemblyAI transcribes the audio we’re sending, it uses the open WebSocket connection to send back to us messages containing partial transcription results. Those messages look roughly like this:

{
  "message_type":"PartialTranscript",
  "created":"2023-10-13T18:36:02.574861",
  "audio_start":0,
  "audio_end":2230,
  "confidence":0.805910792864326,
  "text":"this is a test",
  "words":[
    {"start":1430,"end":1465,"confidence":0.999489806225809,"text":"this"},
    {"start":1470,"end":1505,"confidence":0.676980979301034,"text":"is"},
    {"start":1910,"end":1945,"confidence":0.763494430232322,"text":"a"},
    {"start":1950,"end":1985,"confidence":0.783677955698138,"text":"test"}
  ]
}
Enter fullscreen mode Exit fullscreen mode

As AssemblyAI continues to receive additional audio, it may send back partial responses containing some or all of the same audio samples. This means that we have to create a way to detect if we need to replace an existing partial response or if we’ve received a final response for a particular time slide of audio.

We’ll do that by using the audio_start value contained in each message and creating a function called processPartialTranscription that can store and update messages based on that value.

const processPartialTranscription = (source) => {
  let msg = "";
  const res = JSON.parse(source.data);
  if (res.text) {
    texts[res.audio_start] = res.text;
    const keys = Object.keys(texts);
    keys.sort((a, b) => a - b);
    for (const key of keys) {
      if (texts[key]) {     
        msg += ` ${texts[key]}`;
      }
    }
  }
  return msg;
}
Enter fullscreen mode Exit fullscreen mode

Each message is added to a dictionary using the audio_start value as the key. If that key already exists then its content is overwritten. Once the data is added to the dictionary it is sorted by its keys (in other words in time order). The sorted keys are looped over to create a single transcription result string.

To receive the WebSocket messages from AssemblyAI the useWebSockets hook gives us access to the last incoming message through the lastMessage property. We can use an Effect based on the lastMessage property to update the transcription text.

Import useEffect from React:

import { useState, useEffect } from "react";
Enter fullscreen mode Exit fullscreen mode

Create a new Effect which checks if lastMessage is not empty and if not passes the last message through the processing function before setting the value to the transcription property:

useEffect(() => {
  if (lastMessage !== null) {
    setTranscription(processPartialTranscription(lastMessage));
  }
}, [lastMessage]);
Enter fullscreen mode Exit fullscreen mode

Run the app. Start the recording and watch as text transcription shows up.

Broadcasting transcription messages

So far in our application we’ve done everything as the Publisher, meaning the person creating the audio is also receiving the transcription text. But closed-captioning is about broadcasting the audio transcription to a wide audience.

AssemblyAI only gives us a single WebSocket connection. To fan out to many transcription consumers we’ll use Ably.

Start by importing the Ably client library into the App component:

import { AblyProvider } from "ably/react";
import * as Ably from 'ably/promises';
Enter fullscreen mode Exit fullscreen mode

Create a new connection to Ably:

const client = new Ably.Realtime.Promise({ authUrl:'/api/tokens/realtime' });
Enter fullscreen mode Exit fullscreen mode

Enclose the Routes component with an AblyProvider, setting its client property to the Ably client you just created. The AblyProvider holds and manages the Ably WebSocket connection for you and makes it accessible to any child component:

<AblyProvider client={ client }>
  <Routes>
    <Route path="/publisher" element={<Publisher />} />
    <Route path="/subscriber" element={<Subscriber />} />
  </Routes>
</AblyProvider>
Enter fullscreen mode Exit fullscreen mode

To broadcast the transcribed messages from the Publisher to Subscribers we need to access an Ably Channel and publish messages to it. The Ably client library includes a set of React hooks that make using Ably in React applications simple.

In the Publisher component import the useChannel object:

import { useChannel } from "ably/react";
Enter fullscreen mode Exit fullscreen mode

Within the component use the useChannel hook to access an Ably channel:

const { channel } = useChannel('closed-captions');
Enter fullscreen mode Exit fullscreen mode

Finally, publish messages to that channel by calling the channels publish method from inside of the processPartialTranscription function:

The application should now begin to publish the JSON formatted transcription data that AssemblyAI is sending the publisher. You can validate that this is happening correctly by logging into the Ably dashboard and connecting to the closed-captions channel using the Dev Console.

Connecting to the closed-captions channel

Create a subscriber component

With messages being published to Ably, we can shift gears to the Subscriber component. We’ll want the application to consume and display the messages being published into the closed-captions channel.

Import the useState hook from React and Ably useChannel hook into the Subscriber component:

import { useState } from 'react';
import { useChannel } from "ably/react";
Enter fullscreen mode Exit fullscreen mode

Add a variable to hold partial snippets and a state value to hold the current transcription message:

const texts = {};

const [transcription, setTranscription] = useState('');
Enter fullscreen mode Exit fullscreen mode

Add some basic HTML markup that defines two buttons and a place to display the transcription text:

return (
  <>
    <div>
      <p id="real-time-title">Listening for transcription messages!</p>
      <p id="message">{transcription}</p>
    </div>
  </>
);
Enter fullscreen mode Exit fullscreen mode

Add the processPartialTranscription method:

const processPartialTranscription = (source) => {
 let msg = "";

 const res = JSON.parse(source.data);
 if (res.text) {
  texts[res.audio_start] = res.text;
  const keys = Object.keys(texts);
  keys.sort((a, b) => a - b);
  for (const key of keys) {
   if (texts[key]) {     
    msg += ` ${texts[key]}`;
   }
  }
 }
Enter fullscreen mode Exit fullscreen mode

useChannel has a callback function that gets called when new messages are received. Use that to receive and process incoming transcription messages and update the transcriptions state value:

const { channel } = useChannel("closed-captions", (message) => {
 setTranscription(processPartialTranscription(message));
});
Enter fullscreen mode Exit fullscreen mode

That’s it! Open the application in two separate browser windows, one with the Publisher route loaded and the other with the Subscriber route loaded. In the publisher view, start the recorder and watch as transcription text begins to show in both the publisher and subscriber views.

Wrap up

Congrats on building this realtime closed-captioning application! In building his application you learned how easy it is to capture realtime audio from the browser, send it to AssemblyAI for transcribing and broadcast the transcription to an audience.

Combining AssemblyAI’s realtime transcription with Ably’s realtime message broadcasting gives you a powerful set of tools for your developer toolbelt.

If you’re looking to take this application further, one idea is to allow users of your application to select a language in which to receive the transcribed text, using a service like https://languageio.com/ to translate the text in realtime.

I'm always interested in how developers are combining tools like AssemblyAI or Language.io with Ably. Tell me how you are mashing tools together by hitting me up on X (Twitter)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .