Understanding Speech Recognition and Building a Voice Controlled To-Do List

Deon Rich - Nov 3 '21 - Dev Community

One of the biggest and most important building blocks of modern technology, hands-down, is AI. Machine learning is a completely different animal when it comes to how machines process information. I consider it to be one of the key stepping stones for bridging the gap between the way machines process and understand data, and the way we humans think and take in information.

It can seem rather impossible for a machine to emulate the capabilities of the human mind, with the unique way we have of learning, understanding, mapping information and extracting context from data. This especially applies when talking about human language and speech. But as always, technology has found a way!

In this post, I thought it would be interesting for us to take a look at how modern speech recognition technology works, as exemplified by technologies such as Google Assistant, Amazon's Alexa and Apple's Siri. Then lastly, we'll look at an example of how we can utilize this tech in our projects by building a voice controlled to-do list using the Web Speech API! 😉

The Building Blocks of Speech Recognition

When it comes to how machines understand and process language, more specifically in the form of audio, there exist two fundamental concepts that must be implemented for speech recognition to be possible:

  • Automatic Speech Recognition (ASR): Though ASR is usually used as an umbrella term for the concepts behind speech recognition, it primarily refers to the process of breaking down speech in the form of audio, and applying algorithms to transcribe the pieces of audio. This is the main concept behind Speech-To-Text programs, and allows a machine to know what you're saying, but not the meaning behind it.

  • Natural Language Processing (NLP): Refers to the process of understanding or emulating language; the act of constructing and/or deconstructing key points in natural speech. This is the main player behind programs like Alexa, which are able to not only know what you're saying, but understand it based on the summary it formulates from your speech (NLU), or even respond back (NLG). The concepts used in NLP are applied in both NLG (Natural Language Generation) and NLU (Natural Language Understanding), as it's used as an umbrella term for both.

Both NLP and ASR are implemented using algorithms based on machine learning, neural networks and deep learning, and are heavily based on linguistics, semantics and statistics. And considering how complex human language is, this is the right approach.

These technologies aren't perfect, however. Human language cannot simply be analyzed like any other set of data. There exists anthimeria, sarcasm, slang, implication, words with double meaning, figures of speech and a whole lot of other quirks that a machine is going to have to learn to identify over time. Not to mention, this all varies from language to language.

So how do ASR and NLP accomplish what they do? Let's take a bit of a closer look...👇

The Process of ASR

The main steps behind ASR that a program will take go as follows. Note that the process may vary depending on a specific program's end goal:

  1. The program receives an audio input.
  2. This audio is refined, as the program attempts to isolate the speech from background noise.
  3. The resulting speech is split into phonemes. Phonemes are small units of sound unique to a language that are commonly used to construct words, and can be used to differentiate one word from another, or to tell where one word may start or another may end.
  4. The phonemes are then analysed, and the AI uses its acquired knowledge of that language to determine the most likely word that would follow based on the sequence of sounds. Once it forms words, the same probability analysis is applied to determine what a sentence might be (a small sketch of this idea follows below).
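
To make step 4 a little more concrete, here's a toy sketch of the idea in JavaScript. The phoneme lexicon and probabilities are made up purely for illustration; a real recognizer relies on trained acoustic and language models rather than a hard-coded table.

// Toy illustration of step 4: picking the most likely word for a phoneme sequence.
// The lexicon and probabilities below are invented for demonstration only.
const lexicon = [
  { word: "to",  phonemes: ["t", "uw"], probability: 0.5 },
  { word: "two", phonemes: ["t", "uw"], probability: 0.3 },
  { word: "too", phonemes: ["t", "uw"], probability: 0.2 },
  { word: "do",  phonemes: ["d", "uw"], probability: 0.9 }
];

function mostLikelyWord(phonemes) {
  // Keep only entries whose phoneme sequence matches, then take the highest probability
  const candidates = lexicon.filter(
    (entry) => entry.phonemes.join(" ") === phonemes.join(" ")
  );
  candidates.sort((a, b) => b.probability - a.probability);
  return candidates.length ? candidates[0].word : null;
}

console.log(mostLikelyWord(["t", "uw"])); // "to"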

The Process of NLP

The main steps behind NLP (or more specifically NLU) that a program will take go as follows. Note that the process may vary depending on a specific program's end goal:

  1. The input speech is separated into sentences. The resulting sentences are then split into separate words; this is called tokenization.
  2. The tokenized words are then analysed and given roles (nouns, verbs or adjectives) based on the surrounding sentence.
  3. Non-lemmas are then lemmatized, meaning they're mapped to their basic form, to signal that they carry the same meaning (broken -> break).
  4. Common words such as "and", "or" and "a" are removed, so as to isolate the words which hold the primary meaning (a small sketch of these first few steps follows after this list).
  5. Dependency Parsing is then performed, and a tree is created, associating words which depend on each other together (chocolate -> best -> ice cream -> flavor).
  6. Named Entity Recognition (NER) is performed, labeling each noun based on the real world thing it's meant to represent (Peter Parker -> fictional character).
  7. Lastly, Coreference Resolution is done on pronouns such as "it", "she", "he" and "they", in order to link them to the noun that they're referring to. Once this is done, the program can properly deduce the meaning behind the speech sample (she -> Lisa Ross).
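
As promised, here's a tiny sketch of steps 1 through 4 in JavaScript. The stop word list and lemma map are deliberately small, made-up examples; real NLP libraries ship far more complete resources and much smarter algorithms.

// Toy sketch of tokenization, stop word removal and lemmatization.
// The stop word set and lemma map are invented for demonstration only.
const stopWords = new Set(["and", "or", "a", "the", "is", "to"]);
const lemmas = { broken: "break", ran: "run", better: "good" };

function preprocess(sentence) {
  return sentence
    .toLowerCase()
    .replace(/[^\w\s]/g, "")                 // strip punctuation
    .split(/\s+/)                            // tokenize into words
    .filter((word) => !stopWords.has(word))  // drop common "filler" words
    .map((word) => lemmas[word] || word);    // map to base forms where known
}

console.log(preprocess("The vase is broken and she ran to the store."));
// ["vase", "break", "she", "run", "store"]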

Of course it's important to remember that there is much more that goes into these processes within a real implementation of NLP and ASR. In order to actually execute each of these steps, advanced algorithms and methods are utilized, such as Hidden Markov Models, Dynamic Time Warping, and Neural Networks to name a few.
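
As a taste of one of those methods, here's a minimal Dynamic Time Warping sketch. DTW measures how similar two sequences are even when they're stretched or compressed in time, which is handy when the same word is spoken at different speeds. The sequences here are plain numbers standing in for audio features.

// Minimal Dynamic Time Warping: the cumulative cost of the best alignment
// between two sequences, allowing one to be stretched relative to the other.
function dtwDistance(a, b) {
  const rows = a.length, cols = b.length;
  // dp[i][j] = minimal cost of aligning a[0..i-1] with b[0..j-1]
  const dp = Array.from({ length: rows + 1 }, () =>
    new Array(cols + 1).fill(Infinity)
  );
  dp[0][0] = 0;
  for (let i = 1; i <= rows; i++) {
    for (let j = 1; j <= cols; j++) {
      const cost = Math.abs(a[i - 1] - b[j - 1]);
      dp[i][j] = cost + Math.min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]);
    }
  }
  return dp[rows][cols];
}

console.log(dtwDistance([1, 2, 3], [1, 1, 2, 3])); // 0 — same shape, different speed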

Anyway, now that we've got a good idea of how ASR functions, let's get our hands dirty and look at how we can use it in our code by utilizing the Web Speech API, and building a Voice Controlled To-Do List! 😁

Note: The Web Speech API is still in an experimental phase. It may not be supported in a given browser, and its implementation is still incomplete. That being said, it should be used in personal projects only. There already exist stable STT APIs out there such as those listed here, but I'm going with the Web Speech API as it's extremely simple, easily accessible and suits our needs.

Building a Voice Controlled To-Do List

Before I walk you through our example, let me first show you the finished product. Here, however, the embed isn't allowed media access, which breaks its functionality. If you want to see how it functions you can view it here. Anyway, here's what it will look like:

The functionality is pretty simple. Once the switch is flipped, the speech recognition service will start listening for speech. Our program will first expect the user to give a title, and once that's given it will then expect a description. After the description is spoken, a new task will be added to the UI, containing the title and description the user entered. The state of the program (whether it's active, or what piece of information it's expecting next) will be expressed in the message above the switch.

Simple, right? Let's go over the code...

Using the Speech Recognition Service

First, let's cover the most important step, which is starting the speech recognition service through the Web Speech API. When I say "speech recognition service", I'm referring to the default speech recognition service built into Chrome. Our audio is captured and sent to a server (the speech recognition service) where it's processed, and then sent back.

First, we need to establish a connection with it:

// Setup recognition service
if (window.webkitSpeechRecognition || window.SpeechRecognition) {
  const recog = new (window.webkitSpeechRecognition || window.SpeechRecognition)();
  const grammarList = new (window.webkitSpeechGrammarList || window.SpeechGrammarList)();
  recog.grammars = grammarList;
  recog.lang = "en-US";
} else {
  // Alert user if API isn't supported
  alert("Sorry, your browser doesn't support the Web Speech API!");
}

Here all we do is first ensure that the API exists within the window object (checking through window avoids an error in browsers that don't define it). Once that's done we instantiate a new SpeechRecognition object, which is the interface for interacting with the speech recognition service.

Its primary events and methods are as follows:

  • start(): Begin listening for speech.
  • stop(): Stop listening for speech.
  • abort(): Stop listening for speech without returning a result.
  • result: Fires when a result is returned from the recognition service. The result is passed to the callback.
  • end: Fires when the speech recognition service is stopped.
  • start: Fires when the speech recognition service is started.
  • speechstart: Fires when speech is detected.
  • speechend: Fires when speech is no longer detected.

Then I attach a SpeechGrammarList to it via SpeechRecognition.grammars. SpeechGrammarList is an object which is meant to hold a list of grammars. A grammar (though through this API it's given as a string) is a special file that is sent to the speech recognition service, which gives it the grammar, key words or phrases that it should be listening for. Through grammars you're also able to tell it which phrases it should listen for more than others, by specifying their "weight".

Grammars are typically specified using the SRGS (Speech Recognition Grammar Specification) format, or the JSGF (Java Speech Grammar Format). However, at the moment this API doesn't support grammars very well, as they have little effect on the results of the speech recognizer. Thus, I give it an empty SpeechGrammarList.
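
For reference, if you did want to supply one, adding a JSGF grammar string would look roughly like the snippet below. The grammar itself is just a made-up example, and since grammars currently have so little effect, the to-do list skips this step entirely.

// Hypothetical example only: register a small JSGF grammar with the list from above
const commands =
  "#JSGF V1.0; grammar commands; public <command> = add | remove | title | description ;";
grammarList.addFromString(commands, 1); // the second argument is the weight (0 to 1)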

After that, we set the language the recognizer should listen for via the SpeechRecognition.lang property, which in this case is English.

And that's really all there is to getting up and running. Now we just need to fill in the gaps to integrate it into our to-do list!
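
Just to illustrate, a bare-bones use of the recognizer we set up above could look something like this; the to-do list below wires up these same pieces with a bit more logic.

// Minimal sketch: start listening and log whatever transcript comes back
recog.addEventListener("result", (event) => {
  const transcript = event.results[event.results.length - 1][0].transcript;
  console.log("You said:", transcript);
});

recog.addEventListener("end", () => {
  console.log("Recognition service stopped.");
});

recog.start(); // prompts for microphone permission the first time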

Putting it All Together

Because the UI isn't as important, I'll only be going over the JavaScript, but you can give it a closer look here or in the embed I showed earlier. You can of course make the UI look however you want if you intend on following along.

The main idea is that we simply have a main button to toggle the speech recognition service, a message to indicate the state of the program (active, listening, or what info it's expecting), and an area where the resulting tasks will appear.

To wrap up, I'll briefly go over each segment of functionality.

// Only proceed if API is Supported...

if (window.webkitSpeechRecognition || window.SpeechRecognition) {

/*
"active" will be used to keep track of weather 
or not the service is active.

"phase" will be used to keep track of what 
information is currently being 
expected (either the title or description).

"task" is simply a container for our information 
when results are received.
*/

  let active = false,
    phase = undefined,
    task = {};

//________________________________________

/*
Listen for when the switch is toggled. 
If the service is inactive, start the service 
and set the phase to "title" since 
this is the first piece of info expected once 
the service is listening. 
If the service is active, do the opposite.
*/

// select message element above switch
  const message = document.querySelector(".container__message");

// select toggle switch
  const button = document.querySelector(".container__button");

  button.addEventListener("click", () => {
    if (!active) {
      recog.start();
      active = true;
      phase = "title";
      message.innerText = "waiting for title...";
    } else {
      recog.abort();
      active = false;
      phase = undefined;
      message.innerText = "Flip switch to toggle speech recognition";
    }
  });
// ________________________________________

/*
"addTask()" will be after both the title and description are 
spoken. This will hide the placeholder 
and fill a copy of the tasks template with the 
respective information. It'll then be 
appended to the tasks container
*/

// Select container where tasks will appear, its placeholder and the template for the tasks.
  const tasks = document.querySelector(".container__tasks"),
    placeholder = document.querySelector(".container__tasks__placeholder"),
    template = document.querySelector("#task-template");

  // Function for appending tasks
  function addTask() {
    placeholder.style.display = "none";
    let content = template.content.cloneNode(true);
    content.querySelector("p").innerText = task.desc;
    let n = content.querySelector("div");
    content.querySelector("summary").innerText = task.title;
    content.querySelector(".x").addEventListener("click", () => {
      tasks.removeChild(n);
      if (tasks.children.length === 2) {
        placeholder.style.display = "block";
      }
    });
    tasks.appendChild(content);
  }

//________________________________________

/* Setting up the recognition service, 
as already explained previously */

 // Setup recognition service
  const recog = new (window.webkitSpeechRecognition || window.SpeechRecognition)();
  const grammarList = new (window.webkitSpeechGrammarList || window.SpeechGrammarList)();
  recog.grammars = grammarList;
  recog.lang = "en-US";

//________________________________________

/* Inform user that service is listening when speech is detected */

  // Let user know recognition service is listening
  recog.addEventListener("speechstart", () => {
    message.innerText = "listening...";
  });

//________________________________________

/*  
Register an event listener for when the result comes in, 
which will be each time the user stops 
speaking and speech was recognized. 

In the callback, if the phase is currently 
"title" (the program is waiting for the title) 
add the title to the "task" object and 
switch phase to "desc".

If the phase is currently "desc" 
(we already have the title, and are waiting for the description) 
add the description to the "task" object, call "addTask()" 
and inform the user that the task was successfully added.
*/

  // Determine what to do with result once it comes in
  recog.addEventListener("result", (res) => {
    let result = res.results[res.results.length - 1][0].transcript;
    switch (phase) {
      case "title":
        task.title = result;
        message.innerText = "waiting for description...";
        phase = "desc";
        break;
      case "desc":
        task.desc = result;
        message.innerText = "task added!";
        phase = "title";
        window.setTimeout(() => {
          message.innerText = "waiting for title...";
        }, 2000);
        addTask();
        break;
    }
  });

//________________________________________

  // Keep service open by restarting it, since it ends after each speech segment it receives.
  recog.addEventListener("end", () => {
    if (active) recog.start();
  });

  // Cancel service if error occurs
  recog.addEventListener("error", () => {
    recog.abort();
  });
} else {
  // Alert user if API isn't supported
  alert("Sorry, your browser doesn't support the Web Speech API!");
}

Conclusion

And there you have it folks! An introduction to how ASR works, and a small example of how you can implement it into your projects. If you want to dive deeper into ASR, NLP or the Web Speech API, you should check out these resources below...👇

ASR: https://verbit.ai/asr-and-the-next-generation-of-transcription/
NLP: https://medium.com/@ritidass29/the-essential-guide-to-how-nlp-works-4d3bb23faf76
Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API

Thanks for reading..😊!
