Speech Transcription 2.0: Emojis, the New Language of Emotion! 😎

Mary đŸ‡ĒđŸ‡ē 🇷🇴 đŸ‡Ģ🇷 - Feb 2 - Dev Community

Warning: Original Text in French

This text was originally written in French. If you can read French, I recommend referring to the original version for a more accurate understanding: French version


In the last episode, I shared a slice of life at Microsoft, explaining how diversity and inclusion translate into our daily work and products.

Last time I left you with 3 issues related to automatic transcription through speech-to-text or STT.

  1. Recognition, though very impressive, is not perfect, especially if you have an accent.
  2. Even very short delays inevitably create lag and make interruptions difficult.
  3. And third... I didn't mention the third one to avoid spoiling my 2023 project, but since it's now 2024, I can tell you: there is a crucial piece of information that speech-to-text simply ignores, and that's the intonation of your voice.

Intonation and Communication

Intonation plays a crucial role in communication, as it can convey emotional nuances, intentions, questions, statements, or exclamations. When I write the following sentence: "let's go," it can be interpreted differently:

1."Let's go?" with a questioning tone asking if we should proceed.
2."Let's go!" with an enthusiastic tone showing my excitement to go.
3."Let's go." with a neutral tone just giving you factual information.
4."Let's go?!" with a scared tone asking for confirmation that the big spider on my sweater has been exterminated.

If someone says a sentence orally, it can have multiple meanings based on the intonation of their voice, and this information is completely lost when using STT.

I realized this because, when asking a question in English, I tend not to invert the subject and verb or to use the auxiliary "do," which is quite common in French as well. The person listening then realizes it's a question through the tone of my voice. But if the person in front of me can't hear me and can only read, they might take for a statement what was, for me, a question.

And that's very annoying!

So, Now What?

Now that it's said, what can we do? Well, first, we could look for a solution that would ideally fill this gap. In writing, we've already found an excellent system to convey emotions: emojis 😊

I'm happy 😊, I'm not happy ☚ī¸, I'm surprised 😮, I'm angry 😡.

Emojis are definitely the best way in writing to convey emotion and translate the intonation needed to read a sentence.

Naturally, a question arises: we can transcribe hours of meetings in real time, so why don't we have emojis integrated into the transcript?

So, a quick aside: there are software programs that chop a video into shorts, slap illegible subtitles on it, and randomly throw in emojis.

These services are mediocre, bad, even very bad; in fact, it's terrible. Besides writing unreadable word-by-word subtitles, the emojis don't describe an emotion at all; they just serve as a Christmas garland.

End of the aside.

So why don't we still have emojis in Teams that would translate our intention? Well, because transcription tools use speech-to-text, and speech-to-text is used to recognize words and words only.
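To see what I mean, here's roughly what a transcription call looks like with the Azure Speech SDK for Python, as a minimal sketch (the key, region, and audio setup are placeholders, not the actual Teams pipeline). Everything the recognizer hands back is plain text: pitch, volume, and intonation are already gone.

```python
# Minimal speech-to-text sketch with the Azure Speech SDK
# (pip install azure-cognitiveservices-speech). Key and region are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)  # default microphone

result = recognizer.recognize_once()  # listen for a single utterance
print(result.text)  # -> "let's go" ... question, cheer, or statement? No way to tell.
```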

The Basics

Speech-to-text can undoubtedly be done in several ways, but I will present to you a fairly classic, effective, and easy-to-understand way.

Speech is the production of sounds that carry meaning. To transcribe a sound into writing, we use phonetics, and phonetics is entirely useless for a computer. We humans can barely read it, so a program, really?

So phonetics doesn't help us much currently, but wait.

Another way to visually represent a sound is the spectrogram or waveform.

Here's a "Yes" waveform:

Image: Yes waveform

And here's a "No" waveform:

Image: No waveform

So now it's quite easy to identify what a word is. It's a sound that has unique characteristics in terms of tone, length, etc.
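If you want to play with this yourself, here's a minimal sketch of how to produce such a waveform in Python with librosa and matplotlib (the file name "yes.wav" is just a placeholder for any short recording):

```python
# Load a short recording and draw its waveform
# (pip install librosa matplotlib; "yes.wav" is a placeholder file name).
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("yes.wav", sr=16000)  # samples as a NumPy array + sample rate

plt.figure(figsize=(8, 3))
librosa.display.waveshow(y, sr=sr)  # time on the x-axis, amplitude on the y-axis
plt.title("Waveform of 'Yes'")
plt.tight_layout()
plt.show()
```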

The two waveforms I gave you don't come from my voice but from a classic dataset used to train STT models.

If I transform my voice into a waveform, here's what it looks like:

"Yes"

Image: My voice saying "Yes" waveform

"No"

Image: My voice saying "No" waveform

We can see a similarity between my voice and the model's, particularly on "Yes," less on "No," probably because I don't emphasize the beginning of the word but the end.

Speech-to-text models are trained on audio datasets, which are transformed into representations like these, from which we can extract features and classify them into words.
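As a toy illustration of that recipe (a rough sketch with hypothetical file names, not how production models are actually built), we can turn each clip into a compact feature vector, for example MFCCs, and train a small classifier on labeled examples:

```python
# Toy "yes"/"no" classifier: MFCC features + nearest-neighbor classification.
# File names are hypothetical; real datasets contain thousands of recordings.
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path):
    """Load an audio clip and summarize it as a fixed-size MFCC feature vector."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
    return mfcc.mean(axis=1)  # average over time -> one 13-dimensional vector per clip

# Tiny labeled "dataset"
X = [mfcc_features(f) for f in ["yes_01.wav", "yes_02.wav", "no_01.wav", "no_02.wav"]]
labels = ["yes", "yes", "no", "no"]

model = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(model.predict([mfcc_features("my_voice.wav")]))  # -> ['yes'] or ['no']
```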

A well-trained model can extract features from my voice that will help it determine if I said "Yes," "No," or "Pandi Panda."

Image: My voice saying "Pandi Panda" waveform

So we understand a bit better how speech recognition works and how words get recognized. It works quite well, which is cool. But then, what about emotions? How do we capture those?

Can we use the same technique with spectrograms and waveforms?

Well, let's see that in the next part!


A big thank you for exploring this topic with me! If you found it interesting, share your thoughts in the comments below. 🚀 Don't forget to follow me here or on Twitter @pykpyky to stay updated on upcoming adventures. Your comments and support are always welcome!
