It's been a while, but this blog post has a first part that I invite you to read to fully appreciate this one.
Last time, we talked about how STT works and wondered whether a similar technique could work for Speech Emotion Recognition (SER) to detect emotions through voice.
Well, the good news is yes! And we'll take advantage of it because that will be the only good news in this blog post, I warn you!
So yes: for STT, the waveform of our word or phrase is what allows us to transcribe audio into text. Here are examples of the waveform of the same phrase spoken with different intonations:
Kids are talking by the door
The spoken phrase is as follows: "Kids are talking by the door."
And here's the waveform with a neutral tone:
You have to zoom in a bit to see anything, but overall, you can see that it's pretty flat, which is normal for a neutral emotion, or should I say, an absence of emotion!
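As an aside, this kind of waveform plot is easy to reproduce yourself. Here is a minimal sketch using librosa and matplotlib; the file name is just a hypothetical placeholder for one of the recordings above:

```python
# Minimal sketch: plot the waveform of a recording.
# "kids_neutral.wav" is a hypothetical placeholder file name.
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("kids_neutral.wav", sr=None)  # keep the original sample rate

plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr)                 # waveplot() in older librosa versions
plt.title("Kids are talking by the door (neutral)")
plt.tight_layout()
plt.show()
```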
So we can clearly perceive a difference between these waveforms, a difference we will probably be able to exploit. We can even go further by looking at other ways to represent sound, for example with a spectrogram, and here's what it looks like:
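For the curious, here is a minimal sketch of how such a spectrogram can be computed with librosa; the mel scale and 128 bands are common defaults, and the file name is again a hypothetical placeholder:

```python
# Minimal sketch: compute and display a mel spectrogram.
# "kids_neutral.wav" is a hypothetical placeholder file name.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("kids_neutral.wav", sr=None)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # power spectrogram on the mel scale
S_db = librosa.power_to_db(S, ref=np.max)                   # convert power to decibels

plt.figure(figsize=(10, 4))
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram")
plt.tight_layout()
plt.show()
```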
Of course, the Machine Learning models that work for STT don't work as-is for emotions. We'll see this in more detail in the next blog post, but that's the first piece of bad news: we need to find and apply other models.
A towering challenge
The second real piece of bad news is the dataset needed to train the model. For STT, we have plenty. A lot, even: gigabytes and gigabytes, easily accessible, in all languages. So there's no problem getting enough data to train a model that can give satisfactory results in production.
For SER, however... it's another story. We do have datasets, but they're quite small and mainly intended for research, i.e., for comparing SER Machine Learning models and determining which one performs best.
To explain the difference, here's a dataset for an STT model for Ukrainian: Ukrainian Speech, Speech To Text (STT/ASR). It's 7GB and we have about 7 hours of audio. That's very small for STT; for other datasets, we can find terabytes of data and over 20,000 hours of audio in just one dataset.
Here's a dataset for an SER model in English: Speech Emotion Recognition (en). It's... 1.64... GB, and it's actually already several different datasets put together.
In addition to this difference in size, we must not forget that for STT every audio file feeds the same single task, whereas for SER each audio file covers only one emotion. So, to have as many samples per emotion as an STT dataset has in total, an SER dataset with 4 emotions would need to be 4 times bigger! (And SER datasets generally have more than 4 emotions, which further reduces the number of audio files per emotion.)
In short, models and datasets are the two problems that need to be solved before going any further.
Let's start with perhaps the simplest one to address: the datasets. How do we get more data?
Data augmentation
The real solution is, of course, to collect more, which is easier said than done, especially since voice is biometric data and therefore not something people hand over easily. You can always add your own voice, or that of your friends and family, to the existing dataset, but even with a large family, that's not going to really improve it.
At best, having your own voice in the training data of your model makes it de facto more effective on your own voice, and therefore allows you to test your use cases more effectively. But this feeling of success will quickly fade when your model is confronted with real data not present in the training data.
So, it's a good solution for a demo, but it won't make your model ready to be used in production.
However, there are techniques to easily expand your dataset; this is called data augmentation.
First, you can add noise. This technique consists of copying all your audio files and adding noise to each copy (white noise, background noise, TV...). This easily doubles your dataset. Adding noise also makes your model more robust in real-world cases, where there is often background noise when people speak. A small code sketch follows the audio examples below.
Here is, for example, a clean audio clip:
Then the same clip with noise added:
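Here is what this can look like in practice, as a minimal sketch assuming the files are loaded with librosa and written back with soundfile; the file names and the 0.005 noise level are arbitrary example values:

```python
# Minimal sketch: create a noisy copy of an audio file (white noise).
# File names and the noise level are arbitrary example values.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("clean_sample.wav", sr=None)   # hypothetical input file

noise = np.random.normal(0.0, 1.0, size=y.shape)    # white noise
y_noisy = np.clip(y + 0.005 * noise, -1.0, 1.0)     # mix at low level, keep a valid range

sf.write("noisy_sample.wav", y_noisy, sr)           # one new augmented sample
```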
Then, you can change the speed of the files. As with the first technique, you copy your original files, then slightly speed up or slow down each audio file. Again, your dataset will be doubled or even tripled, and it will also let your model train on audio files of different durations.
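A minimal sketch of this, again with librosa and soundfile; the 0.9 and 1.1 rates are arbitrary but typical values for a slight slow-down and speed-up:

```python
# Minimal sketch: create slightly slower and faster copies of an audio file.
# File names and stretch rates are arbitrary example values.
import librosa
import soundfile as sf

y, sr = librosa.load("clean_sample.wav", sr=None)    # hypothetical input file

for rate in (0.9, 1.1):                              # <1.0 slows down, >1.0 speeds up
    y_stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(f"sample_speed_{rate}.wav", y_stretched, sr)
```

Note that time_stretch changes the speed without altering the pitch; if you also want pitch variation, librosa.effects.pitch_shift is the usual companion.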
Finally, there is the technique already used by the dataset mentioned above: mixing several datasets together.
This is of course a valid technique, but you need to pay attention to a few points. The main risk is ending up with a dataset in which certain types of voices, or even certain individual speakers, are overrepresented compared to others. Your machine learning model is then likely to pick up biases, which is exactly what we normally try to avoid.
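One simple safeguard is to count samples per emotion (and ideally per speaker) after merging. Here is a minimal sketch, assuming each dataset is just a list of (file path, emotion label) pairs; the entries are hypothetical placeholders:

```python
# Minimal sketch: check class balance after merging several SER datasets.
# The entries below are hypothetical placeholders, not real file names.
from collections import Counter

dataset_a = [("a/clip_001.wav", "angry"), ("a/clip_002.wav", "neutral")]
dataset_b = [("b/clip_101.wav", "angry"), ("b/clip_102.wav", "sad")]

merged = dataset_a + dataset_b
counts = Counter(label for _, label in merged)
print(counts)  # quick sanity check for over/under-represented emotions
```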
So, we now have a dataset. It's nothing exceptional, and not enough to hope for production use, but it's a first step that will allow us to create a model, build a proof of concept, and keep moving in the right direction.
In the next post, we'll go a little deeper into the technique, especially in building the model.
A big thank you for exploring this topic with me! If you found it interesting, share your thoughts in the comments below. 🚀 Don't forget to follow me here or on Twitter @pykpyky to stay updated on upcoming adventures. Your comments and support are always welcome!