This will be the last post in the AudioSign series, where we'll look at the possible ways it could be built.
🛠 Building AudioSign
The system would work by taking audio input, translating it into sign language and then having it signed by an animated character. There are two ways this could be implemented:
- Live generation: real-time translation from live speech directly to animated signing.
- Saved generation: translation of existing data, e.g. uploading an audio file and converting it to animated signing.
For the explanations that follow, we'll use ASL (American Sign Language), since it's the most widely used sign language in America and has enough data available for this task. Part of signing involves spelling out words with the fingers (fingerspelling), so you could think of it as texting, but with some added nuance.
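To make the fingerspelling idea concrete, here's a minimal Python sketch that maps a word to a sequence of per-letter handshape clips. The clip names and the `fingerspell` helper are hypothetical placeholders for whatever animation assets such a system would actually use.

```python
# Hypothetical mapping from letters to handshape animation clip IDs;
# in a real system these would point at mo-cap or keyframed assets.
HANDSHAPE_CLIPS = {letter: f"asl_letter_{letter}" for letter in "abcdefghijklmnopqrstuvwxyz"}

def fingerspell(word: str) -> list[str]:
    """Return the ordered handshape clips needed to spell out a word."""
    return [HANDSHAPE_CLIPS[ch] for ch in word.lower() if ch in HANDSHAPE_CLIPS]

print(fingerspell("ASL"))  # ['asl_letter_a', 'asl_letter_s', 'asl_letter_l']
```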
Usually, when we speak or have a conversation, it's not just our mouths that move. We aren't deadpan reciting words: our facial expressions, body language, and the tone and intensity of our words all play a role in getting the message across. The same is true for signing. These actions are what we call non-manual signals; they are a natural part of conversations when we speak, and they also exist in sign language. For example, when answering a yes/no question, we instinctively make a subtle movement to affirm or negate our answer, and in signing this movement is a necessary part of the conversation.
So, if we were to build a system that can take audio input, convert it to ASL and then animate a 3D character signing the translation, these are its necessary functionalities (a rough sketch of such a pipeline follows the list):
- It should be able to take the audio input and convert it to ASL while understanding/storing context along the way.
- Using an ASL database and mo-cap (motion capture) of a human translator, it should learn how sentences are structured in ASL and generalise to sentences it hasn't seen.
- Context should not be lost in translation: intensity, tone and expressions should be properly conveyed in the animation.
- The animation should be smooth and natural to put the receiving party at ease.
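One way to picture those requirements is as stages of a single pipeline. The sketch below is purely illustrative; every class and function name here is an assumption about how the pieces might be organised, not a finished design.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    text: str                                    # transcript of the spoken audio
    intensity: float = 0.0                       # tone/emphasis carried alongside the words
    context: dict = field(default_factory=dict)  # running conversation context

@dataclass
class SignSequence:
    glosses: list[str]   # ASL glosses in signed word order
    non_manual: dict     # facial expression / head movement cues
    clips: list[str]     # mo-cap clips to blend into the final animation

def translate(utterance: Utterance) -> SignSequence:
    """Map an English utterance to ASL glosses plus non-manual signals.
    Placeholder: a real system would use a learned translation model."""
    raise NotImplementedError

def animate(sequence: SignSequence) -> None:
    """Blend the mo-cap clips into one smooth animation of the signing avatar.
    Placeholder: a real system would drive a 3D character rig."""
    raise NotImplementedError
```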
👉👉 Likely workflow to achieve this
There exist ASL databases that could be combined with Deepgram's speech-to-text models to learn translation from the spoken domain to the signed domain. A second system could then use the ASL database to map mo-cap recordings of a deaf translator signing each entry, creating an animated version of the database.
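As a rough sketch of the first half of that workflow, the snippet below sends a pre-recorded audio file to Deepgram's speech-to-text endpoint and then looks each word up in a gloss dictionary. The `ASL_GLOSSES` dictionary and `to_glosses` helper are hypothetical stand-ins for a real ASL database, and the exact request/response shape may differ between Deepgram API versions; real English-to-ASL translation also reorders sentences rather than mapping word for word.

```python
import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"
API_KEY = "YOUR_DEEPGRAM_API_KEY"  # replace with a real Deepgram API key

def transcribe(audio_path: str) -> str:
    """Send a pre-recorded audio file to Deepgram and return the transcript."""
    with open(audio_path, "rb") as audio:
        response = requests.post(
            DEEPGRAM_URL,
            headers={"Authorization": f"Token {API_KEY}", "Content-Type": "audio/wav"},
            data=audio,
        )
    response.raise_for_status()
    return response.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

# Hypothetical stand-in for an ASL gloss database keyed by English words.
ASL_GLOSSES = {"hello": "HELLO", "my": "MY", "name": "NAME"}

def to_glosses(transcript: str) -> list[str]:
    """Look up each word; unknown words fall back to fingerspelling (FS- prefix)."""
    return [ASL_GLOSSES.get(word, f"FS-{word.upper()}") for word in transcript.lower().split()]
```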
NB: context is not easy for AI to extract from audio; many researchers are working on sentiment analysis, but no reliable method has been developed yet.
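As one illustration of why this is hard, a crude text-only proxy like NLTK's rule-based VADER sentiment scorer (shown below) only captures coarse polarity from the words themselves; tone of voice, sarcasm and emphasis in the raw audio are lost, which is exactly the context that non-manual signals depend on.

```python
# A crude text-only proxy for "context": rule-based sentiment scoring with NLTK's VADER.
# Requires: pip install nltk, then nltk.download("vader_lexicon") once.
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I can't believe you actually did that!")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# The compound score could be mapped to the intensity of a non-manual signal,
# but it knows nothing about how the sentence was actually spoken.
```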
These two systems could then be combined into a single smooth workflow that takes the audio input, converts it to text, maps the text to ASL signs, and animates a character signing them.
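Chaining the hypothetical pieces from the sketches above, the whole flow reduces to a few lines (every function here is a placeholder defined earlier, not a working implementation):

```python
def audio_to_signing(audio_path: str) -> None:
    """End-to-end sketch: audio file -> transcript -> ASL glosses -> signing animation."""
    transcript = transcribe(audio_path)   # Deepgram speech-to-text
    glosses = to_glosses(transcript)      # map words to ASL glosses
    sequence = SignSequence(glosses=glosses, non_manual={}, clips=[])
    animate(sequence)                     # drive the 3D signing avatar
```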
🌟 With more advances in tech, it would be amazing if this could be used in augmented reality to have an animated signing translator with you at all times.
Conclusion
Millions of people could potentially benefit from this, and it would surely make the world more accessible to all. I hope you’ve learnt as much as I have through this series. Have a nice day and keep being creative 👋