Demo: audio classification with the Audio Spectrogram Transformer

Julien Simon - Feb 23 '23 - Dev Community

Multi-modal transformers are rising fast. A great example is the Audio Spectrogram Transformer (AST), an audio classification model that was just added to the Hugging Face Transformers library. The model first converts an audio clip into a spectrogram image, then classifies that image with a Vision Transformer. Amazing results!
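To make the "audio to image" idea concrete, here is a minimal NumPy sketch of the front end only. It is a simplification, not the model's actual preprocessing: the paper uses log Mel filterbank features and overlapping 16x16 patches, while this sketch uses a plain log-magnitude STFT and non-overlapping patches. The function names, the 16 kHz sample rate, the frame sizes and the synthetic 440 Hz tone are all assumptions chosen for illustration.

```python
import numpy as np


def log_spectrogram(signal, frame_len=400, hop=160):
    """Log-magnitude spectrogram: one row per time frame, one column per frequency bin."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) on silent bins.
    return np.log(magnitude + 1e-6)


def to_patches(image, patch=16):
    """Cut a 2-D array into non-overlapping patch x patch tiles, each flattened to a vector."""
    rows = image.shape[0] // patch * patch
    cols = image.shape[1] // patch * patch
    image = image[:rows, :cols]  # drop the ragged border
    tiles = image.reshape(rows // patch, patch, cols // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch)


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(clip)   # the "image" a vision model would see
patches = to_patches(spec)     # the token sequence a Vision Transformer would consume
print(spec.shape, patches.shape)
```

Each row of `patches` plays the role of one input token. In the real model, these tokens are linearly projected, combined with position embeddings, and passed through the transformer encoder before the classification head.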

✅ Spaces demo: https://huggingface.co/spaces/juliensimon/keyword-spotting

✅ Model: https://huggingface.co/MIT/ast-finetuned-speech-commands-v2

✅ Paper: https://arxiv.org/abs/2104.01778
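If you'd rather try the model locally, one way is the `audio-classification` pipeline from the Transformers library. This is a sketch, not necessarily how the Space above is implemented: `spot_keyword` is a hypothetical helper name, and it assumes `transformers` and a PyTorch backend are installed, plus `ffmpeg` for decoding audio files from disk.

```python
import sys

MODEL_ID = "MIT/ast-finetuned-speech-commands-v2"


def spot_keyword(audio_path, top_k=3):
    """Return the top_k predictions as a list of {'label', 'score'} dicts."""
    # Imported lazily so that loading this file does not require transformers.
    from transformers import pipeline

    # The first call downloads the model weights from the Hugging Face Hub.
    classifier = pipeline("audio-classification", model=MODEL_ID)
    return classifier(audio_path, top_k=top_k)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python spot_keyword.py path/to/clip.wav
    for prediction in spot_keyword(sys.argv[1]):
        print(f"{prediction['label']}: {prediction['score']:.3f}")
```

The pipeline handles resampling, feature extraction and the forward pass, so a path to a short audio clip is all you need to pass in.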
