An In-Depth Look at Audio Classification Using CNNs and Transformers

Aditi Baheti - Jul 3 - Dev Community

Introduction

Audio classification is a fascinating area of machine learning that involves categorizing audio signals into predefined classes. In this blog, we will delve into the specifics of an audio classification project, exploring the architectures, methodologies, and results obtained from experimenting with Convolutional Neural Networks (CNNs) and Transformers.

Dataset

The project utilized the ESC-50 dataset, a compilation of environmental audio clips categorized into 50 different classes. Specifically, the ESC-10 subset was used, narrowing the dataset to 10 categories for more focused experimentation.
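The post does not include the data-loading code, but a minimal sketch of how the ESC-10 subset can be pulled out of the ESC-50 metadata and converted to log-mel spectrograms might look like the following. It assumes pandas and torchaudio, and the local paths (META_CSV, AUDIO_DIR) are hypothetical placeholders:

```python
import pandas as pd
import torch
import torchaudio

# Hypothetical local paths; adjust to wherever the ESC-50 download lives.
META_CSV = "ESC-50-master/meta/esc50.csv"
AUDIO_DIR = "ESC-50-master/audio"

meta = pd.read_csv(META_CSV)
esc10 = meta[meta["esc10"]]                      # keep only the 10-class subset
labels = sorted(esc10["category"].unique())
label_to_idx = {c: i for i, c in enumerate(labels)}

# ESC-50 clips are 5 s at 44.1 kHz; a log-mel spectrogram is a common CNN input.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_mels=64)

def load_clip(row):
    """Load one clip and return (spectrogram, label index, fold number)."""
    wav, _ = torchaudio.load(f"{AUDIO_DIR}/{row.filename}")
    spec = torch.log(to_mel(wav) + 1e-6)         # shape: (1, n_mels, time)
    return spec, label_to_idx[row.category], row.fold
```

The fold column kept here is what makes the per-fold cross-validation in the results section possible.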

Architecture 1: Convolutional Neural Networks (CNNs)

Initial Setup

The initial model setup for audio classification relied on CNNs. The network stacks convolutional layers that progressively extract features from the audio signal, increasing the number of output channels from 16 to 64. Each convolutional layer is followed by a max-pooling layer that reduces the spatial dimensions and highlights the most salient features.
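The post does not reproduce the exact SoundClassifier code; the sketch below is only a minimal PyTorch illustration of the layout described above, with channels growing from 16 to 64 and max-pooling after each convolution (the class name and kernel sizes are assumptions):

```python
import torch.nn as nn

class SimpleAudioCNN(nn.Module):
    """Minimal sketch: channels grow 16 -> 32 -> 64, with max-pooling after
    each convolution. Expects log-mel inputs of shape (batch, 1, n_mels, time)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```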

Original Model

The original model focused solely on feature extraction, without dropout, early stopping, or other regularization techniques. This produced a basic yet effective structure for capturing the complex patterns in audio data.

Enhanced Model

To combat overfitting and improve generalization, several enhancements were made:

  • Dropout: Introduced to randomly deactivate neurons during training, thereby preventing over-reliance on specific paths.
  • Early Stopping: Implemented to halt training when validation performance plateaued, ensuring the model does not overfit to the training data.
  • Regularization: Additional techniques were employed to further stabilize the training process and enhance generalization.
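As a rough illustration of how these enhancements fit together, here is a sketch of a dropout-equipped classifier head and a patience-based early-stopping loop. The train_one_epoch and evaluate callables, the dropout rate, and the patience value are placeholders, not the author's actual code:

```python
import copy
import torch.nn as nn

# Dropout can be inserted between the feature extractor and the final linear layer.
head = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.5), nn.Linear(64, 10))

def train_with_early_stopping(model, train_one_epoch, evaluate, patience=5, max_epochs=100):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    model.load_state_dict(best_state)   # restore the best checkpoint seen
    return model
```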

Results

The use of k-fold cross-validation, with fold 1 reserved for validation, provided a comprehensive evaluation of the model's performance. Key observations from hyperparameter tuning include:

  • Reduced Overfitting: The enhanced model achieved lower test losses and higher ROC AUC values on every fold, and higher test accuracies and F1 scores on all folds except fold 5, compared to the original model.

The following table summarizes the performance across different folds:

| Metric | Fold 2 (Original) | Fold 2 (Enhanced) | Fold 3 (Original) | Fold 3 (Enhanced) | Fold 4 (Original) | Fold 4 (Enhanced) | Fold 5 (Original) | Fold 5 (Enhanced) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Avg. Training Accuracy | 63.49% | 51.15% | 68.77% | 43.67% | 68.64% | 55.49% | 67.55% | 49.84% |
| Avg. Validation Accuracy | 34.25% | 38.42% | 39.17% | 35.00% | 38.54% | 40.64% | 38.44% | 43.97% |
| Test Loss | 7.7658 | 1.5196 | 4.4111 | 1.4217 | 4.1973 | 1.5789 | 4.4777 | 1.5499 |
| Test Accuracy | 30.42% | 48.47% | 42.08% | 45.97% | 40.56% | 43.47% | 45.69% | 42.92% |
| F1 Score | 0.26 | 0.47 | 0.40 | 0.45 | 0.41 | 0.42 | 0.44 | 0.39 |
| ROC AUC | 0.72 | 0.88 | 0.81 | 0.88 | 0.78 | 0.87 | 0.80 | 0.86 |

Confusion Matrix and ROC Curve

The confusion matrix and ROC curve for the best performing fold (Fold 2) highlight the classifier's ability to distinguish between most classes effectively. However, there are instances of misclassification, suggesting the need for further refinement in the model.
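For reference, these evaluation artifacts are typically produced along the following lines with scikit-learn, assuming y_true holds the integer test labels and y_prob the model's softmax probabilities for the 10 classes (both are placeholder names here):

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# y_true: (n,) integer labels; y_prob: (n, 10) softmax probabilities from the model.
y_pred = y_prob.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred)                   # rows: true class, cols: predicted
f1 = f1_score(y_true, y_pred, average="macro")          # macro-averaged F1 over classes
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")  # one-vs-rest multiclass ROC AUC
print(cm, f1, auc, sep="\n")
```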

Architecture 2: Transformers

Transformers, known for their success in natural language processing, were adapted for audio classification in this project. The core of this architecture involves:

  • Convolutional Layers: Used initially to extract basic audio features such as tones and rhythms.
  • Transformer Blocks: Employed to process these features using attention mechanisms, enabling the model to focus on different parts of the audio sequence dynamically.
  • Multi-Head Attention: Utilized to attend to various representation subspaces simultaneously, enhancing the model's interpretive capabilities.
  • Positional Encodings: Incorporated to retain the sequential order of the audio data, allowing the model to make effective use of positional information.
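The post does not reproduce the model code; the sketch below shows one minimal way these four pieces can be combined in PyTorch. The class name, dimensions, and layer counts are assumptions, not the author's AudioClassifierWithTransformer:

```python
import math
import torch
import torch.nn as nn

class AudioTransformerSketch(nn.Module):
    """Conv front-end -> positional encoding -> Transformer encoder -> classifier."""
    def __init__(self, num_classes=10, d_model=128, nhead=2, num_layers=2, max_len=512):
        super().__init__()
        # Convolutional front-end: collapse the 64-bin mel axis into d_model channels.
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=(64, 3), padding=(0, 1)), nn.ReLU()
        )
        # Fixed sinusoidal positional encodings over the time axis.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                                # x: (batch, 1, 64, time)
        x = self.conv(x).squeeze(2).transpose(1, 2)      # -> (batch, time, d_model)
        x = x + self.pe[: x.size(1)]                     # add positional information
        x = self.encoder(x).mean(dim=1)                  # average over time steps
        return self.head(x)
```

Instantiating this sketch with nhead set to 1, 2, or 4 corresponds to the head counts compared in the next section.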

Performance Metrics

The transformer model was evaluated with different numbers of attention heads (1, 2, and 4). Key observations include:

  • Two Heads Model: This configuration outperformed others in terms of test accuracy and F1 score, suggesting an optimal balance between feature learning and generalization.
  • Four Heads Model: Despite higher train accuracy, this model exhibited signs of overfitting, with less effective feature integration for classification.

The table below outlines the performance metrics for different configurations:

| Number of Heads | Train Accuracy | Valid Accuracy | Test Accuracy | Train Loss | Valid Loss | Test Loss | F1 Score | ROC AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 Head | 80.74% | 46.39% | 43.47% | 0.5412 | 2.5903 | 2.9106 | 0.41 | 0.82 |
| 2 Heads | 79.91% | 49.86% | 49.86% | 0.5778 | 2.4115 | 2.4757 | 0.47 | 0.86 |
| 4 Heads | 81.71% | 44.86% | 42.78% | 0.5759 | 2.6297 | 2.4895 | 0.40 | 0.84 |

Enhanced Model with Transformers

The enhanced model employed additional techniques such as gradient clipping and the AdamW optimizer, coupled with a learning rate scheduler. This configuration significantly improved the model's stability and generalization capabilities.

  • Gradient Clipping: Applied to prevent exploding gradients, ensuring stable training.
  • AdamW Optimizer: Chosen because it decouples weight decay from the gradient update, providing more effective regularization and better performance on validation data.
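The exact hyperparameters are not given in the post; the following sketch only shows how gradient clipping, AdamW, and a learning-rate scheduler are commonly wired into a PyTorch training loop. The learning rate, weight decay, clipping norm, the ReduceLROnPlateau choice, and the model, train_loader, and val_loss names are illustrative assumptions:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=3)
criterion = torch.nn.CrossEntropyLoss()

for specs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(specs), targets)
    loss.backward()
    # Clip the gradient norm to keep updates bounded and training stable.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

# After each epoch, step the scheduler on the validation loss so the
# learning rate is reduced when validation performance plateaus.
scheduler.step(val_loss)
```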

The enhanced model demonstrated superior performance across several metrics:

| Metric | Enhanced Model |
| --- | --- |
| Train Accuracy | 79.81% |
| Validation Accuracy | 55.00% |
| Test Accuracy | 58.19% |
| Train Loss | 0.6030 |
| Validation Loss | 1.5191 |
| Test Loss | 1.1435 |
| F1 Score | 0.56 |
| ROC AUC | 0.93 |

Trainable Parameters

  • SoundClassifier: Approximately 16.4 million trainable parameters.
  • AudioClassifierWithTransformer: About 8.9 million trainable parameters.
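Counts like these are typically obtained with the standard PyTorch one-liner below:

```python
def count_trainable_parameters(model):
    """Sum of elements across all parameters that require gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. print(f"{count_trainable_parameters(model):,} trainable parameters")
```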

Conclusion

This project illustrates the potential of both CNNs and Transformers in audio classification tasks. While CNNs provide a solid foundation for feature extraction, Transformers offer advanced capabilities through attention mechanisms, enhancing the model's ability to interpret complex audio signals. By incorporating regularization techniques and advanced optimizers, the enhanced models achieved significant improvements in generalization and stability, highlighting the importance of these strategies in machine learning.

The results underscore the effectiveness of using a combination of traditional convolutional methods and modern transformer architectures to tackle the challenges of audio classification, paving the way for further innovations in this exciting field.

