Introduction
Audio classification is a fascinating area of machine learning that involves categorizing audio signals into predefined classes. In this post, we delve into the specifics of an audio classification project, exploring the architectures, methodologies, and results obtained from experimenting with Convolutional Neural Networks (CNNs) and Transformers.
Dataset
The project utilized the ESC-50 dataset, a compilation of environmental audio clips categorized into 50 different classes. Specifically, the ESC-10 subset was used, narrowing the dataset to 10 categories for more focused experimentation.
Architecture 1: Convolutional Neural Networks (CNNs)
Initial Setup
The initial model setup for audio classification relied heavily on CNNs. The network stacks convolutional layers that progressively extract features from the audio signals, increasing the number of output channels from 16 to 64. Each convolutional layer is followed by a max-pooling layer that reduces the spatial dimensions and highlights the most salient features.
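The exact layer configuration is project-specific, but a minimal PyTorch sketch of this kind of stack, assuming single-channel mel-spectrogram inputs and the 10 ESC-10 classes, might look like the following (the project's actual SoundClassifier is much larger, at roughly 16.4 million parameters):

```python
import torch
import torch.nn as nn

class SoundClassifier(nn.Module):
    """Minimal CNN sketch: three conv blocks growing from 16 to 64 channels,
    each followed by max-pooling, then a small classification head."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1 input channel (spectrogram)
            nn.ReLU(),
            nn.MaxPool2d(2),                               # halve both spatial dimensions
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),    # collapse the remaining frequency/time dimensions
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# Example: a batch of 4 single-channel mel spectrograms (e.g. 128 mel bins x 431 frames).
logits = SoundClassifier()(torch.randn(4, 1, 128, 431))
print(logits.shape)  # torch.Size([4, 10])
```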
Original Model
The original model focused solely on feature extraction, without dropout, early stopping, or other regularization techniques. This resulted in a basic yet effective structure for learning the complex patterns present in audio data.
Enhanced Model
To combat overfitting and improve generalization, several enhancements were made (see the sketch after this list):
- Dropout: Introduced to randomly deactivate neurons during training, thereby preventing over-reliance on specific paths.
- Early Stopping: Implemented to halt training when validation performance plateaued, ensuring the model does not overfit to the training data.
- Regularization: Additional techniques were employed to further stabilize the training process and enhance generalization.
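As a rough sketch of how these pieces can fit together in a training loop (the loader names, patience, and weight-decay strength below are illustrative assumptions, not the project's exact settings; dropout itself lives inside the model definition):

```python
import copy
import torch
import torch.nn as nn

def train_with_early_stopping(model, train_loader, val_loader,
                              epochs=50, patience=5, lr=1e-3, weight_decay=1e-4):
    """Sketch: weight decay as extra regularization, early stopping on validation loss."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    best_val_loss, best_state, stale_epochs = float("inf"), None, 0
    for epoch in range(epochs):
        model.train()                       # enables dropout layers
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        model.eval()                        # disables dropout for evaluation
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)

        if val_loss < best_val_loss:        # keep the best weights seen so far
            best_val_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:    # early stopping: validation loss has plateaued
                break

    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```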
Results
Cross-validation over the dataset's five predefined folds, with fold 1 reserved for validation throughout, provided a comprehensive evaluation of the model's performance (a fold-splitting sketch follows the observation below). Key observations from hyperparameter tuning include:
- Reduced Overfitting: The enhanced model exhibited lower test losses and higher ROC AUC values on every fold, and higher test accuracies and F1 scores on three of the four test folds, compared to the original model.
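For context, one plausible way to carve out such splits from the ESC-50 metadata is sketched below, assuming the official meta/esc50.csv layout with its filename, fold, and esc10 columns; the project's exact protocol may differ:

```python
import pandas as pd

# Assumes the official ESC-50 repository layout and column names.
meta = pd.read_csv("ESC-50-master/meta/esc50.csv")
esc10 = meta[meta["esc10"]]                  # keep only the 10-class ESC-10 subset

val_fold = 1                                 # fold 1 reserved for validation throughout
for test_fold in (2, 3, 4, 5):               # remaining folds take turns as the test fold
    train = esc10[~esc10["fold"].isin([val_fold, test_fold])]
    val = esc10[esc10["fold"] == val_fold]
    test = esc10[esc10["fold"] == test_fold]
    print(test_fold, len(train), len(val), len(test))
```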
The following table summarizes the performance across different folds:
| Metric | Fold 2 (Original) | Fold 2 (Enhanced) | Fold 3 (Original) | Fold 3 (Enhanced) | Fold 4 (Original) | Fold 4 (Enhanced) | Fold 5 (Original) | Fold 5 (Enhanced) |
|---|---|---|---|---|---|---|---|---|
| Avg. Training Accuracy | 63.49% | 51.15% | 68.77% | 43.67% | 68.64% | 55.49% | 67.55% | 49.84% |
| Avg. Validation Accuracy | 34.25% | 38.42% | 39.17% | 35.00% | 38.54% | 40.64% | 38.44% | 43.97% |
| Test Loss | 7.7658 | 1.5196 | 4.4111 | 1.4217 | 4.1973 | 1.5789 | 4.4777 | 1.5499 |
| Test Accuracy | 30.42% | 48.47% | 42.08% | 45.97% | 40.56% | 43.47% | 45.69% | 42.92% |
| F1 Score | 0.26 | 0.47 | 0.40 | 0.45 | 0.41 | 0.42 | 0.44 | 0.39 |
| ROC AUC | 0.72 | 0.88 | 0.81 | 0.88 | 0.78 | 0.87 | 0.80 | 0.86 |
Confusion Matrix and ROC Curve
The confusion matrix and ROC curve for the best performing fold (Fold 2) highlight the classifier's ability to distinguish between most classes effectively. However, there are instances of misclassification, suggesting the need for further refinement in the model.
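For reference, both diagnostics can be computed with scikit-learn from the model's predicted class probabilities; the tiny arrays below are placeholders rather than project outputs:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 1, 2, 2, 1])           # integer class labels
y_prob = np.array([                          # softmax probabilities, one row per clip
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],                         # a misclassification: true class 2, predicted 1
    [0.2, 0.7, 0.1],
])

y_pred = y_prob.argmax(axis=1)
print(confusion_matrix(y_true, y_pred))                    # rows = true class, columns = predicted
print(roc_auc_score(y_true, y_prob, multi_class="ovr"))    # one-vs-rest macro-averaged ROC AUC
```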
Architecture 2: Transformers
Transformers, known for their success in natural language processing, were adapted for audio classification in this project. The core of this architecture involves:
- Convolutional Layers: Used initially to extract basic audio features such as tones and rhythms.
- Transformer Blocks: Employed to process these features using attention mechanisms, enabling the model to focus on different parts of the audio sequence dynamically.
- Multi-Head Attention: Utilized to attend to various representation subspaces simultaneously, enhancing the model's interpretive capabilities.
- Positional Encodings: Incorporated to retain the sequential order of the audio frames, giving the otherwise order-agnostic attention layers access to positional information.
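Putting these pieces together, a minimal PyTorch sketch of such an architecture might look like the following; the layer sizes, the learned positional embedding, and the mean-pooling over time are illustrative assumptions, and the project's actual AudioClassifierWithTransformer (about 8.9 million parameters) is larger:

```python
import torch
import torch.nn as nn

class AudioClassifierWithTransformer(nn.Module):
    """Sketch: conv front-end -> positional encoding -> transformer encoder -> classifier."""
    def __init__(self, num_classes=10, d_model=128, num_heads=2, num_layers=2, max_len=512):
        super().__init__()
        # Convolutional front-end: extracts local features and shrinks frequency/time.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, d_model, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Learned positional embeddings preserve the order of the time steps.
        self.pos_embedding = nn.Parameter(torch.zeros(1, max_len, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):                        # x: (batch, 1, n_mels, time)
        feats = self.conv(x)                     # (batch, d_model, n_mels/4, time/4)
        feats = feats.mean(dim=2)                # pool over frequency -> (batch, d_model, time/4)
        seq = feats.transpose(1, 2)              # (batch, time/4, d_model) for the encoder
        seq = seq + self.pos_embedding[:, : seq.size(1), :]
        encoded = self.encoder(seq)              # multi-head self-attention over time steps
        return self.classifier(encoded.mean(dim=1))   # average over time, then classify

# Example: 4 mel spectrograms of shape (1, 128, 431) -> logits of shape (4, 10).
print(AudioClassifierWithTransformer()(torch.randn(4, 1, 128, 431)).shape)
```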
Performance Metrics
The transformer model was evaluated with different numbers of attention heads (1, 2, and 4); a short snippet after the observations below shows how this setting is typically exposed. Key observations include:
- Two Heads Model: This configuration outperformed others in terms of test accuracy and F1 score, suggesting an optimal balance between feature learning and generalization.
- Four Heads Model: Despite higher train accuracy, this model exhibited signs of overfitting, with less effective feature integration for classification.
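In PyTorch, the head count is typically just the nhead argument of the encoder layer; this small stand-alone snippet illustrates the knob being varied (the tensor shape is an arbitrary example):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 107, 128)   # (batch, sequence length, d_model); d_model must be divisible by nhead
for nhead in (1, 2, 4):
    layer = nn.TransformerEncoderLayer(d_model=128, nhead=nhead, batch_first=True)
    print(nhead, layer(x).shape)   # output shape is identical for every head count
```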
The table below outlines the performance metrics for different configurations:
| Number of Heads | Train Accuracy | Valid Accuracy | Test Accuracy | Train Loss | Valid Loss | Test Loss | F1 Score | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| 1 Head | 80.74% | 46.39% | 43.47% | 0.5412 | 2.5903 | 2.9106 | 0.41 | 0.82 |
| 2 Heads | 79.91% | 49.86% | 49.86% | 0.5778 | 2.4115 | 2.4757 | 0.47 | 0.86 |
| 4 Heads | 81.71% | 44.86% | 42.78% | 0.5759 | 2.6297 | 2.4895 | 0.40 | 0.84 |
Enhanced Model with Transformers
The enhanced model added gradient clipping and the AdamW optimizer, coupled with a learning rate scheduler, as sketched below. This configuration significantly improved the model's training stability and generalization.
- Gradient Clipping: Applied to prevent exploding gradients, ensuring stable training.
- AdamW Optimizer: Chosen for its decoupled weight decay regularization, which improved the model's performance on validation data.
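A minimal sketch of how these pieces might be wired together follows; the hyperparameters and the StepLR schedule are placeholders, not the project's exact values:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, scheduler, criterion, max_norm=1.0):
    """Sketch of the enhanced training step: clip gradients, step AdamW, then the scheduler."""
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        # Gradient clipping: rescale gradients whose global norm exceeds max_norm.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
    scheduler.step()   # decay the learning rate once per epoch

# Illustrative wiring with a stand-in model:
model = nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = nn.CrossEntropyLoss()
```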
The enhanced model demonstrated superior performance across several metrics:
| Metric | Enhanced Model |
|---|---|
| Train Accuracy | 79.81% |
| Validation Accuracy | 55.00% |
| Test Accuracy | 58.19% |
| Train Loss | 0.6030 |
| Validation Loss | 1.5191 |
| Test Loss | 1.1435 |
| F1 Score | 0.56 |
| ROC AUC | 0.93 |
Trainable Parameters
- SoundClassifier: Approximately 16.4 million trainable parameters.
- AudioClassifierWithTransformer: About 8.9 million trainable parameters.
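Both counts can be reproduced by summing the sizes of the parameters the optimizer will actually update, for example:

```python
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Count parameters that require gradients, i.e. those updated during training."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```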
Conclusion
This project illustrates the potential of both CNNs and Transformers in audio classification tasks. While CNNs provide a solid foundation for feature extraction, Transformers offer advanced capabilities through attention mechanisms, enhancing the model's ability to interpret complex audio signals. By incorporating regularization techniques and advanced optimizers, the enhanced models achieved significant improvements in generalization and stability, highlighting the importance of these strategies in machine learning.
The results underscore the effectiveness of using a combination of traditional convolutional methods and modern transformer architectures to tackle the challenges of audio classification, paving the way for further innovations in this exciting field.