01 · Deep Learning · Security · Complete

Audio Spoofing Detection

Voice manipulation detection system trained on ASVspoof2019 LA. Three architectures compared on 43% of the dataset — BiLSTM, CNN, Transformer — then the winning Transformer retrained on the full dataset. Final model: 87.11% accuracy, EER 26.76%, F1 0.926.

Python · PyTorch · ASVspoof2019 · LFCC · BiLSTM · CNN · Transformer · SpecAugment
Overview

Voice spoofing attacks — TTS synthesis, voice conversion, replay — are a direct threat to speaker verification systems used in biometric authentication. This project implements a binary classifier that distinguishes bonafide human speech from synthetic or manipulated audio.

The project ran in two phases: first, all three architectures were trained on 43% of the dataset to identify the strongest approach. Then the winning Transformer was retrained on the full dataset, improving EER from 34.96% to 26.76%.

Feature Pipeline
Raw Audio → Resample 16 kHz mono → Peak Normalization → LFCC (60 coeff, 512-pt FFT) → Cache to disk → [batch, 401, 60]

Fixed input window: 64,000 samples (4 seconds at 16 kHz). Features are precomputed and cached to disk, so every epoch after the first trains ~10× faster.
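The pipeline can be sketched in plain NumPy. This is a simplified illustration only — the project likely uses a library LFCC implementation, and the window function, centering, and epsilon are assumptions; with a 512-pt FFT and a 160-sample hop, a 64,000-sample clip yields the [401, 60] feature shape quoted above:

```python
import numpy as np

def lfcc(wave, sr=16000, n_fft=512, hop=160, n_coeff=60):
    """Linear-frequency cepstral coefficients: |STFT|^2 -> log -> DCT-II."""
    # Fixed 4-second window: truncate long clips, zero-pad short ones.
    target = 4 * sr
    wave = wave[:target] if len(wave) >= target else np.pad(wave, (0, target - len(wave)))
    # Peak normalization
    peak = np.abs(wave).max()
    if peak > 0:
        wave = wave / peak
    # Centered framing (pad n_fft//2 each side) -> 401 frames for 64k samples
    wave = np.pad(wave, (n_fft // 2, n_fft // 2))
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] for i in range(n_frames)])
    # Power spectrum on a linear frequency axis (no mel warping, hence "L"FCC)
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    logspec = np.log(spec + 1e-10)
    # DCT-II along frequency, keeping the first n_coeff coefficients
    n_bins = logspec.shape[1]
    k = np.arange(n_coeff)[:, None]
    n = np.arange(n_bins)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bins))
    return logspec @ dct.T  # [frames, n_coeff] = [401, 60]
```

Caching then amounts to saving each clip's [401, 60] array to disk and loading it on subsequent epochs.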

Phase 1 — Architecture Comparison (43% Dataset)
| Model | Parameters | Val Accuracy | Val EER | Test Accuracy | Test EER | F1 |
|---|---|---|---|---|---|---|
| CNN — input conv + 3 residual blocks + BatchNorm | ~360K | ~89.6% | 0.0064 | 28.0% | 0.3485 | 0.330 |
| BiLSTM — 2-layer + attention pooling | ~656K | ~98.8% | 0.0186 | 76.1% | 0.3500 | 0.853 |
| Transformer — 2-layer encoder + CLS (selected) | ~282K | ~90.2% | 0.0466 | 86.5% | 0.3496 | 0.922 |
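The selected architecture — a 2-layer encoder with a CLS token — can be sketched as follows. The hidden width, head count, and feed-forward size here are illustrative guesses, not the project's exact hyperparameters; only the 2-layer depth, CLS pooling, and [batch, 401, 60] LFCC input come from the description above:

```python
import torch
import torch.nn as nn

class SpoofTransformer(nn.Module):
    """2-layer Transformer encoder with a learned CLS token over LFCC frames."""
    def __init__(self, n_feats=60, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)        # project LFCC -> model dim
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)              # binary spoof logit

    def forward(self, x):                              # x: [batch, 401, 60]
        h = self.proj(x)
        cls = self.cls.expand(h.size(0), -1, -1)       # prepend CLS token
        h = self.encoder(torch.cat([cls, h], dim=1))
        return torch.sigmoid(self.head(h[:, 0]))       # P(spoof) read from CLS
```

Pooling through a dedicated CLS token (rather than mean-pooling frames) lets attention decide which frames matter for the clip-level decision.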
Why the Transformer won despite worse val EER

The CNN achieved val EER 0.0064 but collapsed to test EER 0.3485 — a 54× generalization gap. It predicts nearly everything as bonafide on unseen data (F1 0.330). The BiLSTM overfits similarly (val EER 0.0186 → test 0.3500). The Transformer's val EER of 0.0466 is higher, but its test EER of 0.3496 shows the smallest gap — 7.5× vs 54× for CNN. SpecAugment regularization is the deciding factor.
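SpecAugment here means random time and frequency masking applied to the LFCC matrix during training. A minimal sketch — mask counts and widths are assumed values, not the project's settings:

```python
import numpy as np

def spec_augment(feats, n_freq_masks=2, freq_width=8,
                 n_time_masks=2, time_width=40, rng=None):
    """Zero out random frequency bands and time spans of a [time, freq] matrix."""
    if rng is None:
        rng = np.random.default_rng()
    out = feats.copy()
    t, f = out.shape
    for _ in range(n_freq_masks):
        w = rng.integers(0, freq_width + 1)            # mask width in bins
        start = rng.integers(0, max(1, f - w))
        out[:, start:start + w] = 0.0
    for _ in range(n_time_masks):
        w = rng.integers(0, time_width + 1)            # mask width in frames
        start = rng.integers(0, max(1, t - w))
        out[start:start + w, :] = 0.0
    return out
```

By forcing the model to classify with random spectral bands and time spans hidden, the augmentation discourages memorizing narrow artifacts — consistent with the Transformer's smaller val-to-test gap.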

Phase 2 — Full Dataset Retrain (Transformer)
Accuracy 87.11% · Precision 0.957 · Recall 0.897 · F1 Score 0.926

EER improved from 34.96% → 26.76% by training on the full dataset. Dev set EER: 1.80%. Eval set EER: 26.76% — the generalization gap remains (dev accuracy 97.40% vs eval 87.11%) but is substantially better than the 43% run.
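EER is the operating point where the false-acceptance and false-rejection rates cross. A sketch of how it can be computed from raw scores, assuming the convention used here (higher score = more spoof-like, label 1 = spoof); the project's exact scoring script may differ:

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate: threshold where miss rate equals false-alarm rate."""
    order = np.argsort(scores)
    scores, labels = np.asarray(scores)[order], np.asarray(labels)[order]
    n_spoof = labels.sum()
    n_bona = len(labels) - n_spoof
    # Sweep the threshold across each sorted score:
    frr = np.cumsum(labels) / n_spoof            # spoof scored below threshold
    far = 1 - np.cumsum(1 - labels) / n_bona     # bonafide scored at/above it
    idx = np.argmin(np.abs(far - frr))           # closest crossing point
    return float((far[idx] + frr[idx]) / 2)
```

Unlike accuracy at a fixed 0.5 cutoff, EER is threshold-free, which is why it can disagree with the headline accuracy numbers above.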

Confusion Matrix — Full Eval Set (71,237 samples)
|  | Predicted Bonafide | Predicted Spoof |
|---|---|---|
| Actual Bonafide | 4,754 | 2,601 |
| Actual Spoof | 6,582 | 57,300 |
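The headline metrics follow directly from these counts, with spoof treated as the positive class:

```python
# Counts from the eval confusion matrix (spoof = positive class)
tp, fn = 57_300, 6_582    # spoof correctly / wrongly classified
tn, fp = 4_754, 2_601     # bonafide correctly / wrongly classified

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# accuracy ≈ 0.8711, precision ≈ 0.957, recall ≈ 0.897, f1 ≈ 0.926
```

Note the class imbalance (63,882 spoof vs 7,355 bonafide): high accuracy and F1 coexist with a substantial bonafide false-reject count (2,601 of 7,355).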
Score Distribution Interpretation

The score distribution plots reveal how each model uses its output range. The CNN concentrates all predictions near 0 — it has essentially learned a degenerate classifier that calls everything bonafide. The BiLSTM pushes scores toward 0.65+ but without clean separation. The Transformer produces a bimodal distribution: bonafide scores cluster near 0, spoof scores cluster near 0.65, with the 0.5 decision threshold sitting in the gap. This is the only model with actual discriminative behavior.

Inference Application

A desktop GUI (src/app.py) wraps the production Transformer for real-time detection. Supports direct microphone recording and file input (WAV, FLAC, MP3, OGG, M4A, AAC). Auto-resampling to 16 kHz mono, peak normalization, LFCC extraction, inference with P(spoof) score. CPU inference ~200ms per prediction.

# Run inference app
python src/app.py

# Generate TTS attack samples to verify spoof detection
python src/gen_tts.py

# Decision boundary
P(spoof) > 0.5  →  SPOOF DETECTED
P(spoof) ≤ 0.5  →  AUTHENTIC
Known Limitations

- Fixed 4-second input window — longer audio is truncated, shorter audio zero-padded.
- No streaming or continuous real-time analysis.
- Mono only (stereo is auto-downmixed).
- EER of 26.76% on the full eval set reflects a real generalization gap between the dev and eval distributions in ASVspoof2019 — a known challenge in the anti-spoofing literature, not an implementation error. The dev set EER of 1.80% demonstrates the model learns the task well; the eval set contains attack types with different spectral characteristics.

Dataset

ASVspoof2019 Logical Access (LA). Phase 1 training used a 43% subset (10,913 samples: 1,078 bonafide / 9,835 spoof). Phase 2 used the full training set. Validation: full dev set (10,683 samples). Evaluation: full eval set (71,237 samples, n=7,355 bonafide / n=63,882 spoof) covering 13 TTS and voice conversion algorithms (A07–A19), none of which appear among the train/dev attacks (A01–A06).