Voice manipulation detection system trained on ASVspoof2019 LA. Three architectures compared on 43% of the dataset — BiLSTM, CNN, Transformer — then the winning Transformer retrained on the full dataset. Final model: 87.11% accuracy, EER 26.76%, F1 0.926.
Voice spoofing attacks — TTS synthesis, voice conversion, replay — are a direct threat to speaker verification systems used in biometric authentication. This project implements a binary classifier that distinguishes bonafide human speech from synthetic or manipulated audio.
The project ran in two phases: first, all three architectures were trained on 43% of the dataset to identify the strongest approach. Then the winning Transformer was retrained on the full dataset, improving EER from 34.96% to 26.76%.
Fixed input window: 64,000 samples (4 seconds at 16 kHz). Features are precomputed and cached, so training from the second epoch onward is ~10× faster.
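The pad/truncate step for the fixed window can be sketched as follows (a minimal sketch; the function name is illustrative, not from the repo):

```python
import numpy as np

TARGET_LEN = 64_000  # 4 s at 16 kHz, as described above

def fix_window(wav: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Truncate long clips; zero-pad short ones to a fixed length."""
    if len(wav) >= target_len:
        return wav[:target_len]
    padded = np.zeros(target_len, dtype=wav.dtype)
    padded[: len(wav)] = wav
    return padded
```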
| Model | Parameters | Val Accuracy | Val EER | Test Accuracy | Test EER | F1 |
|---|---|---|---|---|---|---|
| CNN — input conv + 3 residual blocks + BatchNorm | ~360K | ~89.6% | 0.0064 | 28.0% | 0.3485 | 0.330 |
| BiLSTM — 2-layer + attention pooling | ~656K | ~98.8% | 0.0186 | 76.1% | 0.3500 | 0.853 |
| Transformer — 2-layer encoder + CLS token (selected) | ~282K | ~90.2% | 0.0466 | 86.5% | 0.3496 | 0.922 |
The CNN achieved val EER 0.0064 but collapsed to test EER 0.3485 — a 54× generalization gap. It predicts nearly everything as bonafide on unseen data (F1 0.330). The BiLSTM overfits similarly (val EER 0.0186 → test 0.3500). The Transformer's val EER of 0.0466 is higher, but its test EER of 0.3496 shows the smallest gap — 7.5× vs 54× for CNN. SpecAugment regularization is the deciding factor.
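The SpecAugment-style masking credited above zeroes out random frequency bands and time spans of the feature matrix; a minimal sketch (mask counts and widths are illustrative, not the repo's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, max_f=8, max_t=20):
    """Zero random frequency bands and time spans (SpecAugment-style)."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)          # mask width in freq bins
        f0 = rng.integers(0, n_freq - f + 1)    # mask start
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)          # mask width in frames
        t0 = rng.integers(0, n_time - t + 1)
        spec[:, t0:t0 + t] = 0.0
    return spec
```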
EER improved from 34.96% → 26.76% by training on the full dataset. Dev set EER: 1.80%. Eval set EER: 26.76% — the generalization gap remains (dev accuracy 97.40% vs eval 87.11%) but is substantially better than the 43% run.
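EER is the operating point where the false-accept rate (spoof accepted as bonafide) equals the false-reject rate (bonafide rejected as spoof). A minimal sketch of computing it from P(spoof) scores, assuming higher scores mean "more likely spoofed" (function name illustrative):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Find the threshold where FAR and FRR cross; return EER."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FAR: spoof scored at or below threshold (accepted as bonafide)
    far = np.array([(spoof_scores <= t).mean() for t in thresholds])
    # FRR: bonafide scored above threshold (rejected as spoof)
    frr = np.array([(bonafide_scores > t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2
```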
| | Predicted Bonafide | Predicted Spoof |
|---|---|---|
| Actual Bonafide | 4,754 | 2,601 |
| Actual Spoof | 6,582 | 57,300 |
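The headline metrics can be recomputed directly from the confusion matrix above (spoof treated as the positive class):

```python
# Counts from the eval confusion matrix
tp, fn = 57_300, 6_582   # spoof correctly / incorrectly classified
tn, fp = 4_754, 2_601    # bonafide correctly / incorrectly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} f1={f1:.3f}")  # matches 87.11% and 0.926
```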
The score distribution plots reveal how each model uses its output range. The CNN concentrates all predictions near 0 — it has essentially learned a degenerate classifier that calls everything bonafide. The BiLSTM pushes scores toward 0.65+ but without clean separation. The Transformer produces a bimodal distribution: bonafide scores cluster near 0, spoof scores cluster near 0.65, with the 0.5 decision threshold sitting in the gap. This is the only model with actual discriminative behavior.
A desktop GUI (src/app.py) wraps the production Transformer for real-time detection. Supports direct microphone recording and file input (WAV, FLAC, MP3, OGG, M4A, AAC). Auto-resampling to 16 kHz mono, peak normalization, LFCC extraction, inference with P(spoof) score. CPU inference ~200ms per prediction.
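The audio-conditioning steps before feature extraction (downmix, resample, peak-normalize) can be sketched as below; this uses naive linear-interpolation resampling for self-containment, whereas the app likely uses a proper audio library:

```python
import numpy as np

def preprocess(wav: np.ndarray, sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Downmix to mono, resample to 16 kHz, peak-normalize to [-1, 1]."""
    if wav.ndim == 2:                       # stereo (n_samples, 2) -> mono
        wav = wav.mean(axis=1)
    if sr != target_sr:                     # naive linear-interp resample
        n_out = int(len(wav) * target_sr / sr)
        wav = np.interp(np.linspace(0, len(wav) - 1, n_out),
                        np.arange(len(wav)), wav)
    peak = np.max(np.abs(wav))
    return wav / peak if peak > 0 else wav
```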
```bash
# Run inference app
python src/app.py

# Generate TTS attack samples to verify spoof detection
python src/gen_tts.py
```

Decision boundary: `P(spoof) > 0.5` → SPOOF DETECTED; `P(spoof) ≤ 0.5` → AUTHENTIC.
Fixed 4-second input window — longer audio truncated, shorter zero-padded. No streaming or continuous real-time analysis. Mono only (stereo auto-downmixed). EER of 26.76% on the full eval set reflects a real generalization gap between the dev and eval distributions in ASVspoof2019 — this is a known challenge in the anti-spoofing literature, not an implementation error. The dev set EER of 1.80% demonstrates the model learns the task well; the eval set contains attack types with different spectral characteristics.
ASVspoof2019 Logical Access (LA). Phase 1 training used a 43% subset (10,913 samples: 1,078 bonafide / 9,835 spoof). Phase 2 used the full training set. Validation: full dev set (10,683 samples). Evaluation: full eval set (71,237 samples, n=7,355 bonafide / n=63,882 spoof) covering the 13 TTS and voice conversion algorithms in the eval partition (A07–A19).