Voice manipulation detection system trained on ASVspoof2019 LA. Three architectures compared on 43% of the dataset — BiLSTM, CNN, Transformer — then the winning Transformer retrained on the full dataset. Final model: 87.11% accuracy, EER 26.76%, F1 0.926.
Voice spoofing attacks — TTS synthesis, voice conversion, replay — are a direct threat to speaker verification systems used in biometric authentication. This project implements a binary classifier that distinguishes bonafide human speech from synthetic or manipulated audio.
The project ran in two phases: first, all three architectures were trained on 43% of the dataset to identify the strongest approach. Then the winning Transformer was retrained on the full dataset, improving EER from 34.96% to 26.76%.
Fixed input window: 64,000 samples (4 seconds at 16 kHz). Features are precomputed and cached, so training from the second epoch onward is ~10× faster.
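The pad/truncate step for the fixed window can be sketched as follows (a minimal sketch; the function name is illustrative, not from the repo):

```python
import numpy as np

TARGET_LEN = 64_000  # 4 s at 16 kHz, as described above

def fix_window(wav: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Truncate long clips; zero-pad short ones to a fixed length."""
    if len(wav) >= target_len:
        return wav[:target_len]
    padded = np.zeros(target_len, dtype=wav.dtype)
    padded[: len(wav)] = wav
    return padded
```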
| Model | Parameters | Val Accuracy | Val EER | Test Accuracy | Test EER | F1 |
|---|---|---|---|---|---|---|
| CNN — input conv + 3 residual blocks + BatchNorm | ~360K | ~89.6% | 0.0064 | 28.0% | 0.3485 | 0.330 |
| BiLSTM — 2-layer + attention pooling | ~656K | ~98.8% | 0.0186 | 76.1% | 0.3500 | 0.853 |
| Transformer — 2-layer encoder + CLS token (selected) | ~282K | ~90.2% | 0.0466 | 86.5% | 0.3496 | 0.922 |
The CNN achieved val EER 0.0064 but collapsed to test EER 0.3485 — a 54× generalization gap. It predicts nearly everything as bonafide on unseen data (F1 0.330). The BiLSTM overfits similarly (val EER 0.0186 → test 0.3500). The Transformer's val EER of 0.0466 is higher, but its test EER of 0.3496 shows the smallest gap — 7.5× vs 54× for CNN. SpecAugment regularization is the deciding factor.
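The SpecAugment-style masking credited above zeroes out random frequency bands and time spans of the feature matrix; a minimal sketch (mask counts and widths are illustrative, not the repo's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, max_f=8, max_t=20):
    """Zero random frequency bands and time spans (SpecAugment-style)."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)          # mask width in freq bins
        f0 = rng.integers(0, n_freq - f + 1)    # mask start
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)          # mask width in frames
        t0 = rng.integers(0, n_time - t + 1)
        spec[:, t0:t0 + t] = 0.0
    return spec
```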
EER improved from 34.96% → 26.76% by training on the full dataset. Dev set EER: 1.80%. Eval set EER: 26.76% — the generalization gap remains (dev accuracy 97.40% vs eval 87.11%) but is substantially better than the 43% run.
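EER is the operating point where the false-accept rate (spoof accepted as bonafide) equals the false-reject rate (bonafide rejected as spoof). A minimal sketch of computing it from P(spoof) scores, assuming higher scores mean "more likely spoofed" (function name illustrative):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Find the threshold where FAR and FRR cross; return EER."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FAR: spoof scored at or below threshold (accepted as bonafide)
    far = np.array([(spoof_scores <= t).mean() for t in thresholds])
    # FRR: bonafide scored above threshold (rejected as spoof)
    frr = np.array([(bonafide_scores > t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2
```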
| | Predicted Bonafide | Predicted Spoof |
|---|---|---|
| Actual Bonafide | 4,754 | 2,601 |
| Actual Spoof | 6,582 | 57,300 |
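The headline metrics can be recomputed directly from the confusion matrix above (spoof treated as the positive class):

```python
# Counts from the eval confusion matrix
tp, fn = 57_300, 6_582   # spoof correctly / incorrectly classified
tn, fp = 4_754, 2_601    # bonafide correctly / incorrectly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} f1={f1:.3f}")  # matches 87.11% and 0.926
```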
The score distribution plots reveal how each model uses its output range. The CNN concentrates all predictions near 0 — it has essentially learned a degenerate classifier that calls everything bonafide. The BiLSTM pushes scores toward 0.65+ but without clean separation. The Transformer produces a bimodal distribution: bonafide scores cluster near 0, spoof scores cluster near 0.65, with the 0.5 decision threshold sitting in the gap. This is the only model with actual discriminative behavior.
A desktop GUI (src/app.py) wraps the production Transformer for real-time detection. Supports direct microphone recording and file input (WAV, FLAC, MP3, OGG, M4A, AAC). Auto-resampling to 16 kHz mono, peak normalization, LFCC extraction, inference with P(spoof) score. CPU inference ~200ms per prediction.
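The audio-conditioning steps before feature extraction (downmix, resample, peak-normalize) can be sketched as below; this uses naive linear-interpolation resampling for self-containment, whereas the app likely uses a proper audio library:

```python
import numpy as np

def preprocess(wav: np.ndarray, sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Downmix to mono, resample to 16 kHz, peak-normalize to [-1, 1]."""
    if wav.ndim == 2:                       # stereo (n_samples, 2) -> mono
        wav = wav.mean(axis=1)
    if sr != target_sr:                     # naive linear-interp resample
        n_out = int(len(wav) * target_sr / sr)
        wav = np.interp(np.linspace(0, len(wav) - 1, n_out),
                        np.arange(len(wav)), wav)
    peak = np.max(np.abs(wav))
    return wav / peak if peak > 0 else wav
```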
```bash
# Run inference app
python src/app.py

# Generate TTS attack samples to verify spoof detection
python src/gen_tts.py
```

Decision boundary: `P(spoof) > 0.5` → SPOOF DETECTED; `P(spoof) ≤ 0.5` → AUTHENTIC.
Fixed 4-second input window — longer audio truncated, shorter zero-padded. No streaming or continuous real-time analysis. Mono only (stereo auto-downmixed). EER of 26.76% on the full eval set reflects a real generalization gap between the dev and eval distributions in ASVspoof2019 — this is a known challenge in the anti-spoofing literature, not an implementation error. The dev set EER of 1.80% demonstrates the model learns the task well; the eval set contains attack types with different spectral characteristics.
ASVspoof2019 Logical Access (LA). Phase 1 training used a 43% subset (10,913 samples: 1,078 bonafide / 9,835 spoof). Phase 2 used the full training set. Validation: full dev set (10,683 samples). Evaluation: full eval set (71,237 samples, n=7,355 bonafide / n=63,882 spoof) covering the 13 TTS and voice conversion algorithms in the eval partition (A07–A19).