Blind Audio Source Separation - Speech

We deal with the case where the sources are linearly mixed and the mixtures are underdetermined. Hence, A has more columns than rows. Sparsity of the sources is vital for good separation. Bayesian methods such as the Gibbs Sampler (a standard MCMC simulation method) are used to estimate the sources and the mixing matrix in the presence of noise.

I.I.D. Gaussian noise was added to the observations, which resulted in an SNR of about 16 dB. The mixing matrix used is given by A = [0.4000 0.8315 0.5657; -0.6928 -0.3444 0.5657].

2) Speech Signals

These are the 3 speech signals.
Speech Signal 1
"While the Jeffers had reached its limit, it was now mid-August, which meant he had been separated from Marshall from..."
Speech Signal 2
"From the playground of the world, there was no... place like it, in the whole world, like coning out when I was a youngster."
Speech Signal 3
"Get ready for the dynamic impact test, where we'll really put your audio system through its paces to show you what it can do."

These are the 2 mixtures.
Mixture 1
Mixture 2

Sparsity Indices for various transform types.

Click on thumbnails for larger (and clearer) versions.

Transform Types
1	2	3	4	5	6
DCT	MDCT	WT (Vai)	WT (Sym8)	WPBB	No Transform

Results at a glance.

Performance Measures
1	2	3	4
SDR	SIR	SAR	SNR

2.1) Discrete Cosine Transform

Reconstructed Speech Signal 1
Reconstructed Speech Signal 2
Reconstructed Speech Signal 3

Virtually no separation even after convergence was observed.

2.2) Modified Discrete Cosine Transform

Reconstructed Speech Signal 1
Reconstructed Speech Signal 2
Reconstructed Speech Signal 3

Very good separation.
Signals 1 and 3 contain noticeable but limited anomalous artifacts.
But, even the chirping of the birds can be heard clearly in Signal 2.

2.3) Wavelet Transform: Vaidyanathan

Reconstructed Speech Signal 1
Reconstructed Speech Signal 2
Reconstructed Speech Signal 3

Wavelets are not the right basis to use for separation of speech signals.
Very distinct artifacts.

2.4) Wavelet Transform: Symmlet 8

Reconstructed Speech Signal 1
Reconstructed Speech Signal 2
Reconstructed Speech Signal 3

Wavelets are not the right basis to use for separation of speech signals.
Very distinct artifacts.

2.5) Wavelet Packet Best Basis

Reconstructed Speech Signal 1
Reconstructed Speech Signal 2
Reconstructed Speech Signal 3

An adaptive basis does very well but not as well as the MDCT in terms of noise suppression.
You have to listen really carefully before you can come to a conclusion that this is not as good as the MDCT.

2.6) No Transform

Reconstructed Speech Signal 1
Reconstructed Speech Signal 2
Reconstructed Speech Signal 3

Surprisingly, applying the Gibbs Sampler on the sources directly, without performing the transform, also results in some separation.

Back
Next
Home

Server at www.eng.cam.ac.uk