SHT+PEVD Separation Demo

Demonstration: Informed Source Separation

Anechoic speech signals sampled at 16 kHz are taken from the TIMIT corpus [1]. The room impulse responses (RIRs) from the 2 sources to 32 microphones on a rigid sphere are simulated using the SMIRgen tool [2]. Three reverberation times are used: T60 = 0 s (anechoic), 0.3 s and 0.7 s.

For each speech source, short utterances from a randomly selected speaker are concatenated to generate signals of 8 to 10 s duration. Each source signal is convolved with its RIRs and mixed with 30 dB sensor noise to generate the microphone signals. Eigenbeam signals are generated using the spherical harmonic (SH) transform [3,4]. The modal beamformers are directed at (90,90) in degrees for Source 1 and (90,90) in degrees for Source 2. Two microphones closest to each source are chosen for the comparative methods. The sequential matrix diagonalisation (SMD) algorithm [5] is used for all PEVD processing.
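The microphone-signal generation step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes "30 dB sensor noise" means white Gaussian noise at a 30 dB signal-to-noise ratio relative to the mean power of the noise-free mixture, and it uses random decaying sequences as placeholders for the speech signals and the SMIRgen RIRs. The function name `simulate_microphones` and its parameters are hypothetical.

```python
import numpy as np


def simulate_microphones(sources, rirs, snr_db=30.0, rng=None):
    """Convolve each source with its RIRs, sum over sources, add sensor noise.

    sources : list of 1-D arrays, one per source
    rirs    : array of shape (n_sources, n_mics, rir_len)
    snr_db  : assumed SNR of the white sensor noise (hypothetical reading of "30 dB")
    Returns (noisy, clean), each of shape (n_mics, n_samples).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_src, n_mics, rir_len = rirs.shape
    n_out = max(len(s) for s in sources) + rir_len - 1

    clean = np.zeros((n_mics, n_out))
    for s, src in enumerate(sources):
        for m in range(n_mics):
            y = np.convolve(src, rirs[s, m])  # reverberant source image at mic m
            clean[m, :len(y)] += y

    # Scale white Gaussian noise so the average SNR over all microphones is snr_db.
    noise_power = np.mean(clean ** 2) / 10 ** (snr_db / 10)
    noise = rng.normal(scale=np.sqrt(noise_power), size=clean.shape)
    return clean + noise, clean


# Toy usage: 2 sources, 32 microphones, placeholder signals and RIRs.
rng = np.random.default_rng(0)
fs = 16000
sources = [rng.standard_normal(fs), rng.standard_normal(fs)]  # stand-ins for speech
decay = np.exp(-np.arange(64) / 16.0)
rirs = rng.standard_normal((2, 32, 64)) * decay               # stand-ins for SMIRgen RIRs
mics, clean = simulate_microphones(sources, rirs, snr_db=30.0, rng=rng)
measured_snr_db = 10 * np.log10(np.mean(clean ** 2) / np.mean((mics - clean) ** 2))
```

Scaling the noise against the mixture power, rather than against each source separately, is one of several plausible conventions; the definition used for the demo may differ.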

The algorithms used in the comparison include:

Auxiliary function-based IVA (AuxIVA) [6]
AuxIVA using 32 microphone signals instead of 2 (AuxIVA32) [6]
Fast Multi-channel Non-negative Matrix Factorisation (FastMNMF) [7]
Independent Low-Rank Matrix Analysis (ILRMA) [8]
3rd Order Hypercardioid Modal Beamformer Optimized for Maximum Directivity (MaxDir) [9]
3rd Order Modified Hypercardioid Modal Beamformer (MHCARD) [10]
PEVD, which uses 4 beamformed signals per source: χ_1^-1, χ_3^-3, χ_3^-1 and MHCARD directed at Source 1; χ_1^1, χ_3^1, χ_3^3 and MHCARD directed at Source 2 [10]
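PEVD algorithms such as SMD [5] operate on a para-Hermitian polynomial matrix, which in this kind of application is typically the space-time covariance of the multichannel input, R[τ] = E{x[n] x^H[n−τ]}. The sketch below estimates that matrix from data over a finite range of lags. It is an illustrative assumption about the pre-processing, not the implementation used for this demo, and the function name `space_time_covariance` and its parameters are hypothetical.

```python
import numpy as np


def space_time_covariance(x, max_lag):
    """Estimate R[tau] = E{x[n] x^H[n - tau]} for tau = -max_lag..max_lag.

    x : array of shape (n_channels, n_samples)
    Returns R of shape (2 * max_lag + 1, n_channels, n_channels),
    where R[max_lag + tau] holds the estimate at lag tau.
    """
    x = np.asarray(x)
    n_ch, n = x.shape
    R = np.zeros((2 * max_lag + 1, n_ch, n_ch),
                 dtype=np.result_type(x.dtype, np.float64))
    for tau in range(max_lag + 1):
        # Sum of x[n] x^H[n - tau] over n = tau .. n_samples - 1.
        R_tau = x[:, tau:] @ x[:, :n - tau].conj().T / n
        R[max_lag + tau] = R_tau
        # Para-Hermitian symmetry: R[-tau] = R[tau]^H.
        R[max_lag - tau] = R_tau.conj().T
    return R


# Toy usage: channel 1 is channel 0 delayed by 3 samples (circularly).
rng = np.random.default_rng(0)
s = rng.standard_normal(5000)
x = np.stack([s, np.roll(s, 3)])
R = space_time_covariance(x, max_lag=10)
peak_lag = int(np.argmax(np.abs(R[:, 1, 0]))) - 10  # lag of the strongest cross-correlation
```

In the toy example the cross-correlation between the two channels peaks at the lag equal to the inter-channel delay. This inter-channel, inter-lag structure is what a polynomial EVD diagonalises, whereas an ordinary EVD of R[0] alone would only decorrelate the channels at lag zero.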

The audio player is built using the trackswitch.js tool [11].

Anechoic

2 Females: Two female speakers in an anechoic room.

T60 = 0.3 s

2 Males: Two male speakers in a reverberant room with T60 = 0.3 s.

Speech/Fan: One speaker and localized fan noise in a reverberant room with T60 = 0.3 s.

T60 = 0.7 s

Male/Female: One male and one female speaker in a reverberant room with T60 = 0.7 s.

References

[1] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium (LDC), Philadelphia, Corpus, 1993.

[2] D. P. Jarrett, E. A. P. Habets, M. R. P. Thomas and P. A. Naylor, “Rigid sphere room impulse response simulation: Algorithm and applications,” J. Acoust. Soc. Am., vol. 132, no. 3, pp. 1462-1472, Sep. 2012.

[3] B. Rafaely, Fundamentals of Spherical Array Processing, Springer Topics in Signal Processing. Springer-Verlag, 2015.

[4] D. P. Jarrett, E. A. P. Habets, and P. A. Naylor, Theory and Applications of Spherical Microphone Array Processing, Springer Topics in Signal Processing. Springer-Verlag, 2017.

[5] S. Redif, S. Weiss, and J. G. McWhirter, “Sequential matrix diagonalisation algorithms for polynomial EVD of para-Hermitian matrices,” IEEE Trans. Signal Process., vol. 63, no. 1, pp. 81-89, Jan. 2015.

[6] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in Proc. IEEE Workshop on Appl. of Signal Process. to Audio and Acoust. (WASPAA), Oct. 2011, pp. 189-192.

[7] K. Sekiguchi, A. A. Nugraha, Y. Bando, and K. Yoshii, “Fast multichannel source separation based on jointly diagonalizable spatial covariance matrices,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), 2019.

[8] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 9, pp. 1626-1641, Sept. 2016.

[9] J. Meyer, “A highly scalable spherical microphone array based on orthonormal decomposition of the soundfield,” in Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), May 2002, pp. 1781-1784.

[10] V. W. Neo, C. Evers, and P. A. Naylor, “Polynomial matrix eigenvalue decomposition-based source separation using informed spherical microphone arrays,” in Proc. IEEE Workshop on Appl. of Signal Process. to Audio and Acoust. (WASPAA), Oct. 2021, pp. 201-205.

Listening examples audio tool

[11] N. Werner, S. Balke, F.-R. Stöter, M. Müller, and B. Edler, “trackswitch.js: A versatile web-based audio player for presenting scientific results,” in Proc. 3rd Web Audio Conf., London, UK, 2017. [Online]. Available: https://github.com/audiolabs/trackswitch.js

PEVD-Based Algorithms for Speech Enhancement

[12] V. W. Neo, C. Evers, and P. A. Naylor, “Enhancement of noisy reverberant speech using polynomial matrix eigenvalue decomposition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3255-3266, Oct. 2021.

Related Works on PEVD Algorithms

[13] J. G. McWhirter, P. D. Baxter, T. Cooper, S. Redif, and J. Foster, “An EVD algorithm for para-Hermitian polynomial matrices,” IEEE Trans. Signal Process., vol. 55, no. 5, pp. 2158-2169, May 2007.

[14] V. W. Neo and P. A. Naylor, “Second order sequential best rotation algorithm with Householder transformation for polynomial matrix eigenvalue decomposition,” in Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), 2019.

[15] S. Redif, S. Weiss, and J. G. McWhirter, “An approximate polynomial matrix eigenvalue decomposition algorithm for para-Hermitian matrices,” in Proc. Intl. Symp. on Signal Process. and Inform. Technology (ISSPIT), 2011, pp. 421-425.