Binaural Localization of Multiple Sound Sources by Non-Negative Tensor Factorization

Abstract

This paper presents non-negative factorization of audio signals for the binaural localization of multiple sound sources within realistic and unknown sound environments. Non-negative tensor factorization (NTF) provides a sparse representation of multichannel audio signals in time, frequency, and space that can be exploited in computational audio scene analysis and robot audition for the separation and localization of sound sources. In the proposed formulation, each sound source is represented by means of spectral dictionaries, temporal activation, and its distribution within each channel (here, left and right ears). This distribution, being dependent on the frequency, can be interpreted as an explicit estimation of the Head-Related Transfer Function (HRTF) of a binaural head which can then be converted into the estimated sound source position. Moreover, the semisupervised formulation of the non-negative factorization allows us to integrate prior knowledge about some sound sources of interest whose dictionaries can be learned in advance, whereas the remaining sources are considered as background sound, which remains unknown and is estimated on the fly. The proposed NTF-based sound source localization is applied here to binaural sound source localization of multiple speakers within realistic sound environments.

Publication
in IEEE/ACM Transactions on Audio, Speech, and Language Processing