著者
Naoya Murashima Hirokazu Kameoka Li Li Shogo Seki Shoji Makino
出版者
Research Institute of Signal Processing, Japan
雑誌
Journal of Signal Processing (ISSN:13426230)
巻号頁・発行日
vol.25, no.4, pp.145-149, 2021-07-01 (Released:2021-07-01)
参考文献数
24

This paper deals with single-channel speaker-dependent speech separation. While discriminative approaches using deep neural networks (DNNs) have recently proved powerful, generative approaches, including methods based on non-negative matrix factorization (NMF), are still attractive because of their flexibility in handling the mismatch between training and test conditions. Although NMF-based methods work reasonably well for particular sound sources, one limitation is that they can fail to work for sources with spectrograms that do not comply with the NMF model. To address this problem, attempts have recently been made to replace the NMF model with DNNs. With a similar motivation to these attempts, we propose in this paper a variational autoencoder (VAE)-based monaural source separation (VASS) method using a conditional VAE (CVAE) for source spectrogram modeling. We further propose an extension of the VASS method, called the discriminative VASS (DVASS) method, which uses a discriminative criterion for model training so that the separated signals directly become optimal. Experimental results revealed that the VASS method performed better than an NMF-based method, and the DVASS method performed better than the VASS method.
著者
Shogo SEKI Tomoki TODA Kazuya TAKEDA
出版者
The Institute of Electronics, Information and Communication Engineers
雑誌
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences (ISSN:09168508)
巻号頁・発行日
vol.E101-A, no.7, pp.1057-1064, 2018-07-01

This paper proposes a semi-supervised source separation method for stereophonic music signals containing multiple recorded or processed signals, where synthesized music is focused on the stereophonic music. As the synthesized music signals are often generated as linear combinations of many individual source signals and their respective mixing gains, phase or phase difference information between inter-channel signals, which represent spatial characteristics of recording environments, cannot be utilized as acoustic clues for source separation. Non-negative Tensor Factorization (NTF) is an effective technique which can be used to resolve this problem by decomposing amplitude spectrograms of stereo channel music signals into basis vectors and activations of individual music source signals, along with their corresponding mixing gains. However, it is difficult to achieve sufficient separation performance using this method alone, as the acoustic clues available for separation are limited. To address this issue, this paper proposes a Cepstral Distance Regularization (CDR) method for NTF-based stereo channel separation, which involves making the cepstrum of the separated source signals follow Gaussian Mixture Models (GMMs) of the corresponding the music source signal. These GMMs are trained in advance using available samples. Experimental evaluations separating three and four sound sources are conducted to investigate the effectiveness of the proposed method in both supervised and semi-supervised separation frameworks, and performance is also compared with that of a conventional NTF method. Experimental results demonstrate that the proposed method yields significant improvements within both separation frameworks, and that cepstral distance regularization provides better separation parameters.