Authors
Yoshiki Masuyama Tsubasa Kusano Kohei Yatabe Yasuhiro Oikawa
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.40, no.3, pp.186-197, 2019-05-01 (Released:2019-05-01)
Number of references
29

For musical instrument sounds containing partials, which are referred to as modes, the decaying processes of the modes significantly affect the timbre of the instrument and characterize the sound. However, accurately decomposing such sounds into modes around the onset is not an easy task, especially when the sounds have sharp onsets and contain non-modal percussive components such as the attack. This is because the sharp onset of a mode produces a peaky but broad spectrum, which makes it difficult to remove the attack component. In this paper, an optimization-based method of modal decomposition is proposed to overcome this difficulty. The proposed method is formulated as a constrained optimization problem that enforces the perfect reconstruction property, which is important for accurate decomposition, together with the causality of the modes. Three numerical simulations and an application to real piano sounds confirm the performance of the proposed method.
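As a concrete illustration of the modal signal model (background only, not the paper's proposed method), the following NumPy sketch with hypothetical modal parameters synthesizes causal, exponentially decaying modes and shows why a sharp onset yields a peaky but broad spectrum:

```python
import numpy as np

fs = 16000                        # sampling rate [Hz]
t = np.arange(fs) / fs            # 1 s of time samples
freqs = [220.0, 440.0, 661.0]     # hypothetical modal frequencies [Hz]
decays = [3.0, 5.0, 8.0]          # hypothetical decay rates [1/s]
onset = 0.1                       # onset time [s]

x = np.zeros_like(t)
for f, a in zip(freqs, decays):
    mode = np.exp(-a * (t - onset)) * np.cos(2 * np.pi * f * (t - onset))
    mode[t < onset] = 0.0         # causality: a mode is silent before its onset
    x += mode

# Truncation at the onset spreads energy into broad skirts around each peak:
spectrum = np.abs(np.fft.rfft(x))
```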
Authors
Kohei Yatabe Yoshiki Masuyama Tsubasa Kusano Yasuhiro Oikawa
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.40, no.3, pp.170-177, 2019-05-01 (Released:2019-05-01)
Number of references
48
Number of citations
9

As the importance of the phase of the complex spectrogram has become widely recognized, many techniques have been proposed for handling it. However, several definitions and terminologies for the same concept can be found in the literature, which can confuse beginners. In this paper, two major definitions of the short-time Fourier transform and their phase conventions are summarized to alleviate this confusion. A phase-aware signal-processing scheme based on phase conversion is also introduced with a set of executable MATLAB functions (https://doi.org/10/c3qb).
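For reference, the two conventions differ only by a deterministic phase factor, as the following NumPy sketch shows (function and parameter names are assumptions; the paper's own executable MATLAB functions are at the DOI above):

```python
import numpy as np

def convert_phase_convention(X, hop, nfft):
    """Multiply the STFT X (shape: n_freq x n_frames, bins k = 0..nfft//2)
    by exp(+j 2 pi k m hop / nfft) to move from the convention whose phase
    is referenced to the absolute time axis to the one referenced to the
    moving window (the inverse conversion uses the conjugate factor)."""
    k = np.arange(X.shape[0])[:, None]    # frequency-bin indices
    m = np.arange(X.shape[1])[None, :]    # frame indices
    return X * np.exp(2j * np.pi * k * m * hop / nfft)
```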
Authors
Kei Sawada Kei Hashimoto Keiichiro Oura Yoshihiko Nankaku Keiichi Tokuda
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.39, no.2, pp.119-129, 2018-03-01 (Released:2018-03-01)
Number of references
35

This paper proposes a method for constructing text-to-speech (TTS) systems for languages with unknown pronunciations. One goal of speech synthesis research is to establish a framework that can be used to construct TTS systems for any written language. Generally, language-specific knowledge is required to construct a TTS system for a new language. However, such knowledge is difficult to acquire for each new language, so constructing a TTS system for a new language entails considerable cost. To address this problem, we investigate a framework for automatically constructing a TTS system from a target-language database consisting only of speech data and the corresponding Unicode texts. In the proposed method, pseudo phonetic information for the target language with unknown pronunciation is obtained using a speech recognizer for a resource-rich proxy language. Then, a grapheme-to-phoneme converter and a statistical parametric speech synthesizer are constructed from the obtained pseudo phonetic information. The proposed method was applied to Japanese and evaluated in terms of objective and subjective measures. Additionally, we attempted to construct TTS systems for nine Indian languages using the proposed method; these systems were evaluated in the Blizzard Challenge 2014 and 2015.
Authors
Daichi Kitamura
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.40, no.3, pp.155-161, 2019-05-01 (Released:2019-05-01)
Number of references
35
Number of citations
4

Nonnegative matrix factorization (NMF) is a powerful technique for extracting meaningful patterns from an observed matrix and has been used in many applications in the audio signal processing field. In this article, the principle of NMF and some extensions based on a complex generative model are reviewed, and their application to audio source separation is presented.
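As a reminder of the basic principle (the standard multiplicative-update NMF for the generalized KL divergence, not the complex generative extensions reviewed in the article), a minimal sketch:

```python
import numpy as np

def nmf_kl(V, n_bases, n_iter=200, eps=1e-12):
    """Factorize a nonnegative matrix V (e.g., a magnitude spectrogram)
    as V ~ W @ H using Lee-Seung multiplicative updates for the
    generalized KL divergence."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, n_bases)) + eps    # spectral bases
    H = rng.random((n_bases, T)) + eps    # temporal activations
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W, H
```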
Authors
Shinnosuke Takamichi Ryosuke Sonobe Kentaro Mitsui Yuki Saito Tomoki Koriyama Naoko Tanji Hiroshi Saruwatari
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.41, no.5, pp.761-768, 2020-09-01 (Released:2020-09-01)
Number of references
50

In this paper, we develop two corpora for speech synthesis research. Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we aim to develop Japanese voice corpora that are easily accessible not only to academic institutions but also to commercial companies. In this paper, we construct the JSUT and JVS corpora, designed mainly for text-to-speech synthesis and voice conversion, respectively. The JSUT corpus contains 10 hours of reading-style speech uttered by a single speaker, and the JVS corpus contains 30 hours of speech in three styles uttered by 100 speakers. This paper describes how we designed the corpora and summarizes their specifications. The corpora are available on our project pages.
Authors
Akira Nishimura Nobuo Koizumi
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.31, no.2, pp.172-180, 2010-03-01 (Released:2010-03-01)
Number of references
15
Number of citations
1

A method of sampling-jitter measurement based on time-domain analytic signals is proposed. Computer simulations and actual measurements were performed to compare the proposed method with the conventional method, in which jitter is evaluated from the amplitudes of the sideband spectra of observed signals in the frequency domain. The results show that the proposed method is effective in that it 1) provides high temporal resolution as a result of the direct derivation of the jitter waveform, 2) achieves higher accuracy in the measurement of jitter amplitude, and 3) can separate phase modulation that originates in sampling jitter from amplitude modulation that originates in the digital-to-analog and analog-to-digital conversion processes. Suitable measurement conditions and procedures for separating the effects of jitter in a digital-to-analog converter from those in an analog-to-digital converter are described.
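The core idea can be sketched as follows (the test tone, jitter waveform, and parameter values are hypothetical): form the analytic signal of a recorded tone via the Hilbert transform and read the jitter off the instantaneous phase:

```python
import numpy as np
from scipy.signal import hilbert

fs = 48000
f0 = 997.0                                   # test-tone frequency [Hz]
t = np.arange(fs) / fs
jitter = 1e-6 * np.sin(2 * np.pi * 50 * t)   # 1 us sinusoidal jitter at 50 Hz
x = np.sin(2 * np.pi * f0 * (t + jitter))    # stand-in for a jittered capture

analytic = hilbert(x)                        # time-domain analytic signal
phase = np.unwrap(np.angle(analytic))        # instantaneous phase
residual = phase - 2 * np.pi * f0 * t        # remove the carrier
jitter_est = residual / (2 * np.pi * f0)     # phase deviation in time units [s]
jitter_est -= jitter_est.mean()              # drop the constant phase offset
# np.abs(analytic) gives the amplitude envelope, so AM is separated from PM.
```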
Authors
Shoichi Koyama
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.41, no.1, pp.269-275, 2020-01-01 (Released:2020-01-06)
Number of references
33

Estimating and interpolating a sound field from measurements made with multiple microphones are fundamental tasks in sound field analysis for sound field reconstruction. Sound field reconstruction inside a source-free region is achieved by decomposing the sound field into plane-wave or harmonic functions. When the target region includes sources, it is necessary to impose some assumptions on the sources. Recently, it has become increasingly popular to apply sparse representation algorithms to various sound field analysis methods. In this paper, we present an overview of sparsity-based sound field reconstruction methods and also demonstrate their application to sound field recording and reproduction.
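As a rough illustration of the sparsity-based approach (the geometry, frequency, and solver are hypothetical choices, not taken from the paper), the following sketch decomposes microphone pressures into a sparse set of plane waves with a basic ISTA loop:

```python
import numpy as np

c, f = 343.0, 500.0
k = 2 * np.pi * f / c                             # wavenumber [rad/m]
rng = np.random.default_rng(0)
mics = rng.uniform(-0.5, 0.5, size=(16, 2))       # 16 mics in a 1 m square
angles = np.linspace(0, 2 * np.pi, 180, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
D = np.exp(-1j * k * mics @ dirs.T)               # plane-wave dictionary

# Pressures generated by a single plane wave plus measurement noise:
p = D[:, 20] + 0.01 * (rng.standard_normal(16) + 1j * rng.standard_normal(16))

# ISTA for min_a 0.5*||D a - p||^2 + lam*||a||_1 (complex soft thresholding):
a = np.zeros(len(angles), dtype=complex)
step = 1.0 / np.linalg.norm(D, 2) ** 2
lam = 0.1
for _ in range(500):
    g = a - step * (D.conj().T @ (D @ a - p))
    a = np.maximum(np.abs(g) - step * lam, 0.0) * np.exp(1j * np.angle(g))
```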
Authors
Katsuhiko Yamamoto Toshio Irino Toshie Matsui Shoko Araki Keisuke Kinoshita Tomohiro Nakatani
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.40, no.2, pp.84-92, 2019-03-01 (Released:2019-03-01)
Number of references
27
Number of citations
2

The speech-based envelope power spectrum model (sEPSM) was developed to predict the speech intelligibility of sounds produced by nonlinear speech enhancement algorithms such as spectral subtraction. It is a linear model with a linear, level-independent gammatone (GT) filterbank as the front end. Therefore, it seems difficult for the model to evaluate speech sounds at low and high sound pressure levels (SPLs) consistently, because speech intelligibility depends on the SPL as well as on the signal-to-noise ratio. In this study, the sEPSM was extended with a dynamic compressive gammachirp (dcGC) auditory filterbank and a ``common'' normalization factor for the modulation power spectrum component to improve the prediction accuracy of the model. To evaluate the proposed model, we performed subjective experiments on the intelligibility of speech sounds enhanced by spectral subtraction and a Wiener filter algorithm. We compared the subjective speech intelligibility scores with the objective scores predicted by the proposed dcGC-sEPSM, the original GT-sEPSM, and well-known conventional methods such as the short-time objective intelligibility measure (STOI), the coherence speech intelligibility index (CSII), and the hearing aid speech perception index (HASPI). The results show that the proposed dcGC-sEPSM predicted the subjective results better than the other methods did.
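For orientation, the following is a rough sketch of the envelope-power SNR idea underlying sEPSM-type models; the band layout, filters, and normalization here are simplified placeholders and do not reproduce the proposed dcGC extension:

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def envelope_power(x, fs, band=(500.0, 1000.0), mod_cut=8.0):
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    env = np.abs(hilbert(sosfilt(sos, x)))    # Hilbert envelope of one band
    dc = env.mean()
    sos_m = butter(2, mod_cut, btype="lowpass", fs=fs, output="sos")
    ac = sosfilt(sos_m, env - dc)             # crude low-frequency modulation channel
    return np.mean(ac ** 2) / dc ** 2         # envelope power, DC-normalized

def snr_env(noisy, noise, fs):
    # SNR in the envelope domain: modulation power of noisy speech vs. noise alone.
    p_sn, p_n = envelope_power(noisy, fs), envelope_power(noise, fs)
    return max(p_sn - p_n, 1e-12) / p_n
```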
Authors
M. Charles Liberman
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.41, no.1, pp.59-62, 2020-01-01 (Released:2020-01-06)
Number of references
42
Number of citations
1 2

In acquired sensorineural hearing loss, the hearing impairment arises mainly from damage to cochlear hair cells or the sensory fibers of the auditory nerve that innervate them. Hair cell loss or damage is well captured by changes in the threshold audiogram, but the degree of neural damage is not. We have recently shown, in animal models of noise damage and aging, and in autopsy specimens from aging humans, that the synapses connecting inner hair cells and auditory nerve fibers are the first to degenerate. This primary neural degeneration, or cochlear synaptopathy, leaves many surviving inner hair cells permanently disconnected from their sensory innervation, and many spiral ganglion cells surviving with only their central projections to the brainstem intact. This pathology represents a kind of ``hidden hearing loss.'' This review summarizes current speculations as to the functional consequences of this primary neural degeneration and the prospects for a therapeutic rescue based on local delivery of neurotrophins to elicit neurite extension and synaptogenesis in the adult ear.
Authors
Masayuki Nishiguchi
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.27, no.6, pp.375-383, 2006 (Released:2006-11-01)
Number of references
19

A coding algorithm for speech called harmonic vector excitation coding (HVXC) has been developed that encodes speech at very low bit rates (2.0–4.0 kbit/s). It breaks speech signals down into two types of segments: voiced segments, for which a parametric representation of the harmonic spectral magnitudes of the LPC residual signals is used; and unvoiced segments, for which the CELP coding algorithm is used. This combination provides near-toll-quality speech at 4.0 kbit/s and communication-quality speech at 2.0 kbit/s, thus outperforming the FS1016 4.8-kbit/s CELP. This paper discusses the encoder and decoder algorithms of HVXC, including fast harmonic synthesis, time-scale modification, and pitch-change decoding. Owing to its high coding efficiency and new functionality, HVXC has been adopted as the ISO/IEC International Standard for MPEG-4 audio.
Authors
Yuki Saito Taiki Nakamura Yusuke Ijima Kyosuke Nishida Shinnosuke Takamichi
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.42, no.1, pp.1-11, 2021-01-01 (Released:2021-01-01)
Number of references
34

We propose non-parallel and many-to-many voice conversion (VC) using variational autoencoders (VAEs), which constructs VC models for converting arbitrary speakers' characteristics into those of other arbitrary speakers without parallel speech corpora for training. Although VAEs conditioned on one-hot speaker codes can achieve non-parallel VC, the phonetic content of the converted speech tends to vanish, resulting in degraded speech quality. Another issue is that they cannot deal with unseen speakers not included in the training corpora. To overcome these issues, we incorporate deep-neural-network-based automatic speech recognition (ASR) and automatic speaker verification (ASV) into the VAE-based VC. Since the phonetic content is given as phonetic posteriorgrams predicted by the ASR models, the proposed VC avoids the quality degradation. Our VC utilizes d-vectors extracted from the ASV models as continuous speaker representations that can deal with unseen speakers. Experimental results demonstrate that our VC outperforms the conventional VAE-based VC in terms of mel-cepstral distortion and converted speech quality. We also investigate the effects of hyperparameters in our VC and reveal that 1) a large d-vector dimensionality that gives better ASV performance does not necessarily improve the converted speech quality, and 2) a large number of pre-stored speakers improves the quality.
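The conditioning scheme can be visualized with a schematic forward pass (dimensions and weights are hypothetical stand-ins for trained networks): the decoder receives the latent code together with a phonetic posteriorgram (PPG) from the ASR model and a d-vector from the ASV model, so phonetic content and speaker identity are supplied explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, D_LAT, D_PPG, D_DVEC = 40, 16, 144, 256   # hypothetical dimensions

# Random weights stand in for trained encoder/decoder networks.
We = rng.standard_normal((D_FEAT, 2 * D_LAT)) * 0.01
be = np.zeros(2 * D_LAT)
Wd = rng.standard_normal((D_LAT + D_PPG + D_DVEC, D_FEAT)) * 0.01
bd = np.zeros(D_FEAT)

x = rng.standard_normal(D_FEAT)                   # source spectral frame
ppg = rng.random(D_PPG); ppg /= ppg.sum()         # ASR phonetic posteriorgram
dvec = rng.standard_normal(D_DVEC)                # target-speaker d-vector (ASV)

stats = x @ We + be                               # encoder output
mu, logvar = stats[:D_LAT], stats[D_LAT:]
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(D_LAT)   # reparameterization
y = np.tanh(np.concatenate([z, ppg, dvec]) @ Wd + bd)        # decoder: z + PPG + d-vector
```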
Authors
Eberhard Zwicker Hugo Fastl Ulrich Widmann Kenji Kurakata Sonoko Kuwano Seiichiro Namba
Publisher
Acoustical Society of Japan
Journal
Journal of the Acoustical Society of Japan (E) (ISSN:03882861)
Volume, issue, pages, publication date
vol.12, no.1, pp.39-42, 1991 (Released:2011-02-17)
Number of references
12
Number of citations
46 69

The method for calculating loudness level proposed by Zwicker is standardized in ISO 532B. This is a graphical procedure, and calculating loudness level with it can be tedious. Recently, DIN 45631 was revised to include a computer program for calculating loudness level in BASIC that runs on IBM-compatible PCs. Since the NEC PC-9801 series computers are popular in Japan, the program has been adapted for the NEC PC-9801 series and is introduced in this paper.
Authors
Daiki Takeuchi Kohei Yatabe Yuma Koizumi Yasuhiro Oikawa Noboru Harada
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, publication date
vol.41, no.5, pp.769-775, 2020-09-01 (Released:2020-09-01)
Number of references
39

In recent single-channel speech enhancement, deep neural networks (DNNs) have played an important role in achieving high performance. One standard use of a DNN is to construct a mask-generating function for time-frequency (T-F) masking. For applying a mask in the T-F domain, the short-time Fourier transform (STFT) is usually utilized because of its well-understood and invertible nature. While the mask-generating regression function has been studied for a long time, there is less research on the T-F transform from the viewpoint of speech enhancement. Since the performance of speech enhancement depends on both the T-F mask estimator and the T-F transform, investigating the T-F transform should be beneficial for designing a better enhancement system. In this paper, as a step toward an optimal T-F transform for speech enhancement, we experimentally investigated the effect of the STFT parameter settings on a DNN-based mask estimator. We conducted experiments using three types of DNN architectures with three types of loss functions, and the results suggest that U-Net is robust to the parameter settings, whereas fully connected and BLSTM networks are not.
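A minimal sketch of the T-F masking pipeline in question, with an oracle magnitude mask standing in for the DNN estimate (the signals are placeholders); nperseg and hop are exactly the kind of STFT parameters the paper varies:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
speech = rng.standard_normal(fs)                  # placeholder signals
noise = 0.5 * rng.standard_normal(fs)
noisy = speech + noise

nperseg, hop = 512, 128                           # STFT parameters under study
_, _, S = stft(speech, fs, nperseg=nperseg, noverlap=nperseg - hop)
_, _, X = stft(noisy, fs, nperseg=nperseg, noverlap=nperseg - hop)

mask = np.clip(np.abs(S) / (np.abs(X) + 1e-12), 0.0, 1.0)    # oracle mask
_, enhanced = istft(mask * X, fs, nperseg=nperseg, noverlap=nperseg - hop)
```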