Authors
Itsuki Ogawa, Masanori Morise
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, and publication date
vol.42, no.3, pp.140-145, 2021-05-01 (Released:2021-05-01)
Number of references
39
Number of citations
10

We have built a singing database that can be used for research purposes. Since recent songs are protected by copyright law, researchers typically use songs that are not subject to copyright. With changes to the copyright law in Japan in 2019, we can now release, under several restrictions, a singing database consisting of songs protected by the law. Our database mainly consists of Japanese pop songs by a professional singer. We collected a total of 50 songs with around 57 minutes of vocals recorded in a studio. After recording, we labeled the phoneme boundaries and converted the songs into the MusicXML format required for the study of statistical parametric singing synthesis. Statistical analysis of the database was then carried out. First, we counted the number of phonemes to clarify their distribution. Second, we performed acoustical analysis on the distribution of pitch, the interval between notes, and duration. Results showed that although the information is biased, the amount of singing is sufficient in light of the findings of a prior study on singing synthesis. The corpus is freely available on our website, https://zunko.jp/kiridev/login.php [1].
Authors
Masanori MORISE, Fumiya YOKOMORI, Kenji OZAWA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages, and publication date
vol.E99.D, no.7, pp.1877-1884, 2016-07-01 (Released:2016-07-01)
Number of references
38
Number of citations
91 571

A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real-time factor (RTF) indicated that it was fast enough for real-time processing.
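The abstract does not define the real-time factor. The sketch below assumes the commonly used definition, processing time divided by the duration of the processed audio, so a value below 1 means the system runs faster than real time. The function names and the `process` callable are hypothetical placeholders and are not part of WORLD.

```python
import time


def rtf_from_times(elapsed_seconds, audio_seconds):
    """Real-time factor under the assumed definition: elapsed / audio duration."""
    return elapsed_seconds / audio_seconds


def real_time_factor(process, samples, sample_rate):
    """Time one call of `process` (a hypothetical analysis or synthesis
    routine) on `samples` and return the resulting real-time factor."""
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return rtf_from_times(elapsed, audio_seconds)
```

For example, processing 2 s of audio in 0.5 s gives `rtf_from_times(0.5, 2.0)`, i.e. 0.25 under this definition.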
Authors
Junya KOGUCHI, Shinnosuke TAKAMICHI, Masanori MORISE, Hiroshi SARUWATARI, Shigeki SAGAYAMA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE TRANSACTIONS on Information and Systems (ISSN:09168532)
Volume, issue, pages, and publication date
vol.E103-D, no.12, pp.2673-2681, 2020-12-01
Number of citations
2

We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits the local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated in spite of its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide better initialization for iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a method for multi-task learning based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; 3) our variance-scaling-based post-filter further improved the synthetic speech.
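As a rough illustration of the idea that each Gaussian fits a local resonance, the sketch below evaluates a weighted sum of Gaussians along the frequency axis and shows how scaling the variances changes peak sharpness. It is not the authors' formulation: the parameterization in Hz, the function names, and the direction of the scaling factor are all assumptions, and the abstract gives no details of the actual post-filter.

```python
import math


def gmm_envelope(freqs_hz, weights, means_hz, stds_hz):
    """Evaluate a weighted sum of Gaussian densities at each frequency.
    Illustrative stand-in for a GMM-approximated spectral envelope."""
    envelope = []
    for f in freqs_hz:
        value = 0.0
        for w, mu, sd in zip(weights, means_hz, stds_hz):
            norm = sd * math.sqrt(2.0 * math.pi)
            value += w * math.exp(-0.5 * ((f - mu) / sd) ** 2) / norm
        envelope.append(value)
    return envelope


def scale_variances(stds_hz, factor):
    """Scale every variance by `factor`; the standard deviation scales by
    sqrt(factor). A factor below 1 narrows and raises each peak
    (assumed direction, for illustration only)."""
    return [sd * math.sqrt(factor) for sd in stds_hz]
```

With a single component centred at 1000 Hz, the envelope peaks at 1000 Hz, and calling `scale_variances` with a factor below 1 before re-evaluating yields a taller, narrower peak.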
Authors
Shinya HORIIKE, Masanori MORISE
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages, and publication date
vol.E103.D, no.5, pp.1199-1202, 2020-05-01 (Released:2020-05-01)
Number of references
13

To improve the likability of speech, we propose a voice conversion algorithm that controls the fundamental frequency (F0) and the spectral envelope, and we carry out a subjective evaluation in which the subjects can manipulate these two speech parameters. The results showed that the subjects preferred speech with a parameter related to higher brightness.
Authors
Hideki Banno, Hiroaki Hata, Masanori Morise, Toru Takahashi, Toshio Irino, Hideki Kawahara
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, issue, pages, and publication date
vol.28, no.3, pp.140-146, 2007 (Released:2007-05-01)
Number of references
19
Number of citations
11 28

A very high quality speech analysis, modification, and synthesis system, STRAIGHT, has now been implemented in C and operates in real time. This article first provides a brief summary of the STRAIGHT components and then introduces the underlying principles that enabled real-time operation. In STRAIGHT, the built-in extended pitch synchronous analysis, which does not require analysis window alignment, plays an important role in the real-time implementation. A detailed description of the processing steps, which are based on the so-called “just-in-time” architecture, is presented. Other issues related to the real-time implementation and its performance measures are also discussed. The software will be available to researchers upon request.