Authors
Asuka NAKAJIMA Takuya WATANABE Eitaro SHIOJI Mitsuaki AKIYAMA Maverick WOO
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E103.D, no.7, pp.1524-1540, 2020-07-01 (Released:2020-07-01)
Number of references
40

With our ever-increasing dependence on computers, many governments around the world have started to investigate strengthening the regulations on vulnerabilities and their lifecycle management. Although many previous works have studied this problem space for mainstream software packages and web applications, relatively few have studied it for consumer IoT devices. As our first step towards filling this void, this paper presents a pilot study on the vulnerability disclosures and patch releases of three prominent consumer IoT vendors in Japan and three in the United States. Our goals include (i) characterizing the trends and risks in the vulnerability lifecycle management of consumer IoT devices using accurate long-term data, and (ii) identifying problems, challenges, and potential approaches for future studies of this problem space. To this end, we collected all published vulnerabilities and patches related to the consumer IoT products of the included vendors between 2006 and 2017; then, we analyzed our dataset from multiple perspectives, such as the severity of the included vulnerabilities and the timing of the included patch releases with respect to the corresponding disclosures and exploits. Our work has uncovered several important findings that may inform future studies. These findings include (i) a stark contrast between how the vulnerabilities in our dataset were disclosed in the two markets, (ii) three alarming practices by the included vendors that may significantly increase the risk of 1-day exploits for customers, and (iii) challenges in data collection, including crawling automation and long-term data availability. For each finding, we also discuss its consequences and/or potential mitigations or suggestions.
Authors
Takaaki SAEKI Yuki SAITO Shinnosuke TAKAMICHI Hiroshi SARUWATARI
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E104.D, no.7, pp.1002-1016, 2021-07-01 (Released:2021-07-01)
Number of references
41
Number of citations
2

This paper proposes two high-fidelity and computationally efficient neural voice conversion (VC) methods based on direct waveform modification using spectral differentials. The conventional spectral-differential VC method with a minimum-phase filter achieves high-quality conversion for narrow-band (16 kHz-sampled) VC but incurs a heavy computational cost in filtering. This is because the minimum phase obtained using a fixed lifter of the Hilbert transform often results in a long-tap filter. Furthermore, when we extend the method to full-band (48 kHz-sampled) VC, the computational cost is heavy due to the increased number of sampling points, and the converted-speech quality degrades due to large fluctuations in the high-frequency band. To construct a short-tap filter, we propose a lifter-training method for data-driven phase reconstruction that trains a lifter of the Hilbert transform by taking filter truncation into account. We also propose a frequency-band-wise modeling method based on sub-band multi-rate signal processing (sub-band modeling method) for full-band VC. It enhances the computational efficiency by reducing the sampling points of signals converted with filtering and improves converted-speech quality by modeling only the low-frequency band. We conducted several objective and subjective evaluations to investigate the effectiveness of the proposed methods through an implementation of the real-time, online, full-band VC system we developed, which is based on the proposed methods. The results indicate that 1) the proposed lifter-training method for narrow-band VC can shorten the tap length to 1/16 without degrading the converted-speech quality, 2) the proposed sub-band modeling method for full-band VC can improve the converted-speech quality while reducing the computational cost, and 3) our real-time, online, full-band VC system can convert 48 kHz-sampled speech in real time, attaining converted speech with a mean opinion score of naturalness of 3.6 out of 5.0.
Authors
Mariana RODRIGUES MAKIUCHI Tifani WARNITA Nakamasa INOUE Koichi SHINODA Michitaka YOSHIMURA Momoko KITAZAWA Kei FUNAKI Yoko EGUCHI Taishiro KISHIMOTO
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E104.D, no.11, pp.1930-1940, 2021-11-01 (Released:2021-11-01)
Number of references
71
Number of citations
6

We propose a non-invasive and cost-effective method to automatically detect dementia using speech audio data alone. We extract paralinguistic features from a short speech segment and use Gated Convolutional Neural Networks (GCNN) to classify it as dementia or healthy. We evaluate our method on the Pitt Corpus and on our own dataset, the PROMPT Database. Our method achieves an accuracy of 73.1% on the Pitt Corpus using an average of 114 seconds of speech data. On the PROMPT Database, our method achieves an accuracy of 74.7% using 4 seconds of speech data, which improves to 80.8% when we use all of the patient's speech data. Furthermore, we evaluate our method on a three-class classification problem that includes the Mild Cognitive Impairment (MCI) class and achieve an accuracy of 60.6% with 40 seconds of speech data.
Authors
Daiki CHIBA Ayako AKIYAMA HASEGAWA Takashi KOIDE Yuta SAWABE Shigeki GOTO Mitsuaki AKIYAMA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E103.D, no.7, pp.1493-1511, 2020-07-01 (Released:2020-07-01)
Number of references
70
Number of citations
3

Internationalized domain names (IDNs) are abused to create domain names that are visually similar to those of legitimate/popular brands. In this work, we systematize such domain names, which we call deceptive IDNs, and analyze the risks associated with them. In particular, we propose a new system called DomainScouter to detect various deceptive IDNs and calculate a deceptive IDN score, a new metric indicating the number of users that are likely to be misled by a deceptive IDN. We perform a comprehensive measurement study on the identified deceptive IDNs using over 4.4 million registered IDNs under 570 top-level domains (TLDs). The measurement results demonstrate that there are many previously unexplored deceptive IDNs targeting non-English brands or combining other domain squatting methods. Furthermore, we conduct online surveys to examine and highlight vulnerabilities in user perceptions when encountering such IDNs. Finally, we discuss the practical countermeasures that stakeholders can take against deceptive IDNs.
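As a small illustration of why IDNs can look deceptive, the sketch below converts a Unicode domain name to its ASCII-compatible (Punycode) form and lists the non-ASCII characters it contains. It uses only Python's built-in `idna` codec and `unicodedata`; the helper names `to_ascii` and `non_ascii_chars` and the sample domain are hypothetical, and this is not the DomainScouter detection method described in the abstract.

```python
import unicodedata


def to_ascii(domain):
    """Return the ASCII-compatible (Punycode) form of a Unicode domain name."""
    # Python's built-in "idna" codec applies IDNA 2003 rules label by label.
    return domain.encode("idna").decode("ascii")


def non_ascii_chars(domain):
    """List each non-ASCII character in the domain with its Unicode name."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in domain if ord(ch) > 127]


if __name__ == "__main__":
    sample = "b\u00fccher.example"  # hypothetical sample domain
    print(to_ascii(sample))
    print(non_ascii_chars(sample))
```

A user typically sees the Unicode form, while the DNS resolves the `xn--` form, so comparing the two is a common first step when inspecting a suspicious name.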
Authors
Graham NEUBIG Masato MIMURA Shinsuke MORI Tatsuya KAWAHARA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E95.D, no.2, pp.614-625, 2012-02-01 (Released:2012-02-01)
Number of references
40
Number of citations
11 24 6

We propose a novel scheme to learn a language model (LM) for automatic speech recognition (ASR) directly from continuous speech. In the proposed method, we first generate phoneme lattices using an acoustic model with no linguistic constraints, then perform training over these phoneme lattices, simultaneously learning both lexical units and an LM. As a statistical framework for this learning problem, we use non-parametric Bayesian statistics, which make it possible to balance the learned model's complexity (such as the size of the learned vocabulary) and expressive power, and provide a principled learning algorithm through the use of Gibbs sampling. Implementation is performed using weighted finite state transducers (WFSTs), which allow for the simple handling of lattice input. Experimental results on natural, adult-directed speech demonstrate that LMs built using only continuous speech are able to significantly reduce ASR phoneme error rates. The proposed technique of joint Bayesian learning of lexical units and an LM over lattices is shown to significantly contribute to this improvement.
Authors
Takaharu KATO Ikuko SHIMIZU Tomas PAJDLA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E105.D, no.9, pp.1590-1599, 2022-09-01 (Released:2022-09-01)
Number of references
39

Selecting visually overlapping image pairs without any prior information is an essential task of large-scale structure from motion (SfM) pipelines. To address this problem, many state-of-the-art image retrieval systems adopt the idea of bag of visual words (BoVW) for computing image-pair similarity. In this paper, we present a method for improving the image pair selection using BoVW. Our method combines a conventional vector-based approach and a set-based approach. For the set similarity, we introduce a modified version of the Simpson (m-Simpson) coefficient. We show the advantage of this measure over three typical set similarity measures and demonstrate that the combination of vector similarity and the m-Simpson coefficient effectively reduces false positives and increases accuracy. To discuss the choice of vocabulary construction, we prepared both a sampled vocabulary on an evaluation dataset and a basic pre-trained vocabulary on a training dataset. In addition, we tested our method on vocabularies of different sizes. Our experimental results show that the proposed method dramatically improves precision scores especially on the sampled vocabulary and performs better than the state-of-the-art methods that use pre-trained vocabularies. We further introduce a method to determine the k value of top-k relevant searches for each image and show that it obtains higher precision at the same recall.
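For readers unfamiliar with set-based similarity, the sketch below computes the standard Simpson (overlap) coefficient and, for comparison, the Jaccard index over two sets of visual-word IDs. The abstract does not define the modified m-Simpson coefficient, so only the textbook measures are shown; the function names and the sample sets are hypothetical.

```python
def simpson(a, b):
    """Standard Simpson (overlap) coefficient: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical sets of visual-word IDs observed in two images.
    words_a = {1, 2, 3, 4}
    words_b = {3, 4, 5, 6, 7, 8}
    print(simpson(words_a, words_b), jaccard(words_a, words_b))
```

Because the Simpson coefficient normalizes by the smaller set, it is less sensitive than Jaccard to one image containing many more visual words than the other.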
Authors
Yuya KAMATAKI Yusuke KAMEDA Yasuyo KITA Ichiro MATSUDA Susumu ITOH
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E104.D, no.10, pp.1572-1575, 2021-10-01 (Released:2021-10-01)
Number of references
11

This paper proposes a lossless coding method for HDR color images stored in a floating-point format called Radiance RGBE. In this method, the three mantissa parts and the common exponent part, each of which is represented in 8-bit depth, are encoded using the block-adaptive prediction technique with some modifications that take the data structure into account.
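To make the data structure concrete, the sketch below decodes one RGBE pixel (three 8-bit mantissas plus a shared 8-bit exponent) into floating-point RGB. It follows the commonly used decoding convention with an exponent bias of 128 and an 8-bit mantissa shift; some decoders add 0.5 to each mantissa before scaling, which is not done here. The function name is hypothetical, and this only illustrates the pixel format, not the proposed coding method.

```python
import math


def rgbe_to_float(r, g, b, e):
    """Decode one RGBE pixel (four 8-bit integers) to floating-point RGB."""
    if e == 0:
        # An exponent of zero conventionally encodes black.
        return (0.0, 0.0, 0.0)
    # Shared scale factor: 2 ** (e - 128) divided by 256 for the 8-bit mantissas.
    scale = math.ldexp(1.0, e - (128 + 8))
    return (r * scale, g * scale, b * scale)


if __name__ == "__main__":
    print(rgbe_to_float(128, 64, 0, 129))
```

Since all three channels share one exponent, the mantissas of a pixel are strongly coupled, which is the kind of structure a format-aware predictor can exploit.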
Authors
Daisuke OKU Kotaro TERADA Masato HAYASHI Masanao YAMAOKA Shu TANAKA Nozomu TOGAWA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE TRANSACTIONS on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E102-D, no.9, pp.1696-1706, 2019-09-01
Number of citations
22

Combinatorial optimization problems with a large solution space are difficult to solve using von Neumann computers alone. Ising machines, or annealing machines, have been developed as promising non-von Neumann computers to tackle these problems. In order to use these annealing machines, every combinatorial optimization problem is mapped onto the physical Ising model, which consists of spins, interactions between them, and their external magnetic fields. The annealing machines then search for the ground state of the physical Ising model, which corresponds to the optimal solution of the original combinatorial optimization problem. A combinatorial optimization problem can first be described by an ideal fully-connected Ising model, but it is very hard to embed it onto the physical Ising model topology of a particular annealing machine, which is one of the largest issues in annealing machines. In this paper, we propose a fully-connected Ising model embedding method targeting CMOS annealing machines. The key idea is that the proposed method replicates every logical spin in a fully-connected Ising model and embeds each logical spin onto physical spins with the same chain length. Experimental results on an actual combinatorial problem show that the proposed method obtains spin embeddings superior to those of the conventional de facto standard method in terms of the embedding time and the probability of obtaining a feasible solution.
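The mapping from spins, interactions, and external fields to an energy whose minimum is the ground state can be sketched as follows. The code assumes the sign convention H(s) = -sum J_ij s_i s_j - sum h_i s_i (other texts flip the signs) and finds the ground state by exhaustive search, which is only feasible for a handful of spins. The function names and the three-spin instance are hypothetical; the sketch does not implement the proposed embedding method or an annealing machine.

```python
from itertools import product


def ising_energy(spins, couplings, fields):
    """Energy H(s) = -sum_ij J_ij * s_i * s_j - sum_i h_i * s_i (assumed convention)."""
    energy = 0.0
    for (i, j), coupling in couplings.items():
        energy -= coupling * spins[i] * spins[j]
    for i, field in enumerate(fields):
        energy -= field * spins[i]
    return energy


def brute_force_ground_state(couplings, fields):
    """Enumerate all 2**n spin configurations and return the lowest-energy one."""
    n = len(fields)
    return min(product((-1, 1), repeat=n),
               key=lambda s: ising_energy(s, couplings, fields))


if __name__ == "__main__":
    # Hypothetical three-spin instance: spins 0 and 1 prefer to align,
    # spins 1 and 2 prefer to anti-align, and a field biases spin 0 upward.
    J = {(0, 1): 1.0, (1, 2): -1.0}
    h = [0.5, 0.0, 0.0]
    best = brute_force_ground_state(J, h)
    print(best, ising_energy(best, J, h))
```

The exhaustive search grows as 2^n, which is why hardware that searches the energy landscape directly becomes attractive for larger problems.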
Authors
Kei SAWADA Akira TAMAMORI Kei HASHIMOTO Yoshihiko NANKAKU Keiichi TOKUDA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE TRANSACTIONS on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E99-D, no.12, pp.3119-3131, 2016-12-01

This paper proposes a Bayesian approach to image recognition based on separable lattice hidden Markov models (SL-HMMs). The geometric variations of the object to be recognized, e.g., size, location, and rotation, are an essential problem in image recognition. SL-HMMs, which have been proposed to reduce the effect of geometric variations, can perform elastic matching both horizontally and vertically. This makes it possible to model not only invariances to the size and location of the object but also nonlinear warping in both dimensions. The maximum likelihood (ML) method has been used in training SL-HMMs. However, in some image recognition tasks, it is difficult to acquire sufficient training data, and the ML method suffers from the over-fitting problem when there is insufficient training data. This study aims to accurately estimate SL-HMMs using the maximum a posteriori (MAP) and variational Bayesian (VB) methods. The MAP and VB methods can utilize prior distributions representing useful prior information, and the VB method is expected to obtain high generalization ability by marginalization of model parameters. Furthermore, to overcome the local maximum problem in the MAP and VB methods, the deterministic annealing expectation maximization algorithm is applied for training SL-HMMs. Face recognition experiments performed on the XM2VTS database indicated that the proposed method offers significantly improved image recognition performance. Additionally, comparative experiment results showed that the proposed method was more robust to geometric variations than convolutional neural networks.
Authors
Kazuhiro NAKAMURA Kei HASHIMOTO Yoshihiko NANKAKU Keiichi TOKUDA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE TRANSACTIONS on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E97-D, no.6, pp.1438-1448, 2014-06-01

This paper proposes a novel approach for integrating spectral feature extraction and acoustic modeling in hidden Markov model (HMM) based speech synthesis. The statistical modeling process of speech waveforms is typically divided into two component modules: the frame-by-frame feature extraction module and the acoustic modeling module. In the feature extraction module, the statistical mel-cepstral analysis technique has been used and the objective function is the likelihood of mel-cepstral coefficients for given speech waveforms. In the acoustic modeling module, the objective function is the likelihood of model parameters for given mel-cepstral coefficients. It is important to improve the performance of each component module for achieving higher quality synthesized speech. However, the final objective of speech synthesis systems is to generate natural speech waveforms from given texts, and the improvement of each component module does not always lead to the improvement of the quality of synthesized speech. Therefore, ideally all objective functions should be optimized based on an integrated criterion which well represents subjective speech quality of human perception. In this paper, we propose an approach to model speech waveforms directly and optimize the final objective function. Experimental results show that the proposed method outperformed the conventional methods in objective and subjective measures.
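Authors
Tomo NIIZUMA Hideaki GOTO
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E100.D, no.3, pp.511-519, 2017-03-01 (Released:2017-03-01)
Number of references
24
Number of citations
3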

Wireless LAN (WLAN) roaming systems, such as eduroam, enable the mutual use of WLAN facilities among multiple organizations. As a consequence of the strong demand for WLAN roaming, it is utilized not only at universities and schools but also at the venues of large events such as concerts, conferences, and sports events. Moreover, it has also been reported that WLAN roaming is useful in areas afflicted by natural disasters. This paper presents a novel WLAN roaming system over Wireless Mesh Networks (WMNs) that is useful for the use cases shown above. The proposed system is based on two methods as follows: 1) Automatic authentication path generation method decreases the WLAN roaming system deployment costs including the wiring cost and configuration cost. Although the wiring cost can be reduced by using WMN technologies, some additional configurations are still required if we want to deploy a secure user authentication mechanism (e.g. IEEE 802.1X) on WLAN systems. In the proposed system, the Access Points (APs) can act as authenticators automatically using RadSec instead of RADIUS. Therefore, the network administrators can deploy 802.1X-based authentication systems over WMNs without additional configurations on-site. 2) Local authentication method makes the system deployable in times of natural disasters, in particular when the upper network is unavailable or some authentication servers or proxies are down. In the local authentication method, users and APs can be authenticated at the WMN by locally verifying the digital certificates as the authentication credentials.
Authors
Takashi NORIMATSU Yuichi NAKAMURA Toshihiro YAMAUCHI
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE TRANSACTIONS on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E106-D, no.9, pp.1364-1379, 2023-09-01

Two problems occur when an authorization server is utilized for a use case where a different security profile needs to be applied to a unique client request for accessing a distinct type of API, such as open banking. A security profile can be applied to a client request by using the settings of an authorization server and client. However, this method can only apply the same security profile to all client requests. Therefore, multiple authorization servers or isolated environments, such as realms of an authorization server, are needed to apply a different security profile. However, this increases managerial costs for the authorization server administration. Moreover, new settings and logic need to be added to an authorization server if the existing client settings are inadequate for applying a security profile, which requires modification of the authorization server's source code. We propose a policy-based method that resolves these problems. The proposed method does not completely rely on the settings of a client and can determine the security profile to apply using a policy and the context of the client's request. Therefore, only one authorization server or isolated environment, such as a realm of an authorization server, is required to support multiple different security profiles. Additionally, the proposed method can implement a security profile as a pluggable software module. Thus, the source code of the authorization server need not be modified. The proposed method and Financial-grade application programming interface (FAPI) security profiles were implemented in Keycloak, which is an open-source identity and access management solution, and evaluation scenarios were executed. The results of the evaluation confirmed that the proposed method resolves these problems. The implementation has been contributed to Keycloak, making the proposed method and FAPI security profiles publicly available.
Authors
Kiyoshi KURIHARA Nobumasa SEIYAMA Tadashi KUMANO
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E104.D, no.2, pp.302-311, 2021-02-01 (Released:2021-02-01)
Number of references
35
Number of citations
11

This paper describes a method to control prosodic features using phonetic and prosodic symbols as input of attention-based sequence-to-sequence (seq2seq) acoustic modeling (AM) for neural text-to-speech (TTS). The method involves inserting a sequence of prosodic symbols between phonetic symbols that are then used to reproduce prosodic acoustic features, i.e. accents, pauses, accent breaks, and sentence endings, in several seq2seq AM methods. The proposed phonetic and prosodic labels have simple descriptions and a low production cost. By contrast, the labels of conventional statistical parametric speech synthesis methods are complicated, and the cost of time alignments such as aligning the boundaries of phonemes is high. The proposed method does not need the boundary positions of phonemes. We propose an automatic conversion method for conventional labels and show how to automatically reproduce pitch accents and phonemes. The results of objective and subjective evaluations show the effectiveness of our method.
Authors
Md Ashraful ISLAM Kenji KISE
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E105.D, no.9, pp.1506-1515, 2022-09-01 (Released:2022-09-01)
Number of references
36
Number of citations
1

To meet the increasing demands of computation, heterogeneous multicore architecture is believed to be a promising solution for fulfilling edge computational requirements. In FPGAs, the heterogeneous multicore is realized as multiple soft processor cores with custom processing elements. Since an FPGA is a resource-constrained device, sharing the hardware resources among the soft processor cores can be advantageous. A few research works have focused on resource sharing between soft processors, but they do not study how much FPGA logic is saved for different pipeline processors. This paper proposes the microarchitecture of four- and five-stage pipeline processors that enables the sharing of functional units for execution among multiple cores as well as the sharing of BRAM ports. We then investigate the performance and hardware resource utilization for a four-core processor. We find that sharing different functional units can save the LUT usage to 31.7% and DSP usage to 75%. We analyze the performance impact of sharing by simulating the Embench benchmark programs. Our simulation results indicate that in some cases the sharing improves the performance, while for other configurations the worst-case performance drop is 16.7%.
Authors
Yusuke HARA Xueting WANG Toshihiko YAMASAKI
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE TRANSACTIONS on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E104-D, no.8, pp.1349-1358, 2021-08-01
Number of citations
1

Video inpainting is the task of filling missing regions in videos. In this task, it is important to efficiently use information from other frames and generate plausible results with sufficient temporal consistency. In this paper, we present a video inpainting method that jointly uses affine transformation and deformable convolutions for frame alignment. The former is responsible for frame-scale rough alignment and the latter performs pixel-level fine alignment. Our model does not depend on 3D convolutions, which limit the temporal window, or on troublesome flow estimation. The proposed method achieves improved object removal results and better PSNR and SSIM values compared with previous learning-based methods.
Authors
Takashi NOSE Yuhei OTA Takao KOBAYASHI
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE TRANSACTIONS on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E93-D, no.9, pp.2483-2490, 2010-09-01
Number of citations
9

We propose a segment-based voice conversion technique using hidden Markov model (HMM)-based speech synthesis with nonparallel training data. In the proposed technique, the phoneme information with durations and a quantized F0 contour are extracted from the input speech of a source speaker, and are transmitted to a synthesis part. In the synthesis part, the quantized F0 symbols are used as prosodic context. A phonetically and prosodically context-dependent label sequence is generated from the transmitted phoneme and the F0 symbols. Then, converted speech is generated from the label sequence with durations using the target speaker's pre-trained context-dependent HMMs. In the model training, the models of the source and target speakers can be trained separately, hence there is no need to prepare parallel speech data of the source and target speakers. Objective and subjective experimental results show that the segment-based voice conversion with phonetic and prosodic contexts works effectively even if the parallel speech data is not available.
Authors
Pilsung KANG
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E102.D, no.8, pp.1565-1568, 2019-08-01 (Released:2019-08-01)
Number of references
15
Number of citations
1

We present an OpenACC-based parallelization implementation of stochastic algorithms for simulating biochemical reaction networks on modern GPUs (graphics processing units). To investigate the effectiveness of using OpenACC for leveraging the massive hardware parallelism of the GPU architecture, we carefully apply OpenACC's language constructs and mechanisms to implementing a parallel version of stochastic simulation algorithms on the GPU. Using our OpenACC implementation in comparison to both the NVidia CUDA and the CPU-based implementations, we report our initial experiences on OpenACC's performance and programming productivity in the context of GPU-accelerated scientific computing.
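Authors
POLIKOVSKY Senya KAMEDA Yoshinari OHTA Yuichi
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE transactions on information and systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E96.D, no.1, pp.81-92, 2013-01
Number of citations
51 1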

Facial micro-expressions are fast and subtle facial motions that are considered one of the most useful external signs for detecting hidden emotional changes in a person. However, they are not easy to detect and measure, as they appear only for a short time, with small muscle contractions in facial areas where salient features are not available. We propose a new computer vision method for detecting and measuring the timing characteristics of facial micro-expressions. The core of this method is a descriptor that combines pre-processing masks, histograms, and concatenation of spatial-temporal gradient vectors. The presented 3D gradient histogram descriptor is able to detect and measure the timing characteristics of fast and subtle changes of the facial skin surface. This method is specifically designed for the analysis of videos recorded using a high-speed 200 fps camera. The final classification of micro-expressions is done using a k-means classifier and a voting procedure. The Facial Action Coding System was utilized to annotate the appearance and dynamics of the expressions in our new high-speed micro-expression video database. The efficiency of the proposed approach was validated using this database.
Authors
Yasunobu TOYOTA Wataru MISHIMA Koichiro KANAYA Osamu NAKAMURA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E106.D, no.5, pp.927-939, 2023-05-01 (Released:2023-05-01)
Number of references
46

QoS of applications is essential for content providers, and it is required to improve the end-to-end communication quality from a content provider to users. Generally, a content provider's data center network is connected to multiple ASes and has multiple egress paths to reach the content user's network. However, on the Internet, the communication quality of network paths outside of the provider's administrative domain is a black box, so multiple egress paths cannot be quantitatively compared. In addition, it is impossible to determine a unique egress path within a network domain because the parameters that affect the QoS of the content are different for each network. We propose a “Performance Aware Egress Path Discovery” method to improve QoS for content providers. The proposed method uses two techniques: Egress Peer Engineering with Segment Routing over IPv6 and Passive End-to-End Measurement. The method is superior in that it allows various metrics depending on the type of content and can be used for measurements without affecting existing systems. To evaluate our method, we deployed the Performance Aware Egress Path Discovery System in an existing content provider network and conducted experiments to provide production services. Our findings from the experiment show that, in this network, 15.9% of users can expect a 30Mbps throughput improvement, and 13.7% of users can expect a 10ms RTT improvement.
Authors
Masanori MORISE Fumiya YOKOMORI Kenji OZAWA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, issue, pages and publication date
vol.E99.D, no.7, pp.1877-1884, 2016-07-01 (Released:2016-07-01)
Number of references
38
Number of citations
91 537

A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also quick processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real time factor (RTF) indicated that it was fast enough for real-time processing.
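The real time factor mentioned above is commonly defined as processing time divided by the duration of the audio being processed, so a value below 1.0 means the system keeps up with real time. The sketch below shows that definition; the function names and the stand-in `process` callable are hypothetical, and nothing here measures WORLD itself.

```python
import time


def real_time_factor(elapsed_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed_seconds / audio_seconds


def measure_rtf(process, samples, sample_rate):
    """Time a processing function on a sample buffer and return its RTF."""
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # Stand-in workload: one second of silence at 16 kHz, "processed" by summing.
    buffer = [0.0] * 16000
    print(measure_rtf(sum, buffer, 16000))
```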