Author(s)
Yuki Saito Kohei Yatabe Shogun
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, pages, publication date
pp.e23.67, (Released:2023-12-02)
Number of references
11

Understanding of gameplay can enhance the experience and entertainment of video games. In this study, we propose utilizing the sound generated by a game controller to analyze gameplay information. Controller sound is a user-friendly feature related to gameplay because it can be recorded very easily. As a first step of the research, we identified characters of Super Smash Bros. Ultimate solely from controller sound, as an example task for examining whether controller sound contains valuable information. The results showed that our model achieved 79% accuracy in identifying five characters using only the controller sound.
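The abstract gives no implementation details, but the task it describes (extract acoustic features from controller-sound clips, then train a classifier over five character labels) can be sketched roughly as follows. The features, model, and data below are placeholders, not the authors' setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for MFCC-like feature vectors extracted from controller-sound
# clips: five "characters", each with a slightly different feature mean.
n_per_class, n_feats = 40, 13
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, n_feats))
               for c in range(5)])
y = np.repeat(np.arange(5), n_per_class)

# Hold out a test split and fit an off-the-shelf classifier.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

In the actual study the features would come from recorded audio rather than synthetic vectors, and the 79% figure refers to the authors' own model, not this sketch.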
Author(s)
Takaaki SAEKI Yuki SAITO Shinnosuke TAKAMICHI Hiroshi SARUWATARI
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, pages, publication date
vol.E104.D, no.7, pp.1002-1016, 2021-07-01 (Released:2021-07-01)
Number of references
41
Number of citations
2

This paper proposes two high-fidelity and computationally efficient neural voice conversion (VC) methods based on direct waveform modification using spectral differentials. The conventional spectral-differential VC method with a minimum-phase filter achieves high-quality conversion for narrow-band (16 kHz-sampled) VC but incurs a heavy computational cost in filtering. This is because the minimum phase obtained using a fixed lifter of the Hilbert transform often results in a long-tap filter. Furthermore, when the method is extended to full-band (48 kHz-sampled) VC, the computational cost grows with the increased number of sampling points, and the converted-speech quality degrades due to large fluctuations in the high-frequency band. To construct a short-tap filter, we propose a lifter-training method for data-driven phase reconstruction that trains a lifter of the Hilbert transform while taking filter truncation into account. We also propose a frequency-band-wise modeling method based on sub-band multi-rate signal processing (sub-band modeling method) for full-band VC. It enhances computational efficiency by reducing the sampling points of signals converted with filtering and improves converted-speech quality by modeling only the low-frequency band. We conducted several objective and subjective evaluations of the proposed methods through a real-time, online, full-band VC system we developed based on them. The results indicate that 1) the proposed lifter-training method for narrow-band VC can shorten the tap length to 1/16 without degrading converted-speech quality, 2) the proposed sub-band modeling method for full-band VC can improve converted-speech quality while reducing the computational cost, and 3) our real-time, online, full-band VC system can convert 48 kHz-sampled speech in real time, attaining a mean opinion score for naturalness of 3.6 out of 5.0.
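For reference, the "fixed lifter of the Hilbert transform" mentioned above is the standard homomorphic recipe for turning a log-magnitude spectrum into a minimum-phase spectrum. A minimal NumPy sketch of that fixed-lifter construction (not the authors' trained lifter) looks like this:

```python
import numpy as np

def minimum_phase_spectrum(log_mag):
    """Build a minimum-phase spectrum from a symmetric log-magnitude
    spectrum via the fixed cepstral lifter of the Hilbert transform."""
    n = len(log_mag)                          # even FFT length assumed
    cep = np.fft.ifft(log_mag).real           # real cepstrum
    lifter = np.zeros(n)
    lifter[0] = 1.0                           # keep the zeroth quefrency
    lifter[1:n // 2] = 2.0                    # double causal quefrencies
    lifter[n // 2] = 1.0                      # keep the Nyquist term
    return np.exp(np.fft.fft(cep * lifter))   # minimum-phase spectrum

# Toy symmetric log-magnitude spectrum (a smooth low-pass shape).
freqs = np.fft.fftfreq(64)
log_mag = -np.abs(freqs) * 4.0
H = minimum_phase_spectrum(log_mag)
```

The construction replaces only the phase: the magnitude of `H` equals `exp(log_mag)` exactly. The paper's contribution is to learn the lifter values (instead of the fixed 1/2/0 pattern) so that truncating the resulting filter to a short tap length loses little quality.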
Author(s)
Shinnosuke Takamichi Ryosuke Sonobe Kentaro Mitsui Yuki Saito Tomoki Koriyama Naoko Tanji Hiroshi Saruwatari
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, pages, publication date
vol.41, no.5, pp.761-768, 2020-09-01 (Released:2020-09-01)
Number of references
50
Number of citations
17

In this paper, we develop two corpora for speech synthesis research. Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we aim to develop Japanese voice corpora that are reasonably accessible not only to academic institutions but also to commercial companies. In this paper, we construct the JSUT and JVS corpora, designed mainly for text-to-speech synthesis and voice conversion, respectively. The JSUT corpus contains 10 hours of reading-style speech uttered by a single speaker, and the JVS corpus contains 30 hours of speech in three styles uttered by 100 speakers. This paper describes how we designed the corpora and summarizes their specifications. The corpora are available at our project pages.
Author(s)
Takahiro Yamauchi Yasuo Okumura Koichi Nagashima Ryuta Watanabe Yuki Saito Katsuaki Yokoyama Naoya Matsumoto Katsumi Miyauchi Sakiko Miyazaki Hidemori Hayashi Yuya Matsue Yuji Nishizaki Shuko Nojiri Tohru Minamino Hiroyuki Daida
Publisher
The Japanese Circulation Society
Journal
Circulation Journal (ISSN:13469843)
Volume, pages, publication date
pp.CJ-23-0318, (Released:2023-08-09)
Number of references
30
Number of citations
2

Background: The HELT-E2S2 score, which assigns 1 point to Hypertension, Elderly aged 75–84 years, Low body mass index <18.5 kg/m2, and Type of atrial fibrillation (AF: persistent/permanent), and 2 points to Extreme Elderly aged ≥85 years and previous Stroke, has been proposed as a new risk stratification for strokes in Japanese AF patients, but has not yet undergone external validation. Methods and Results: We evaluated the prognostic performance of the HELT-E2S2 score for stroke risk stratification using 2 large-scale registries of Japanese AF patients (n=7,020). During 23,241 person-years of follow-up (mean follow-up 1,208±450 days), 287 ischemic stroke events occurred. The C-statistic using the HELT-E2S2 score was 0.661 (95% confidence interval [CI], 0.629–0.692), which was numerically higher than with the CHADS2 score (0.644, 95% CI 0.613–0.675; P=0.15 vs. HELT-E2S2) or CHA2DS2-VASc score (0.650, 95% CI, 0.619–0.680; P=0.37 vs. HELT-E2S2). In the SAKURA AF Registry, the C-statistic of the HELT-E2S2 score was consistently higher than that of the CHADS2 and CHA2DS2-VASc scores across all 3 types of facilities comprising university hospitals, general hospitals, and clinics. However, in the RAFFINE Study, its superiority was only observed in general hospitals. Conclusions: The HELT-E2S2 score demonstrated potential value for risk stratification, particularly in a super-aged society such as Japan. However, its superiority over the CHADS2 or CHA2DS2-VASc scores may vary across different hospital settings.
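The scoring rule as stated in the abstract is mechanical, so it can be written down directly. The function below is only a transcription of that rule; the argument names are illustrative:

```python
def helt_e2s2_score(hypertension: bool, age: int, bmi: float,
                    persistent_or_permanent_af: bool,
                    previous_stroke: bool) -> int:
    """HELT-E2S2 score as described above: 1 point each for Hypertension,
    Elderly (75-84 years), Low BMI (<18.5 kg/m2), and Type of AF
    (persistent/permanent); 2 points each for Extreme elderly (>=85
    years) and previous Stroke."""
    score = 0
    score += 1 if hypertension else 0
    score += 1 if 75 <= age <= 84 else 0
    score += 1 if bmi < 18.5 else 0
    score += 1 if persistent_or_permanent_af else 0
    score += 2 if age >= 85 else 0
    score += 2 if previous_stroke else 0
    return score
```

For example, an 86-year-old hypertensive patient with a BMI of 17, persistent AF, and a prior stroke scores 1 + 1 + 1 + 2 + 2 = 7.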
Author(s)
Hiroki TAMARU Yuki SAITO Shinnosuke TAKAMICHI Tomoki KORIYAMA Hiroshi SARUWATARI
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, pages, publication date
vol.E103.D, no.3, pp.639-647, 2020-03-01 (Released:2020-03-01)
Number of references
32
Number of citations
3

This paper proposes a generative moment matching network (GMMN)-based post-filtering method for providing inter-utterance pitch variation to singing voices and discusses its application to our developed mixing method called neural double-tracking (NDT). When a human singer sings and records the same song twice, there is a difference between the two recordings. The difference, which is called inter-utterance variation, enriches the performer's musical expression and the audience's experience. For example, it makes every concert special because it never recurs in exactly the same manner. Inter-utterance variation enables a mixing method called double-tracking (DT). With DT, the same phrase is recorded twice, then the two recordings are mixed to give richness to singing voices. However, in synthesized singing voices, which are commonly used to create music, there is no inter-utterance variation because the synthesis process is deterministic. There is also no inter-utterance variation when only one voice is recorded. Although there is a signal processing-based method called artificial DT (ADT) to layer singing voices, the signal processing results in unnatural sound artifacts. To solve these problems, we propose a post-filtering method for randomly modulating synthesized or natural singing voices as if the singer sang again. The post-filter built with our method models the inter-utterance pitch variation of human singing voices using a conditional GMMN. Evaluation results indicate that 1) the proposed method provides perceptible and natural inter-utterance variation to synthesized singing voices and that 2) our NDT exhibits higher double-trackedness than ADT when applied to both synthesized and natural singing voices.
Author(s)
Yuki SAITO Kei AKUZAWA Kentaro TACHIBANA
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, pages, publication date
vol.E103-D, no.9, pp.1978-1987, 2020-09-01 (Released:2020-09-01)
Number of references
53

This paper presents a method for many-to-one voice conversion (VC) using phonetic posteriorgrams (PPGs) based on adversarial training of deep neural networks (DNNs). A conventional method for many-to-one VC can learn a mapping function from input acoustic features to target acoustic features through separately trained DNN-based speech recognition and synthesis models. However, 1) the differences among speakers observed in PPGs and 2) an over-smoothing effect of generated acoustic features degrade the converted speech quality. Our method performs domain-adversarial training of the recognition model to reduce the PPG differences. In addition, it incorporates a generative adversarial network into the training of the synthesis model to alleviate the over-smoothing effect. Unlike the conventional method, ours jointly trains the recognition and synthesis models so that they are optimized for many-to-one VC. Experimental evaluation demonstrates that the proposed method significantly improves the converted speech quality compared with conventional VC methods.
Author(s)
Seiichi Mokuyasu Risa Oshitanai Toru Morioka Yuki Saito Yasuhiro Suzuki
Publisher
Japanese Society of Laboratory Medicine
Journal
Laboratory Medicine International (ISSN:24368660)
Volume, pages, publication date
vol.2, no.3, pp.50-59, 2023 (Released:2023-12-29)
Number of references
30

Background: Absolute lymphocyte count (ALC) and the neutrophil-to-lymphocyte ratio (NLR), as immune-system and inflammatory markers, have been suggested as prognostic factors in eribulin treatment. However, the respective cut-off values have not been determined. Hence, we investigated the relationship between overall survival (OS) and baseline ALC (bALC) and baseline NLR (bNLR) in eribulin-treated patients with human epidermal growth factor receptor 2 (HER2)-negative breast cancer (BC) by using 2 cut-off values for each. Methods: Univariate and multivariate analyses were performed to investigate the association of bALC and bNLR with OS among 114 female patients with HER2-negative BC treated with eribulin. Results: The OS of patients with HER2-negative BC was compared based on bALC (cut-off values: 1,200/μL and 1,500/μL) and bNLR (cut-off values: 2 and 3). A significant difference was observed in median OS between patients with a bALC of ≥1,200/μL and those with a bALC of <1,200/μL (hazard ratio [HR]: 0.596 [0.395, 0.889], p = 0.014). For bNLR (cut-off value: 2), the median OS was significantly higher in patients with a bNLR of <2 than in those with a bNLR of ≥2 (HR: 0.629 [0.406, 0.974], p = 0.038). Conclusions: Patients with HER2-negative BC with a bALC of ≥1,200/μL showed longer OS than patients with a bALC of <1,200/μL, suggesting that survival prediction using bALC is effective for eribulin-treated patients with recurrent HER2-negative BC. It should be noted that the optimal cut-off value for ALC may change depending on the target patient group.
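NLR is simply the ratio of the absolute neutrophil count to the absolute lymphocyte count. Labeling patients against the cut-offs examined above could be sketched as follows; note that the study analyzed each marker separately, and the helper names here are illustrative:

```python
def nlr(neutrophils_per_ul: float, lymphocytes_per_ul: float) -> float:
    """Neutrophil-to-lymphocyte ratio from absolute counts."""
    return neutrophils_per_ul / lymphocytes_per_ul

def stratify(alc_per_ul: float, nlr_value: float,
             alc_cutoff: float = 1200.0, nlr_cutoff: float = 2.0) -> tuple:
    """Label a patient against an ALC cut-off (1,200/uL or 1,500/uL in
    the study) and an NLR cut-off (2 or 3 in the study), separately."""
    alc_group = "ALC>=cutoff" if alc_per_ul >= alc_cutoff else "ALC<cutoff"
    nlr_group = "NLR<cutoff" if nlr_value < nlr_cutoff else "NLR>=cutoff"
    return alc_group, nlr_group
```

For example, a patient with 4,000 neutrophils/μL and 2,000 lymphocytes/μL has an NLR of 2.0 and therefore falls into the higher-risk NLR group at the cut-off of 2.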
Author(s)
Yuki SAITO Shinnosuke TAKAMICHI Hiroshi SARUWATARI
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, pages, publication date
vol.E100.D, no.8, pp.1925-1928, 2017-08-01 (Released:2017-08-01)
Number of references
20
Number of citations
19

This paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output speech parameters from the input speech parameters. Given that the input and output are often in the same domain (e.g., cepstrum) in VC, this paper proposes a VC method using highway networks connected from input to output. The acoustic models predict the weighted spectral differentials between the input and output spectral parameters. The architecture not only alleviates the over-smoothing effects that degrade speech quality but also effectively represents the characteristics of spectral parameters. The experimental results demonstrate that the proposed architecture outperforms feed-forward neural networks in terms of the speech quality and speaker individuality of the converted speech.
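A highway connection computes y = T(x) * H(x) + (1 - T(x)) * x, where H is a transform path and T is a learned gate that decides how much of the input passes through unchanged. A toy NumPy forward pass, with untrained random weights and a deliberately negative gate bias to show the near-identity behavior, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Random (untrained) parameters for the transform H and the gate T.
W_h, b_h = rng.normal(size=(dim, dim)) * 0.1, np.zeros(dim)
W_t = rng.normal(size=(dim, dim)) * 0.1
b_t = np.full(dim, -20.0)  # very negative gate bias -> gate nearly closed

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def highway(x):
    h = np.tanh(x @ W_h + b_h)    # transform path H(x)
    t = sigmoid(x @ W_t + b_t)    # gate T(x) in (0, 1)
    return t * h + (1.0 - t) * x  # gated mix of transform and identity

x = rng.normal(size=dim)
y = highway(x)
```

With the gate nearly closed, the layer is almost the identity, which is exactly why a highway connection suits VC between features in the same domain: the network only needs to learn the (gated) differential from input to output. The paper's specific formulation, predicting weighted spectral differentials, is a refinement of this idea.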
Author(s)
Hidesato Fujito Yuki Saito Haruna Nishimaki Yusuke Hori Yasunari Ebuchi Hiroyuki Hao Yasuo Okumura
Publisher
International Heart Journal Association
Journal
International Heart Journal (ISSN:13492365)
Volume, pages, publication date
vol.62, no.2, pp.432-436, 2021-03-30 (Released:2021-03-30)
Number of references
18

Embolic myocardial infarction (MI) caused by infective endocarditis (IE) is rare, but it is increasingly recognized as an important complication. It typically occurs in patients with aortic valve endocarditis during the acute phase of the infection. It is also known to have a high mortality rate; however, the best practice for its management is unclear owing to the scarcity of available data. In addition, most cases of embolic acute MI (AMI) caused by IE are diagnosed indirectly through angiographic examinations such as coronary angiography or cardiac computed tomography. Herein, we report a case of fatal embolic ST-elevation MI (STEMI) caused by mitral valve IE during the healed phase, which was clearly proven by the pathology findings.
Author(s)
Yuki Saito Taiki Nakamura Yusuke Ijima Kyosuke Nishida Shinnosuke Takamichi
Publisher
ACOUSTICAL SOCIETY OF JAPAN
Journal
Acoustical Science and Technology (ISSN:13463969)
Volume, pages, publication date
vol.42, no.1, pp.1-11, 2021-01-01 (Released:2021-01-01)
Number of references
34
Number of citations
1

We propose a non-parallel, many-to-many voice conversion (VC) method using variational autoencoders (VAEs) that constructs VC models for converting arbitrary speakers' characteristics into those of other arbitrary speakers without parallel speech corpora for training. Although VAEs conditioned by one-hot coded speaker codes can achieve non-parallel VC, the phonetic contents of the converted speech tend to vanish, resulting in degraded speech quality. Another issue is that they cannot deal with unseen speakers not included in the training corpora. To overcome these issues, we incorporate deep-neural-network-based automatic speech recognition (ASR) and automatic speaker verification (ASV) into the VAE-based VC. Since phonetic contents are given as phonetic posteriorgrams predicted by the ASR models, the proposed VC can overcome the quality degradation. Our VC utilizes d-vectors extracted from the ASV models as continuous speaker representations that can deal with unseen speakers. Experimental results demonstrate that our VC outperforms the conventional VAE-based VC in terms of mel-cepstral distortion and converted speech quality. We also investigate the effects of hyperparameters in our VC and reveal that 1) a large d-vector dimensionality that gives better ASV performance does not necessarily improve converted speech quality, and 2) a large number of pre-stored speakers improves the quality.
Author(s)
Yuki Saito Mahoto Kato Koichi Nagashima Koyuru Monno Yoshihiro Aizawa Yasuo Okumura Naoki Matsumoto Mitsuhiko Moriyama Atsushi Hirayama
Publisher
The Japanese Circulation Society
Journal
Circulation Journal (ISSN:13469843)
Volume, pages, publication date
vol.82, no.7, pp.1822-1829, 2018-06-25 (Released:2018-06-25)
Number of references
29
Number of citations
18 43

Background: Acute decompensated heart failure (ADHF) is often accompanied by liver congestion through increased right atrial pressure (RAP). Liver stiffness (LS) assessed non-invasively using transient elastography is related to increased RAP and liver congestion in patients with general HF. We investigated the relationship of LS with clinical and echocardiographic variables and outcomes in patients with ADHF. Methods and Results: The subjects were 105 patients with ADHF admitted to hospital between October 2016 and June 2017. Patients were divided into 2 groups based on median LS at admission (low LS <8.8 kPa [n=52] vs. high LS ≥8.8 kPa [n=53]). Death from cardiovascular disease and readmission for HF were the primary endpoints. Total bilirubin and γ-glutamyl transpeptidase levels, MELD-XI score, diameters of the inferior vena cava and right ventricle, and severity of tricuspid regurgitation were greater in the high-LS group (all P<0.05). During a median (interquartile range) follow-up period of 153 (83–231) days, cardiac events occurred in 29 patients (54%) in the high-LS group and in 13 (25%) in the low-LS group (P=0.001). After adjusting for variables that influence organ congestion, a high LS ≥8.8 kPa remained significantly associated with cardiac events (all P<0.05). Conclusions: Increased LS measured by transient elastography reflects RAP elevation, hepatic congestion, and hepatic dysfunction. LS upon admission may be a useful prognostic marker in patients with ADHF.
Author(s)
Satoshi MIZOGUCHI Yuki SAITO Shinnosuke TAKAMICHI Hiroshi SARUWATARI
Publisher
The Institute of Electronics, Information and Communication Engineers
Journal
IEICE Transactions on Information and Systems (ISSN:09168532)
Volume, pages, publication date
vol.E104-D, no.11, pp.1971-1980, 2021-11-01
Number of citations
1

We propose deep neural network (DNN)-based speech enhancement that reduces musical noise and achieves better auditory impressions. Musical noise is an artifact generated by nonlinear signal processing that negatively affects auditory impressions. We aim to develop musical-noise-free speech enhancement methods that suppress musical-noise generation and produce perceptually comfortable enhanced speech. DNN-based speech enhancement using a soft mask achieves high noise reduction but generates musical noise in non-speech regions. Therefore, we first define kurtosis matching for DNN-based low-musical-noise speech enhancement. Kurtosis is the fourth-order standardized moment and is known to correlate with the amount of musical noise. Kurtosis matching is a penalty term in the DNN training that works to reduce the amount of musical noise. We further extend this scheme to standardized-moment matching, which uses moments of order higher than kurtosis and generalizes the conventional musical-noise-free method based on kurtosis matching. We formulate standardized-moment matching and explore how effectively the higher-order moments reduce the amount of musical noise. Experimental evaluation results 1) demonstrate that kurtosis matching can reduce musical noise without negatively affecting noise suppression and 2) newly reveal that sixth-order-moment matching achieves low-musical-noise speech enhancement as effectively as kurtosis matching.
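The standardized k-th moment of a signal x is E[((x - mean)/std)^k]; order 4 is kurtosis. A moment-matching penalty in the spirit described above could be sketched like this (a toy NumPy illustration, not the authors' DNN training code):

```python
import numpy as np

def standardized_moment(x, order):
    """E[((x - mean) / std) ** order]; order 4 is kurtosis."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** order)

def moment_matching_penalty(enhanced, reference, order=4):
    """Squared mismatch of standardized moments between the enhanced
    signal and a reference; added as a training penalty so that excess
    musical noise (which raises kurtosis) is discouraged."""
    return (standardized_moment(enhanced, order)
            - standardized_moment(reference, order)) ** 2
```

By construction the order-2 standardized moment is always 1, the order-3 moment vanishes for symmetric data, and the penalty is zero when the enhanced and reference signals share the same moment; in the paper this penalty is differentiated through the DNN alongside the ordinary enhancement loss.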