Authors
小林 優佳 久島 務嗣 吉田 尚水 藤村 浩司 岩田 憲治
Publisher
The Japanese Society for Artificial Intelligence
Journal
Transactions of the Japanese Society for Artificial Intelligence (ISSN:13460714)
Volume, issue, pages, and publication date
vol.37, no.3, pp.IDS-D_1-14, 2022-05-01 (Released:2022-05-01)
Number of references
63

This paper proposes a new method for slot filling of unknown slot values (i.e., those not included in the training data) in spoken dialogue systems. Slot filling detects slot values from user utterances and handles named entities such as product and restaurant names. In the real world, there is a steady stream of new named entities, and it is infeasible to add all of them to the training data. Users will therefore inevitably input utterances with unknown slot values, and spoken dialogue systems must estimate them correctly. We provide a value detector that detects keywords representing slot values regardless of slot, and a slot estimator that estimates the slot for each detected keyword. Context information is an important clue for estimating slot values because the values of a given slot tend to appear in similar contexts. The value detector is trained on positive samples in which keywords corresponding to slot values are replaced with random words, which forces it to rely on context information. However, any approach that can detect unknown slot values may produce false alarms: the features of unknown slot values are unseen, so it is difficult to distinguish keywords of unknown slot values from non-keywords, i.e., words that do not correspond to slot values. We therefore introduce a negative sampling method that randomly replaces keywords with non-keywords, which allows the slot estimator to learn to reject non-keywords. Experimental results show that the proposed method achieves relative improvements in F1 score of 6, 15, and 78% over an existing model on three datasets, respectively.
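The sample-construction idea in the abstract can be sketched minimally as follows. This is an illustrative reconstruction, not the authors' code: the word lists, function names, and example utterance are all made up, and a real system would operate on labeled token spans from training data.

```python
import random

# Hypothetical vocabularies for illustration only.
RANDOM_WORDS = ["apple", "tunnel", "velvet"]   # stand-ins for slot values
NON_KEYWORDS = ["please", "tomorrow", "nearby"]  # words that are not slot values

def make_positive(tokens, kw_start, kw_end):
    """Positive sample: replace the keyword span with a random word,
    so the detector must rely on the surrounding context."""
    return tokens[:kw_start] + [random.choice(RANDOM_WORDS)] + tokens[kw_end:]

def make_negative(tokens, kw_start, kw_end):
    """Negative sample: replace the keyword span with a non-keyword,
    so the slot estimator can learn to reject such spans."""
    return tokens[:kw_start] + [random.choice(NON_KEYWORDS)] + tokens[kw_end:]

utt = ["book", "a", "table", "at", "Sukiyabashi", "Jiro"]
pos = make_positive(utt, 4, 6)  # context kept, slot value randomized
neg = make_negative(utt, 4, 6)  # span should now be rejected as a non-keyword
```

The key design point the abstract describes is that both sample types share the same context, so the model's decision hinges on whether the substituted span behaves like a slot value in that context.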
Authors
吉田 尚水 中臺 一博 奥乃 博
Publisher
The Robotics Society of Japan
Journal
Journal of the Robotics Society of Japan (ISSN:02891824)
Volume, issue, pages, and publication date
vol.28, no.8, pp.970-977, 2010

Noise-robust Automatic Speech Recognition (ASR) is essential for robots expected to communicate with humans in daily environments. In such environments, Voice Activity Detection (VAD) performance becomes poor, and ASR performance deteriorates due to noise and VAD failures. Humans are said to cope with these problems by using visual information, such as lip reading, to improve speech recognition. We therefore propose a two-layered audio-visual (AV) integration framework for VAD and ASR. The framework includes three crucial components. The first is Audio-Visual Voice Activity Detection (AV-VAD) based on a Bayesian network. The second is a new lip-related visual feature that is robust to visual noise. The last is microphone array processing to improve the Signal-to-Noise Ratio (SNR) of the input signal. We implemented a prototype audio-visual speech recognition system based on the proposed framework using HARK, our robot audition system. Through voice activity detection and speech recognition experiments, we show the effectiveness of audio-visual integration, microphone array processing, and their combination for VAD and ASR. Preliminary results show that our system improves ASR performance by 20 and 9.7 points with and without microphone array processing, respectively, and also improves robustness under several auditory/visual noise conditions.
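The Bayesian fusion behind AV-VAD can be sketched in its simplest form as combining audio and visual likelihoods under a conditional-independence assumption (naive Bayes). This is a minimal illustrative sketch, not the paper's Bayesian network: the function name, likelihood values, and prior below are all hypothetical.

```python
def av_vad_posterior(p_audio_given_voice, p_audio_given_silence,
                     p_visual_given_voice, p_visual_given_silence,
                     prior_voice=0.5):
    """Posterior probability of voice activity from audio and visual
    observations, assuming they are conditionally independent given
    the voice/silence state."""
    joint_voice = p_audio_given_voice * p_visual_given_voice * prior_voice
    joint_silence = (p_audio_given_silence * p_visual_given_silence
                     * (1.0 - prior_voice))
    return joint_voice / (joint_voice + joint_silence)

# Ambiguous audio (noisy scene) but lips clearly moving:
# the fused posterior favors "voice", which an audio-only
# detector at these likelihoods would barely decide.
p = av_vad_posterior(0.55, 0.45, 0.9, 0.1)
is_voice = p > 0.5
```

This illustrates why the combination helps: when one modality is degraded by noise, the other can still pull the posterior toward the correct decision.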