Authors
長崎 好輝 林 昌希 金子 直史 青木 義満
Publisher
The Japan Society for Precision Engineering
Journal
Journal of the Japan Society for Precision Engineering (ISSN:09120289)
Volume/issue/pages, publication date
vol.88, no.3, pp.263-268, 2022-03-05 (Released:2022-03-05)
Number of references
10

In this paper, we propose a new method for audio-visual event localization 1), which finds the segments in which an audio event and the corresponding visual event co-occur. While previous methods use Long Short-Term Memory (LSTM) networks to extract temporal features, recurrent neural networks such as LSTMs cannot precisely learn long-term temporal dependencies. We therefore propose a Temporal Cross-Modal Attention (TCMA) module, which extracts temporal features from the two modalities more precisely. Inspired by the success of attention modules in capturing long-term dependencies, TCMA incorporates self-attention. With this module, our method localizes audio-visual events precisely and achieves higher accuracy than previous work.
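The abstract does not spell out the module's internals, so the following is only a minimal sketch of how cross-modal attention over per-segment features, followed by temporal self-attention, could be wired up in PyTorch. The class name, layer layout, and fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a temporal cross-modal attention block (PyTorch).
# All names and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn

class TemporalCrossModalAttention(nn.Module):
    """Cross-modal attention between per-segment audio and visual
    features, followed by self-attention over the time axis so each
    segment can draw on long-range context."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, num_segments, dim) per-segment features.
        a, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        fused = self.norm1(a + v)                        # merge attended streams
        out, _ = self.temporal_self_attn(fused, fused, fused)
        return self.norm2(out + fused)                   # residual + norm

# Usage: fuse 10 one-second segments, then feed a segment-level classifier.
tcma = TemporalCrossModalAttention(dim=256)
audio_feats = torch.randn(2, 10, 256)
visual_feats = torch.randn(2, 10, 256)
segment_feats = tcma(audio_feats, visual_feats)          # (2, 10, 256)
```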
Authors
伊東 聖矢 金子 直史 鷲見 和彦
Publisher
The Japan Society for Precision Engineering
Journal
Journal of the Japan Society for Precision Engineering (ISSN:09120289)
Volume/issue/pages, publication date
vol.86, no.12, pp.1042-1050, 2020-12-05 (Released:2020-12-05)
Number of references
35

Recent learning-based multi-view stereo (MVS) approaches have shown excellent performance. These approaches typically train a deep neural network to estimate dense depth maps from multiple images. However, most of them require large-scale dense depth maps as supervisory signals during training. This paper proposes a self-supervised learning framework for MVS that learns to estimate dense depth maps from multiple images without dense depth supervision. Taking an arbitrary number of images as input, we produce sparse depth maps using structure from motion and use them as self-supervision. We apply reconstruction and smoothness losses to regions that have no sparse depth. For stable training, we introduce a pseudo-depth loss, defined as the difference between the depth maps estimated by the network with its current and past parameters. Experimental results on multiple datasets demonstrate the effectiveness of our self-supervised learning framework.
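As a rough illustration of how the supervision described above could fit together, here is a minimal PyTorch sketch of the four loss terms. The function names, the edge-aware smoothness form, and the masking details are assumptions for illustration, not the paper's exact formulation; the view warping needed for the reconstruction term is omitted.

```python
# A minimal sketch of the losses named in the abstract (PyTorch).
# Names, loss forms, and masking are illustrative assumptions.
import torch
import torch.nn.functional as F

def sparse_depth_loss(pred, sfm_depth, sparse_mask):
    """L1 loss against SfM depth at pixels that received a sparse point."""
    return F.l1_loss(pred[sparse_mask], sfm_depth[sparse_mask])

def reconstruction_loss(ref_image, warped_src, no_sparse_mask):
    """Photometric difference between the reference image and a source
    image warped into the reference view with the predicted depth
    (warping is computed elsewhere); applied where no sparse depth exists."""
    diff = (ref_image - warped_src).abs().mean(1, keepdim=True)  # (B,1,H,W)
    return diff[no_sparse_mask].mean()

def smoothness_loss(pred, image, no_sparse_mask):
    """Edge-aware smoothness: penalize depth gradients less where the
    image itself has strong gradients; applied where no sparse depth exists."""
    dx = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dy = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    wx = torch.exp(-(image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True))
    return (dx * wx)[no_sparse_mask[..., :, 1:]].mean() + \
           (dy * wy)[no_sparse_mask[..., 1:, :]].mean()

def pseudo_depth_loss(pred, past_pred):
    """Consistency with the depth map predicted by a past snapshot of the
    network; detaching the past prediction makes it a fixed target."""
    return F.l1_loss(pred, past_pred.detach())

# Shapes: pred, past_pred, sfm_depth: (B, 1, H, W); images: (B, 3, H, W);
# masks: boolean (B, 1, H, W). A weighted sum of the four terms would give
# the total training loss.
```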