- 著者
-
小林 由弥
鈴木 雅大
松尾 豊
- 出版者
- 一般社団法人 人工知能学会
- 雑誌
- 人工知能学会論文誌 (ISSN:13460714)
- 巻号頁・発行日
- vol.37, no.2, pp.I-L75_1-17, 2022-03-01 (Released:2022-03-01)
- 参考文献数
- 63
Ability to understand surrounding environment based on its components, namely objects, is one of the most important cognitive ability for intelligent agents. Human beings are able to decompose sensory input, i.e. visual stimulation, into some components based on its meaning or relationships between entities, and are able to recognize those components as “object ”. It is often said that this kind of compositional recognition ability is essential for resolving so called Binding Problem, and thus important for many tasks such as planning, decision making and reasoning. Recently, researches about obtaining object level representation in unsupervised manner using deep generative models have been gaining much attention, and they are called ”Scene Interpretation models”. Scene Interpretation models are able to decompose input scenes into symbolic entities such as objects, and represent them in a compositional way. The objective of our research is to point out the weakness of existing scene interpretation methods and propose some methods to improve them. Scene Interpretation models are trained in fully-unsupervised manner in contrast to latest methods in computer vision which are based on massive labeled data. Due to this problem setting, scene interpretation models lack inductive biases to recognize objects. Therefore, the application of these models are restricted to relatively simple toy datasets. It is widely known that introducing inductive biases to machine learning models is sometimes very useful like convolutional neural networks, but how to introduce them via training depends on the models and is not always obvious. In this research, we propose to incorporate self-supervised learning to scene interpretation models for introducing additional inductive bias to the models, and we also propose a model architecture using Transformer which is considered to be suitable for scene interpretation when combined with self-supervised learning. We show proposed methods outperforms previous methods, and is able to adopt to Multi-MNIST dataset which previous methods could not deal with well.