- Authors
-
高橋 達二
甲野 佑
浦上 大輔
- Publisher
- 一般社団法人 人工知能学会
- Journal
- 人工知能学会論文誌 (ISSN:13460714)
- Volume, issue, pages / publication date
- vol.31, no.6, pp.AI30-M_1-11, 2016-11-01 (Released:2016-12-26)
- Number of references
- 26
- Number of citing works
-
3
As the scope of reinforcement learning broadens, the numbers of possible states and of executable actions, and hence the size of the product of the two sets, explode. Because of the physical and computational constraints imposed on agents, there are often more feasible options than allowed trials. In such cases, optimization procedures that first require trying every option once do not work. This is the situation that the theory of bounded rationality was proposed to deal with. We formalize satisficing, the central heuristic of bounded rationality theory. Instead of the traditional formulation of satisficing at the policy level in terms of reinforcement learning, we introduce a value function that implements the asymmetric risk attitudes characteristic of human cognition. Operated under the simple greedy policy, the RS (reference satisficing) value function enables efficient satisficing in K-armed bandit problems, and when the reference level for satisficing is set at an appropriate value, it leads to effective optimization. RS is also tested in a robotic motion learning task in which a robot (acrobot) learns to perform giant swings. While the standard algorithms fail because of the coarse-grained state space, RS shows stable performance and explores autonomously, without the randomized exploration and gradual annealing that the standard methods require.
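To make the idea of a satisficing value function under a greedy policy concrete, the following is a minimal sketch on a Bernoulli K-armed bandit. The abstract does not state the formula, so the form used here, value_i = (n_i / N) * (E_i - R), with n_i the number of pulls of arm i, N the total number of pulls, E_i the empirical mean reward, and R the reference level, is an assumption made for illustration. The function names `rs_values` and `run_bandit`, the arm probabilities, and the reference level 0.65 are likewise hypothetical.

```python
import random


def rs_values(counts, means, reference):
    """Assumed RS-style value: (n_i / N) * (E_i - R) for each arm.

    This form is an illustrative assumption; the abstract gives no formula.
    """
    total = sum(counts)
    if total == 0:
        # Nothing tried yet: every arm has value 0.
        return [0.0] * len(counts)
    return [(n / total) * (m - reference) for n, m in zip(counts, means)]


def run_bandit(probs, reference, steps, seed=0):
    """Run a purely greedy policy on the assumed RS values.

    probs holds the (hypothetical) Bernoulli success probability of each arm.
    No epsilon-greedy noise or annealing schedule is used.
    """
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k
    means = [0.0] * k
    for _ in range(steps):
        values = rs_values(counts, means, reference)
        # Greedy choice; ties are broken in favour of the lowest index.
        arm = max(range(k), key=lambda i: values[i])
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the empirical mean reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


if __name__ == "__main__":
    counts, means = run_bandit([0.2, 0.5, 0.8], reference=0.65, steps=1000)
    print("pull counts:", counts)
    print("empirical means:", means)
```

Under this assumed form, the asymmetry comes from the sign of E_i - R. While every tried arm looks worse than the reference, all tried arms have negative values and the least-tried ones sit closest to zero, so the greedy choice keeps moving to under-sampled arms. Once some arm's mean exceeds the reference, its value is positive and grows with its share of pulls, so the greedy choice tends to settle on it.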