Author(s)
宮崎 和光, 吉田 望, 森 利枝
Publisher
一般社団法人 電気学会
Journal
電気学会論文誌C(電子・情報・システム部門誌) (ISSN:03854221)
Volume, issue, pages, and publication date
vol.142, no.2, pp.117-128, 2022-02-01 (Released:2022-02-01)
Number of references
32

In 2017, it became mandatory for universities in Japan to disclose their degree-granting policies (Diploma Policy: DP, hereafter), which state the standards for conferring degrees. Meanwhile, since 1991, the nomenclature of the major fields that appear in diplomas has been the responsibility of individual universities rather than a matter of national regulation. This study examines whether the former reasonably evokes the latter, given that both are deemed to represent the learning outcomes the graduate has obtained. To do so, we compared the ability of humans and that of a deep-learning system (based on a character-level CNN) to match randomly given DPs with major fields. In the examination of human ability, which was carried out with a large enough number of participants to obtain statistically significant results, we found a certain number of DPs that the majority of people failed to match with major fields. Given this fact, we analyzed such DPs to demonstrate that the deep-learning system shows a high success rate in sorting out the DPs that poorly evoke major fields.
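The deep-learning system above is described only as being based on a character-level CNN. The following is a minimal sketch of such a classifier under that assumption; the use of PyTorch, the vocabulary and field counts, and the class name CharCNNClassifier are our own illustrative choices, not details from the paper.

```python
# Minimal sketch (not the authors' implementation) of a character-level CNN
# classifier: characters are embedded, convolved with several filter widths,
# max-pooled, and mapped to major-field labels.
import torch
import torch.nn as nn


class CharCNNClassifier(nn.Module):
    def __init__(self, vocab_size, num_fields, embed_dim=32, num_filters=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Convolutions over the character sequence with a few window sizes.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in (3, 4, 5)]
        )
        self.fc = nn.Linear(num_filters * 3, num_fields)

    def forward(self, char_ids):                 # char_ids: (batch, seq_len)
        x = self.embed(char_ids)                 # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                    # Conv1d expects (batch, channels, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # logits over major fields


# Hypothetical usage: scores for two DP texts encoded as character IDs.
model = CharCNNClassifier(vocab_size=3000, num_fields=25)
logits = model(torch.randint(1, 3000, (2, 200)))
```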
Author(s)
宮崎 和光, 木村 元, 小林 重信
Publisher
社団法人人工知能学会
Journal
人工知能学会誌 (ISSN:09128085)
Volume, issue, pages, and publication date
vol.14, no.5, pp.800-807, 1999-09-01
Number of citations
42

1.1 Reinforcement learning from an engineering viewpoint. Reinforcement learning is a machine-learning framework in which an agent pursues an action-selection strategy adapted to its environment, guided by a special input called the reward. Two important characteristics of reinforcement learning are that (1) learning is reward-driven and (2) no prior knowledge of the environment is assumed. This means that the designer only has to express "what" is desired through the reward; the learning system itself acquires "how" to achieve it. A reinforcement learning system may therefore discover solutions beyond those conceived by its human designer. In addition, when part of the environment is known in advance, that knowledge can be incorporated; in this case the knowledge base may be incomplete or even contain some errors. Reinforcement learning also combines well with existing techniques such as neural networks and fuzzy systems, and it can track gradual changes in the environment. For these reasons, reinforcement learning is a highly attractive framework from the standpoint of engineering applications.
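As a concrete illustration of the reward-driven loop described above (the designer supplies only the reward; the policy is learned from interaction), here is a minimal sketch using tabular Q-learning on a hypothetical five-state chain. The environment, step size, discount, and exploration rate are illustrative assumptions, not taken from the article.

```python
# Minimal sketch of reward-driven learning: the only design input is the
# reward function; the "how to" (a policy) emerges from the updates.
import random

N_STATES, GOAL, ACTIONS = 5, 4, (+1, -1)          # toy chain world, reward only at GOAL
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)       # "what": reach the goal

for episode in range(200):
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection.
        a = random.choice(ACTIONS) if random.random() < 0.1 else max(ACTIONS, key=lambda a: Q[(s, a)])
        s2, r = step(s, a)
        # Q-learning update: the "how to" is acquired from reward alone.
        Q[(s, a)] += 0.1 * (r + 0.9 * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda a: Q[(2, a)]))      # learned action from mid-chain: +1
```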
Author(s)
荒井 幸代, 宮崎 和光, 小林 重信
Publisher
一般社団法人 人工知能学会
Journal
人工知能 (ISSN:21882266)
Volume, issue, pages, and publication date
vol.13, no.4, pp.609-618, 1998-07-01 (Released:2020-09-29)
Number of citations
1

Most multi-agent systems have been developed in the field of Distributed Artificial Intelligence (DAI), whose schemes rely on plentiful prior knowledge of the agents' world or on organized relationships among the agents. However, this kind of knowledge is not always available. Multi-agent reinforcement learning, on the other hand, is worth considering as a way to realize cooperative behavior among agents with little prior knowledge. There are two main problems to be considered in multi-agent reinforcement learning. One is the uncertainty of state transitions that arises from the concurrent learning of the agents; the other is the perceptual aliasing problem that generally holds in such a world. Robustness and flexibility against these two problems are therefore essential for multi-agent reinforcement learning. In this paper, we evaluate Q-learning and Profit Sharing as methods for multi-agent reinforcement learning through experiments on the Pursuit Problem, one instance of a multi-agent world. In the experiments, we do not assume any pre-defined relationship among agents or any control knowledge for cooperation. The learning agents do not share sensations, episodes, or policies; each agent learns through its own episodes, independently of the others. The results show that cooperative behaviors emerge clearly among Profit Sharing hunters, which are not disturbed by concurrent learning, even when the prey has a certain escape strategy against the hunters. Moreover, they behave rationally in perceptually aliased areas. Q-learning hunters, on the other hand, cannot form any policy in such a world. Through these experiments, we conclude that Profit Sharing has good properties for multi-agent reinforcement learning because of its robustness to changes in the other agents' policies and to the limitations of each agent's sensing abilities.
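To make the contrast concrete, the following is a minimal sketch of the Profit Sharing update evaluated above: the rules fired during an episode are reinforced with geometrically decreasing credit when a reward arrives, so the update does not rest on value estimates that other, concurrently learning agents keep changing. The data structures, function names, and the decay value are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a Profit Sharing agent's credit assignment.
from collections import defaultdict

weights = defaultdict(float)   # rule weights: (observation, action) -> strength
episode = []                   # rules fired since the last reward

def select_action(observation, actions):
    # Greedy on rule weights; unseen observations fall back to the first action.
    return max(actions, key=lambda a: weights[(observation, a)])

def record(observation, action):
    episode.append((observation, action))

def profit_share(reward, decay=0.5):
    # Distribute credit backwards from the reward along the stored episode.
    credit = reward
    for rule in reversed(episode):
        weights[rule] += credit
        credit *= decay
    episode.clear()
```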
Author(s)
宮崎 和光, 山村 雅幸, 小林 重信
Publisher
一般社団法人 人工知能学会
Journal
人工知能 (ISSN:21882266)
Volume, issue, pages, and publication date
vol.12, no.1, pp.78-89, 1997-01-01 (Released:2020-09-29)

Reinforcement learning is a kind of machine learning. It aims to adapt an agent to a given environment with rewards as the clue. Profit Sharing (PS) can obtain rewards efficiently in the initial learning phase, but it cannot always learn an optimum policy that maximizes the reward per action. Q-learning is guaranteed to obtain an optimum policy, but it needs numerous trials to learn it. On Markov decision processes (MDPs), if a correct environment model has been identified, we can derive an optimum policy by applying the Policy Iteration Algorithm (PIA). The k-Certainty Exploration Method has been proposed as an efficient method for identifying MDPs. We consider that an ideal reinforcement learning system should obtain some rewards even in the initial learning phase and obtain more rewards as the identification of the environment proceeds. In this paper, we propose a unified learning system, MarcoPolo, which considers both obtaining rewards by PS or PIA and identifying the environment by the k-Certainty Exploration Method. MarcoPolo can realize any tradeoff between exploitation and exploration throughout the learning process. By applying MarcoPolo to an example, its basic performance is shown. Moreover, by applying it to Sutton's maze problem and a modified version of it, its feasibility on more realistic domains is shown.
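For reference, the Policy Iteration Algorithm mentioned above can be sketched as follows once a model (transition probabilities and expected rewards) has been identified; the tiny two-state MDP and the NumPy formulation are illustrative assumptions, not part of MarcoPolo itself.

```python
# Minimal sketch of policy iteration on an identified MDP: alternate exact
# policy evaluation with greedy policy improvement until the policy is stable.
import numpy as np

P = np.array([[[0.8, 0.2], [0.2, 0.8]],     # P[s][a][s'] for 2 states, 2 actions
              [[0.9, 0.1], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                   # R[s][a]: expected immediate reward
              [0.5, 0.0]])
gamma, n_states = 0.9, 2

policy = np.zeros(n_states, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to V.
    Q = R + gamma * (P @ V)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print(policy, V)                             # optimum policy and its state values
```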
Author(s)
宮崎 和光, 坪井 創吾, 小林 重信
Publisher
一般社団法人 人工知能学会
Journal
人工知能学会論文誌 (ISSN:13460714)
Volume, issue, pages, and publication date
vol.16, no.2, pp.185-192, 2001 (Released:2002-02-28)
Number of references
10
Number of citations
1 7

Reinforcement learning is a kind of machine learning. It aims to adapt an agent to a given environment with rewards as the clue. In general, the purpose of a reinforcement learning system is to acquire an optimum policy that maximizes the expected reward per action. However, this is not always what matters in every environment. In particular, if we apply a reinforcement learning system to engineering environments, we expect the agent to avoid all penalties. In Markov Decision Processes, a pair of a sensory input and an action is called a rule. We call a rule a penalty rule if and only if it directly incurs a penalty or it can transit to a penalty state without contributing to obtaining any reward. After suppressing all penalty rules, we aim to make a rational policy whose expected reward per action is larger than zero. In this paper, we propose a penalty-suppressing algorithm that can suppress every penalty and obtain rewards constantly. By applying the algorithm to tick-tack-toe, its effectiveness is shown.
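One way to read the rule-suppression idea above is as a fixed-point marking procedure over rules and states. The sketch below is our own illustrative rendering of that reading, not the published algorithm; the function name, the transitions table, and the use of non-positive reward to stand for "no reward" are all assumptions.

```python
# Minimal sketch: iteratively mark penalty rules and penalty states until the
# marking stabilizes, then suppress the marked rules when building a policy.
def find_penalty_rules(states, actions, transitions, direct_penalty):
    """transitions[(s, a)] -> set of (next_state, reward) pairs;
    direct_penalty -> rules that directly incur a penalty."""
    penalty_rules = set(direct_penalty)
    penalty_states = set()
    changed = True
    while changed:
        changed = False
        # A state is a penalty state if every rule in it is a penalty rule.
        for s in states:
            if s not in penalty_states and all((s, a) in penalty_rules for a in actions):
                penalty_states.add(s)
                changed = True
        # A rule is a penalty rule if it can reach a penalty state without reward.
        for s in states:
            for a in actions:
                if (s, a) in penalty_rules:
                    continue
                if any(s2 in penalty_states and r <= 0
                       for s2, r in transitions[(s, a)]):
                    penalty_rules.add((s, a))
                    changed = True
    return penalty_rules   # rules to suppress before making a rational policy
```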