Presentation Information
[4Yin-B-21] Target-Oriented Exploration Based on the Reliability of Future Estimation
〇Takeshi Usami2, Tatsuji Takahashi1, Yu Kono1 (1. School of Science and Engineering, Tokyo Denki University, 2. Graduate School of Tokyo Denki University)
Keywords:
Reinforcement Learning, Machine Learning, Cognitive Science
Humans do not always pursue optimal outcomes; rather, they tend toward satisficing, seeking to achieve and maintain a certain aspiration level. Risk-sensitive Satisficing (RS), which brings this human tendency into reinforcement learning, balances the aggressiveness of exploration using a confidence score defined as the action-selection ratio. The same confidence score is used in Regional Stochastic Risk-sensitive Satisficing (RS^2), which extends RS to deep reinforcement learning. In deep reinforcement learning, however, the state-action space is vast and complex, so current implementations approximate the confidence score from experience in states neighboring the current state. This study proposes an algorithm that updates the confidence score along trajectories, using an update scheme analogous to that of a value function, yielding a learning rule that accounts for future outcomes. In our experiments, the proposed method outperformed conventional methods on toy tasks.
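The RS action rule referenced above is often stated, in the bandit setting, as choosing the action maximizing τ_a(Q_a − ℵ), where τ_a is the action-selection ratio (the confidence score) and ℵ is the aspiration level. A minimal sketch under that assumed formulation is below; the function name `rs_select` and its arguments are illustrative, and this is the basic tabular rule, not the trajectory-based update proposed in the abstract.

```python
import numpy as np

def rs_select(q, n, aleph):
    """Risk-sensitive Satisficing (RS) action selection for a bandit.

    Assumed common formulation: pick argmax_a tau_a * (Q_a - aleph),
    where tau_a = n_a / sum(n) is the action-selection ratio (the
    "confidence score") and aleph is the aspiration level.  When some
    Q_a exceeds aleph, a larger tau_a (a well-tried action) wins, so the
    agent exploits; when every Q_a falls below aleph, the least negative
    product belongs to the smallest tau_a, so the agent explores
    under-tried actions.
    """
    q = np.asarray(q, dtype=float)
    n = np.asarray(n, dtype=float)
    tau = n / max(n.sum(), 1.0)  # selection ratios; all zero before any pulls
    return int(np.argmax(tau * (q - aleph)))

# Satisfied case: arm 1's estimate exceeds the aspiration level, so exploit it.
print(rs_select([0.2, 0.8], [10, 10], 0.5))
# Unsatisfied case: both estimates are below aleph, so explore the
# less-tried arm despite its similar estimate.
print(rs_select([0.2, 0.3], [100, 1], 0.5))
```

The abstract's contribution replaces the locally approximated confidence score with one propagated along trajectories, analogously to a value-function (bootstrapped) update, so that the reliability of future estimates informs exploration.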
