TY - GEN
T1 - A Concept of Unbiased Deep Deterministic Policy Gradient for Better Convergence in Bipedal Walker
AU - Ishuov, Timur
AU - Otarbay, Zhenis
AU - Folgheraiter, Michele
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - After a quick overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) algorithm, which is based on the Deterministic Policy Gradient (DPG), we put forward a non-obvious hypothesis that 1) DDPG can be viewed as an on-policy learning and acting algorithm if we treat the rewards from a mini-batch sample as a relatively stable average reward over a limited time period and a fixed Target Network as a fixed actor and critic for that period, and 2) overestimation in DDPG with a Target Network fixed over a specified time may not be out-of-boundary behavior for low-dimensional tasks but rather a process of reaching regions close to the real Q value's average before converging to better Q values. To empirically show that DDPG with a fixed or stable Target may not exceed Q value limits during training in OpenAI's Pendulum-v1 environment, we simplified ideas from Backward Q-learning, which combines on-policy and off-policy learning, calling this concept the unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. In uDDPG we separately train the Target Network on actual Q values, i.e., discounted rewards, between episodes (hence 'unbiased' in the abbreviation); uDDPG is thus an anchored version of DDPG. We also use a simplified Advantage, the difference between the current Q Network gradient over actions and its current simple moving average, when updating the Action Network. Our purpose is to eventually introduce a less biased, more stable version of DDPG. A uDDPG variant (DDPG-II), with a function obtained 'supernaturally' during experiments that damps weaker fluctuations during policy updates, showed promising convergence results.
AB - After a quick overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) algorithm, which is based on the Deterministic Policy Gradient (DPG), we put forward a non-obvious hypothesis that 1) DDPG can be viewed as an on-policy learning and acting algorithm if we treat the rewards from a mini-batch sample as a relatively stable average reward over a limited time period and a fixed Target Network as a fixed actor and critic for that period, and 2) overestimation in DDPG with a Target Network fixed over a specified time may not be out-of-boundary behavior for low-dimensional tasks but rather a process of reaching regions close to the real Q value's average before converging to better Q values. To empirically show that DDPG with a fixed or stable Target may not exceed Q value limits during training in OpenAI's Pendulum-v1 environment, we simplified ideas from Backward Q-learning, which combines on-policy and off-policy learning, calling this concept the unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. In uDDPG we separately train the Target Network on actual Q values, i.e., discounted rewards, between episodes (hence 'unbiased' in the abbreviation); uDDPG is thus an anchored version of DDPG. We also use a simplified Advantage, the difference between the current Q Network gradient over actions and its current simple moving average, when updating the Action Network. Our purpose is to eventually introduce a less biased, more stable version of DDPG. A uDDPG variant (DDPG-II), with a function obtained 'supernaturally' during experiments that damps weaker fluctuations during policy updates, showed promising convergence results.
KW - Deep Deterministic Policy Gradient
KW - Reinforcement Learning
UR - http://www.scopus.com/inward/record.url?scp=85143434756&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143434756&partnerID=8YFLogxK
U2 - 10.1109/SIST54437.2022.9945743
DO - 10.1109/SIST54437.2022.9945743
M3 - Conference contribution
AN - SCOPUS:85143434756
T3 - SIST 2022 - 2022 International Conference on Smart Information Systems and Technologies, Proceedings
BT - SIST 2022 - 2022 International Conference on Smart Information Systems and Technologies, Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 International Conference on Smart Information Systems and Technologies, SIST 2022
Y2 - 28 April 2022 through 30 April 2022
ER -