A Concept of Unbiased Deep Deterministic Policy Gradient for Better Convergence in Bipedal Walker

Timur Ishuov, Zhenis Otarbay, Michele Folgheraiter

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

After a brief overview of convergence issues in the Deep Deterministic Policy Gradient (DDPG) algorithm, which is based on the Deterministic Policy Gradient (DPG), we put forward a non-obvious hypothesis: 1) DDPG can be viewed as an on-policy learning and acting algorithm if we treat the rewards from a mini-batch sample as a relatively stable average reward over a limited time period and a fixed Target Network as a fixed actor and critic for that period, and 2) overestimation in DDPG with a fixed Target Network within a specified time may not be out-of-boundary behavior for low-dimensional tasks but rather a process of reaching regions close to the real Q value's average before converging to better Q values. To show empirically that DDPG with a fixed or stable Target may not exceed Q value limits during training in OpenAI's Pendulum-v1 environment, we simplified the ideas of Backward Q-learning, which combines on-policy and off-policy learning, calling this concept the unbiased Deep Deterministic Policy Gradient (uDDPG) algorithm. In uDDPG we separately train the Target Network on actual Q values, i.e. discounted returns, between episodes (hence 'unbiased' in the name). uDDPG is an anchored version of DDPG. We also use a simplified Advantage, the difference between the current Q Network gradient over actions and a simple moving average of this gradient, when updating the Action Network. Our purpose is to eventually introduce a less biased, more stable version of DDPG. A uDDPG variant (DDPG-II) with a function 'supernaturally' obtained during experiments, which damps weaker fluctuations during policy updates, showed promising convergence results.
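The two quantities the abstract relies on can be sketched in a few lines: the "actual Q values" are Monte Carlo discounted returns computed backward over a finished episode, and the simplified Advantage is the current dQ/da gradient minus its simple moving average. The following is a minimal illustrative sketch, not the authors' implementation; the function names and the `window` parameter are assumptions for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo returns G_t = r_t + gamma * G_{t+1}, computed backward
    over one episode -- the 'actual Q values' on which uDDPG's Target
    Network would be regressed between episodes (illustrative sketch)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_signal(grad_history, window=10):
    """Simplified 'Advantage': the latest Q-gradient-over-actions minus
    its simple moving average over the last `window` updates (the window
    size is a hypothetical choice, not taken from the paper)."""
    grads = np.asarray(grad_history, dtype=float)
    sma = grads[-window:].mean()
    return grads[-1] - sma
```

For example, with `gamma=0.5` the rewards `[1, 1, 1]` yield returns `[1.75, 1.5, 1.0]`, and a gradient history `[1, 2, 3]` with `window=3` gives an advantage of `3 - 2 = 1`.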

Original language: English
Title of host publication: SIST 2022 - 2022 International Conference on Smart Information Systems and Technologies, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665467902
DOIs
Publication status: Published - 2022
Event: 2022 International Conference on Smart Information Systems and Technologies, SIST 2022 - Nur-Sultan, Kazakhstan
Duration: Apr 28 2022 - Apr 30 2022

Publication series

Name: SIST 2022 - 2022 International Conference on Smart Information Systems and Technologies, Proceedings

Conference

Conference: 2022 International Conference on Smart Information Systems and Technologies, SIST 2022
Country/Territory: Kazakhstan
City: Nur-Sultan
Period: 4/28/22 - 4/30/22

Keywords

  • Deep Deterministic Policy Gradient
  • Reinforcement Learning

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Information Systems
  • Health Informatics

