UPM Institutional Repository

Dual experience replay enhanced deep deterministic policy gradient for efficient continuous data sampling


Citation

Mohd Aris, Teh Noranis and Chen, Ningning and Mustapha, Norwati and Zolkepli, Maslina (2025) Dual experience replay enhanced deep deterministic policy gradient for efficient continuous data sampling. PLOS ONE, 20, art. no. e0334411, pp. 1-18. ISSN 1932-6203

Abstract

To address inefficient sample utilization and policy instability in asynchronous distributed reinforcement learning, we propose TPDEB, a dual experience replay framework that integrates prioritized sampling and temporal diversity. While recent distributed RL systems scale well, they often suffer from instability and inefficient sampling under network-induced delays and stale policy updates, highlighting a gap in robust learning under asynchronous conditions. TPDEB addresses these limitations through two key mechanisms: a trajectory-level prioritized replay buffer that captures temporally coherent, high-value experiences, and KL-regularized learning that constrains policy drift across actors. Unlike prior approaches that rely on a single experience buffer, TPDEB employs a dual-buffer strategy that combines a standard and a prioritized replay buffer, enabling a better trade-off between unbiased sampling and value-driven prioritization and improving learning robustness under asynchronous actor updates. By coordinating dual-buffer updates across distributed agents, TPDEB improves convergence speed and robustness, offering a scalable solution for real-world continuous control tasks. TPDEB also collects more diverse and redundant experience by scaling parallel actor replicas. Empirical evaluations on MuJoCo continuous control benchmarks demonstrate that TPDEB outperforms baseline distributed algorithms in both convergence speed and final performance, especially under constrained actor–learner bandwidth. Ablation studies validate the contribution of each component: trajectory-level prioritization captures high-quality samples more effectively than step-wise methods, and KL regularization enhances stability across asynchronous updates. These findings support TPDEB as a practical and scalable solution for distributed reinforcement learning systems.
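The two mechanisms summarized above can be illustrated with a short sketch. The Python snippet below shows one plausible reading of the dual-buffer strategy: a standard FIFO buffer sampled uniformly, alongside a trajectory-level prioritized buffer, mixed within a single batch. This is a minimal sketch under stated assumptions, not the paper's implementation; the class name TPDEBReplay, the mix_ratio parameter, the trajectory cap, and the use of mean absolute TD error as the trajectory priority are all illustrative choices.

    import random
    from collections import deque

    class TPDEBReplay:  # hypothetical name, for illustration only
        def __init__(self, capacity=100_000, max_trajs=1_000, mix_ratio=0.5):
            self.uniform = deque(maxlen=capacity)  # unbiased FIFO step buffer
            self.prioritized = []                  # list of (priority, trajectory)
            self.max_trajs = max_trajs             # cap on stored trajectories
            self.mix_ratio = mix_ratio             # fraction of batch drawn by priority

        def add_trajectory(self, trajectory, td_errors):
            # Store steps in the uniform buffer; score the whole trajectory
            # by mean |TD error| (an assumed priority definition).
            self.uniform.extend(trajectory)
            priority = sum(abs(e) for e in td_errors) / max(len(td_errors), 1)
            self.prioritized.append((priority, trajectory))
            self.prioritized.sort(key=lambda pt: pt[0], reverse=True)
            del self.prioritized[self.max_trajs:]  # drop lowest-priority trajectories

        def sample(self, batch_size):
            batch = []
            if self.prioritized:
                n_pri = int(batch_size * self.mix_ratio)
                priorities = [p for p, _ in self.prioritized]
                weights = priorities if sum(priorities) > 0 else None
                trajs = random.choices([t for _, t in self.prioritized],
                                       weights=weights, k=n_pri)
                # Draw one step from each chosen trajectory.
                batch.extend(random.choice(t) for t in trajs)
            n_uni = min(batch_size - len(batch), len(self.uniform))
            batch.extend(random.sample(list(self.uniform), n_uni))
            return batch

A production version would also apply importance-sampling corrections to the prioritized half of the batch, as is standard for prioritized replay; the sketch omits this for brevity.

The KL-regularized learning described in the abstract can likewise be sketched as a penalty on the divergence between the learner's current policy and the stale behavior policy that generated an actor's data. Since DDPG's actor is deterministic, the Gaussian policy head assumed here (and the kl_coef weight) are illustrative rather than the paper's formulation.

    import torch
    import torch.distributions as D

    def kl_regularized_actor_loss(q_values, mu, sigma, mu_old, sigma_old,
                                  kl_coef=0.01):
        # DDPG-style objective (maximize Q) plus a KL penalty that keeps the
        # updated policy close to the behavior policy of asynchronous actors.
        pi_new = D.Normal(mu, sigma)
        pi_old = D.Normal(mu_old.detach(), sigma_old.detach())
        kl = D.kl_divergence(pi_old, pi_new).sum(-1).mean()
        return -q_values.mean() + kl_coef * kl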


Download File

124660.pdf - Published Version
Available under License Creative Commons Attribution.

Download (2MB)
Official URL: https://dx.plos.org/10.1371/journal.pone.0334411

Additional Metadata

Item Type: Article
Subject: Multidisciplinary
Divisions: Faculty of Computer Science and Information Technology
DOI Number: https://doi.org/10.1371/journal.pone.0334411
Publisher: Public Library of Science
Keywords: Deep Deterministic Policy Gradient; Dual Experience Replay; Prioritized Sampling; Temporal Diversity; Asynchronous Distributed Reinforcement Learning; Continuous Control; Sample Utilization; Policy Instability; KL-Regularization; Trajectory-Level Prioritization
Sustainable Development Goals (SDGs): SDG 9: Industry, Innovation and Infrastructure, SDG 17: Partnerships for the Goals, SDG 4: Quality Education
Depositing User: MS. HADIZAH NORDIN
Date Deposited: 21 Apr 2026 06:44
Last Modified: 21 Apr 2026 06:44
Altmetrics: http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.1371/journal.pone.0334411
URI: http://psasir.upm.edu.my/id/eprint/124660
