Population-Based Training

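What PBT Does

PBT trains N policies in parallel (a "population") on the same task.
Every interval_steps:
1. Save each policy's checkpoint and objective.
2. Score the population and identify leaders and underperformers.
3. For each underperformer, replace its weights with those of a random leader and mutate the selected hyperparameters.
4. Automatically restart the process with the new weights/params.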

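Leader / Underperformer Selection

Let o_i be the objective of each initialized policy, with mean μ and standard deviation σ.

The upper and lower performance cuts are: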

upper_cut = max(μ + threshold_std * σ, μ + threshold_abs)
lower_cut = min(μ - threshold_std * σ, μ - threshold_abs)

Leaders: o_i > upper_cut
Underperformers: o_i < lower_cut
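
As a worked illustration of the cuts above, here is a minimal Python sketch. The function name `classify_population` is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def classify_population(objectives, threshold_std, threshold_abs):
    """Return (leaders, underperformers) as lists of policy indices.

    Hypothetical helper: it only mirrors the upper_cut / lower_cut formulas
    from the text and is not the library's actual code.
    """
    mu = mean(objectives)
    sigma = pstdev(objectives)  # assumption: population standard deviation
    upper_cut = max(mu + threshold_std * sigma, mu + threshold_abs)
    lower_cut = min(mu - threshold_std * sigma, mu - threshold_abs)
    leaders = [i for i, o in enumerate(objectives) if o > upper_cut]
    underperformers = [i for i, o in enumerate(objectives) if o < lower_cut]
    return leaders, underperformers


# Five policies with made-up objective values.
objectives = [0.10, 0.50, 0.52, 0.48, 0.90]
leaders, underperformers = classify_population(
    objectives, threshold_std=0.1, threshold_abs=0.025
)
print(leaders, underperformers)
```

Policies that fall between the two cuts belong to neither group and are left alone.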

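"Natural selection" rules:

Only underperformers are acted on (mutated or replaced).
If leaders exist, an underperformer is replaced by a random leader; otherwise, it self-mutates.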
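
The rule above can be sketched as a small decision function. This is an illustrative sketch only: `plan_actions` and the action labels `"replace_from"` and `"self_mutate"` are hypothetical names, not identifiers from the implementation.

```python
import random


def plan_actions(leaders, underperformers, rng):
    """Map each underperformer index to an (action, source_index) tuple.

    Policies that are not underperformers never appear in the result,
    because only underperformers are acted on.
    """
    actions = {}
    for idx in underperformers:
        if leaders:
            # Copy weights from a randomly chosen leader.
            actions[idx] = ("replace_from", rng.choice(leaders))
        else:
            # No leader available: keep own weights, mutate own hyperparameters.
            actions[idx] = ("self_mutate", idx)
    return actions


rng = random.Random(0)
with_leader = plan_actions(leaders=[4], underperformers=[0], rng=rng)
without_leader = plan_actions(leaders=[], underperformers=[0, 2], rng=rng)
print(with_leader)
print(without_leader)
```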

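Mutation (Hyperparameters)

Each param has a mutation function (e.g., mutate_float, mutate_discount, etc.).
A param is mutated with probability mutation_rate.
When mutated, its value is perturbed within change_range = (min, max).
Only whitelisted keys (from the PBT config) are considered.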
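
To make the mutation rules concrete, here is a hedged Python sketch. The text does not specify how the perturbation is applied, so the bodies below are assumptions: `mutate_float` multiplies or divides by a random factor drawn from change_range, and `mutate_discount` applies the same perturbation to 1 - value (which assumes a value close to 1, as is typical for gamma and tau). The helper `mutate_params` is a hypothetical name.

```python
import random


def mutate_float(value, change_range, rng):
    """Assumed behavior: scale up or down by a factor from change_range."""
    factor = rng.uniform(*change_range)
    return value * factor if rng.random() < 0.5 else value / factor


def mutate_discount(value, change_range, rng):
    """Assumed behavior: perturb 1 - value, for discount-like values near 1."""
    return 1.0 - mutate_float(1.0 - value, change_range, rng)


MUTATIONS = {"mutate_float": mutate_float, "mutate_discount": mutate_discount}


def mutate_params(params, whitelist, mutation_rate, change_range, rng):
    """Return a mutated copy of params; keys outside the whitelist are untouched."""
    new_params = dict(params)
    for key, fn_name in whitelist.items():
        # Each whitelisted key is mutated independently with probability mutation_rate.
        if key in new_params and rng.random() < mutation_rate:
            new_params[key] = MUTATIONS[fn_name](new_params[key], change_range, rng)
    return new_params


params = {"learning_rate": 1e-3, "gamma": 0.99, "other": 5}
whitelist = {"learning_rate": "mutate_float", "gamma": "mutate_discount"}
mutated = mutate_params(
    params, whitelist, mutation_rate=1.0, change_range=(1.1, 2.0), rng=random.Random(0)
)
print(mutated)
```

With mutation_rate=1.0 every whitelisted key changes; with the example config's 0.25, roughly one in four whitelisted keys would change per mutation event under this sketch.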

Example Config

pbt:
  enabled: True
  policy_idx: 0
  num_policies: 8
  directory: .
  workspace: "pbt_workspace"
  objective: episode.Curriculum/difficulty_level
  interval_steps: 50000000
  threshold_std: 0.1
  threshold_abs: 0.025
  mutation_rate: 0.25
  change_range: [1.1, 2.0]
  mutation:
    agent.params.config.learning_rate: "mutate_float"
    agent.params.config.grad_norm: "mutate_float"
    agent.params.config.entropy_coef: "mutate_float"
    agent.params.config.critic_coef: "mutate_float"
    agent.params.config.bounds_loss_coef: "mutate_float"
    agent.params.config.kl_threshold: "mutate_float"
    agent.params.config.gamma: "mutate_discount"
    agent.params.config.tau: "mutate_discount"

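objective: episode.Curriculum/difficulty_level is a dotted expression that uses infos["episode"]["Curriculum/difficulty_level"] as the scalar for ranking policies (higher is better). With num_policies: 8, launch eight processes that share the same workspace and each have a unique policy_idx (0-7).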
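
The following Python sketch shows one way such a dotted expression could be resolved against a nested infos dictionary. `resolve_objective` is a hypothetical helper and the sample value is made up; the actual parsing code is not shown in this document.

```python
def resolve_objective(infos, expression):
    """Walk a nested dict using a dot-separated key path and return a float.

    Only "." separates keys, so "Curriculum/difficulty_level" stays one key.
    """
    value = infos
    for key in expression.split("."):
        value = value[key]
    return float(value)


# Made-up infos payload for illustration.
infos = {"episode": {"Curriculum/difficulty_level": 3.0}}
score = resolve_objective(infos, "episode.Curriculum/difficulty_level")
print(score)
```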

Launching PBT

You must start one process per policy and point them all to the same workspace. Give each process a unique policy_idx and the same num_policies.

Minimal flags you need:

agent.pbt.enabled=True
agent.pbt.directory=<path/to/shared_folder>
agent.pbt.policy_idx=<0..num_policies-1>

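Note

All processes must use the same agent.pbt.workspace so that they can see each other's checkpoints.

Caution

PBT is currently supported only with the rl_games library. Other RL libraries are not supported yet.

Tips

Keep checkpoints reasonable: reduce interval_steps only if you really need a tighter PBT cadence.
Use larger threshold_std and threshold_abs for greater population diversity.
Running 6 or more workers is recommended to see the benefit of PBT.

Training Example

We provide a reference PPO config for the task Isaac-Dexsuite-Kuka-Allegro-Lift-v0. For the best logging experience, we recommend using wandb for logging in the script.

Launch N workers, where n is the index of each worker: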

# Run this once per worker (n = 0..N-1), all pointing to the same directory/workspace
./isaaclab.sh -p scripts/reinforcement_learning/rl_games/train.py \
  --seed=<n> \
  --task=Isaac-Dexsuite-Kuka-Allegro-Lift-v0 \
  --num_envs=8192 \
  --headless \
  --track \
  --wandb-name=idx<n> \
  --wandb-entity=<**entity**> \
  --wandb-project-name=<**project**> \
  agent.pbt.enabled=True \
  agent.pbt.num_policies=<N> \
  agent.pbt.policy_idx=<n> \
  agent.pbt.workspace=<**pbt_workspace_name**> \
  agent.pbt.directory=<**/path/to/shared_folder**>
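
Since the command above has to be repeated once per worker, a small script can generate the N variants. The sketch below only builds and prints the command lines; it does not launch anything. `build_worker_commands` is a hypothetical helper, the workspace and directory values are placeholders, and the wandb flags are omitted for brevity.

```python
import shlex


def build_worker_commands(num_policies, workspace, directory, task):
    """Return one argv list per worker, differing only in seed and policy_idx."""
    commands = []
    for n in range(num_policies):
        commands.append([
            "./isaaclab.sh", "-p",
            "scripts/reinforcement_learning/rl_games/train.py",
            f"--seed={n}",
            f"--task={task}",
            "--headless",
            "agent.pbt.enabled=True",
            f"agent.pbt.num_policies={num_policies}",
            f"agent.pbt.policy_idx={n}",
            # Shared by all workers so they can see each other's checkpoints.
            f"agent.pbt.workspace={workspace}",
            f"agent.pbt.directory={directory}",
        ])
    return commands


commands = build_worker_commands(
    num_policies=3,
    workspace="pbt_workspace",         # placeholder value
    directory="/path/to/shared_folder",  # placeholder value
    task="Isaac-Dexsuite-Kuka-Allegro-Lift-v0",
)
for argv in commands:
    print(shlex.join(argv))
```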

References

This PBT implementation reimplements, and is inspired by, Dexpbt: Scaling up dexterous manipulation for hand-arm systems with population based training (Petrenko et al., 2023).

@article{petrenko2023dexpbt,
  title={Dexpbt: Scaling up dexterous manipulation for hand-arm systems with population based training},
  author={Petrenko, Aleksei and Allshire, Arthur and State, Gavriel and Handa, Ankur and Makoviychuk, Viktor},
  journal={arXiv preprint arXiv:2305.12127},
  year={2023}
}