Data Collection on Ascend Devices Based on the FSDP Backend

Last updated: 08/14/2025.

This tutorial shows how to collect profiling data on Ascend devices when running GRPO with the FSDP backend.

Configuration

Global Collection Control: Use the configuration items in siirl/client/config/ppo_trainer.yaml to control the default collection mode.

The following parameters in ppo_trainer.yaml control collection:

enable: Whether to enable performance profiling.
save_path: The path to save collected data.
level: Collection level. Options are level_none, level0, level1, and level2:
  level_none: Disables all level-based data collection (turns off profiler_level).
  level0: Collects high-level application data, low-level NPU data, and operator execution details on the NPU.
  level1: In addition to level0, collects CANN-layer AscendCL data and AI Core performance metrics on the NPU.
  level2: In addition to level1, collects CANN-layer Runtime data and AI CPU metrics.
with_memory: Enables memory analysis (defaults to True).
record_shapes: Enables recording of tensor shapes (defaults to False).
with_npu: Enables collection of device-side performance data (defaults to True).
with_cpu: Enables collection of host-side performance data (defaults to True).
with_module: Enables recording of framework-level Python call stack information.
with_stack: Enables recording of operator call stack information.
analysis: Enables automatic data analysis.
discrete: Enables discrete mode, collecting performance data for each stage separately (defaults to False).
roles: Collection stages, used together with the discrete parameter. Options include generate, compute_reward, compute_old_log_prob, compute_ref_log_prob, compute_value, compute_advantage, train_critic, and train_actor.

all_ranks: Whether to collect data from all ranks.
ranks: List of ranks for which to collect data. If empty, no data is collected.
profile_steps: List of collection steps. For example, [2, 4] indicates that steps 2 and 4 will be collected. If set to null, no data is collected.
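
The way enable, all_ranks, ranks, and profile_steps combine can be sketched in Python as follows. The function name should_collect and the dictionary layout are hypothetical and chosen only for illustration; they are not part of siirl. The sketch also assumes that all_ranks, when True, takes precedence over the ranks list, which the descriptions above suggest but do not state outright.

```python
def should_collect(step, rank, profiler_cfg):
    """Return True if data would be collected for this step and rank.

    Hypothetical helper that mirrors the documented semantics; not siirl code.
    """
    if not profiler_cfg["enable"]:
        return False
    steps = profiler_cfg.get("profile_steps")
    if not steps or step not in steps:
        # null (None) or an empty list means no step is collected
        return False
    if profiler_cfg.get("all_ranks", False):
        # assumption: all_ranks overrides the ranks list
        return True
    # an empty or missing ranks list means no rank is collected
    return rank in (profiler_cfg.get("ranks") or [])


cfg = {"enable": True, "all_ranks": False, "ranks": [0, 1], "profile_steps": [2, 4]}
print(should_collect(2, 0, cfg))  # True: step 2 and rank 0 are both listed
print(should_collect(3, 0, cfg))  # False: step 3 is not in profile_steps
print(should_collect(4, 5, cfg))  # False: rank 5 is not in ranks
```

With this reading, setting ranks to an empty list while all_ranks is False collects nothing, even on a listed step.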

Example

Disable collection

profiler:
  enable: False # disable profile

End-to-end collection

profiler:
  steps: [1, 2, 5]
  discrete: False

The run_qwen2_5-7b-npu-e2e_prof.sh script is provided in examples/grpo_trainer for reference.

Discrete mode collection

profiler:
  discrete: True
  roles: ['generate', 'train_actor']

The discrete mode collection script run_qwen2_5-7b-npu-discrete_prof.sh is provided in examples/grpo_trainer for reference.
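
A misspelled stage name in roles is easy to miss, so it can help to check the list before launching a run. The following is a minimal sketch, assuming the stage names listed in the Configuration section are the complete set; VALID_ROLES and unknown_roles are hypothetical names introduced here and are not part of siirl.

```python
# Stage names as listed in the Configuration section (assumed to be complete).
VALID_ROLES = {
    "generate", "compute_reward", "compute_old_log_prob", "compute_ref_log_prob",
    "compute_value", "compute_advantage", "train_critic", "train_actor",
}


def unknown_roles(profiler_cfg):
    """Return the configured role names that are not documented stages."""
    if not profiler_cfg.get("discrete", False):
        # roles is only used together with discrete mode
        return []
    return sorted(set(profiler_cfg.get("roles", [])) - VALID_ROLES)


print(unknown_roles({"discrete": True, "roles": ["generate", "train_actor"]}))  # []
print(unknown_roles({"discrete": True, "roles": ["generate", "train_actr"]}))   # ['train_actr']
```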

Visualization

The collected data is stored in the user-defined save_path and can be visualized with the MindStudio Insight tool. For details, see <https://www.hiascend.com/document/detail/zh/mindstudio/80RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html>.

If the analysis parameter is set to False, offline analysis is required after collection:

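import argparse

from torch_npu.profiler.profiler import analyse

parser = argparse.ArgumentParser()
# Directory that contains the collected profiling data, e.g. the configured save_path.
parser.add_argument("--path", type=str, required=True)

if __name__ == "__main__":
    args = parser.parse_args()
    # Run the offline analysis on the collected data.
    analyse(profiler_path=args.path)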