Megatron-LM Training Backend

Introduction

This guide explains how to use the Megatron-LM backend in siiRL for RL training. Megatron-LM is a library for training very large transformer models, and integrating it as a backend enables efficient 5D parallelism: data (DP), tensor (TP), expert (EP), pipeline (PP), and context (CP) parallelism.

This example demonstrates how to fine-tune a Qwen3-8B model with the GRPO algorithm, using Megatron-LM as the training backend.

Step 1: Prepare the Dataset

First, ensure your dataset is in the required Parquet format. If you are using one of the example datasets like gsm8k or deepscaler, you can use the provided preprocessing scripts. For example, for deepscaler:

cd examples/data_preprocess
python3 deepscaler.py --local_dir ~/data/datasets/deepscaler

This will download and process the dataset, saving train.parquet and test.parquet in the specified directory.
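
Before moving on, it can help to confirm that both files were actually written. The following is a minimal sketch, not part of siiRL: `check_parquet_dir` is a hypothetical helper that checks each file exists, is non-empty, and begins with the 4-byte `PAR1` magic string that every Parquet file starts with.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of siiRL): sanity-check the preprocessing output.
check_parquet_dir() {
    local dir=$1 split file magic
    for split in train test; do
        file="$dir/$split.parquet"
        # -s is true only if the file exists and is non-empty.
        if [[ ! -s "$file" ]]; then
            echo "missing or empty: $file" >&2
            return 1
        fi
        # Every Parquet file starts with the 4-byte magic string "PAR1".
        magic=$(head -c 4 "$file")
        if [[ "$magic" != "PAR1" ]]; then
            echo "not a Parquet file: $file" >&2
            return 1
        fi
    done
    echo "ok: $dir"
}

# Usage: pass the same directory you gave to --local_dir, e.g.
#   check_parquet_dir /path/to/local_dir
```

This only inspects the file header; it does not validate the schema or row contents.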

Step 2: Download the Pre-trained Model

You need a base model to start training. For this example, we’ll use Qwen3-8B. Download it from Hugging Face or ModelScope to a local directory.

# For Hugging Face
huggingface-cli download Qwen/Qwen3-8B --local-dir ~/data/models/Qwen3-8B --local-dir-use-symlinks False

# For ModelScope
modelscope download --model Qwen/Qwen3-8B --local_dir ~/data/models/Qwen3-8B
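
An interrupted download can leave the model directory incomplete, which tends to surface only later as a loading error. The sketch below is a hypothetical pre-flight check, not a siiRL utility. It assumes the checkpoint follows the usual Hugging Face layout: a `config.json` plus one or more `*.safetensors` weight files. Adjust the glob if your copy uses a different weight format.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of siiRL): check that a downloaded model
# directory looks complete. The expected layout is an assumption.
check_model_dir() {
    local dir=$1
    if [[ ! -f "$dir/config.json" ]]; then
        echo "missing config.json in $dir" >&2
        return 1
    fi
    # compgen -G succeeds only if the glob matches at least one path.
    if ! compgen -G "$dir/*.safetensors" > /dev/null; then
        echo "no *.safetensors weight files in $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}

# Usage:
#   check_model_dir ~/data/models/Qwen3-8B
```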

Step 3: Configure and Run the Training Script

To use the Megatron-LM backend, you need to modify the training configuration in your run script.

Key Configuration Changes

The main change is to set the training strategy to megatron and configure its parallelism parameters.

Set the Strategy: In the TRAINING_CMD array, set actor_rollout_ref.actor.strategy=megatron.

Configure Parallelism: Add Megatron-specific settings for 5D parallelism. For an 8B model on a single node with 8 GPUs, you might use 2-way tensor parallelism and 4-way pipeline parallelism (2 × 4 = 8 GPUs), with sequence parallelism enabled.

actor_rollout_ref.actor.megatron.tensor_model_parallel_size=2
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=4
actor_rollout_ref.actor.megatron.context_parallel_size=1
actor_rollout_ref.actor.megatron.sequence_parallel=True
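
The parallel sizes have to fit the GPU count: in Megatron-LM the world size factors as TP × PP × CP × DP, so the data-parallel size is whatever remains after the other three are chosen. The sketch below shows that arithmetic. `data_parallel_size` is a hypothetical helper written for illustration, not a siiRL or Megatron-LM command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the data-parallel size from the world size
# and the tensor/pipeline/context parallel sizes.
data_parallel_size() {
    local world_size=$1 tp=$2 pp=$3 cp=$4
    local model_parallel=$(( tp * pp * cp ))
    # The world size must be an exact multiple of TP * PP * CP.
    if (( world_size % model_parallel != 0 )); then
        echo "error: world size $world_size is not divisible by TP*PP*CP=$model_parallel" >&2
        return 1
    fi
    echo $(( world_size / model_parallel ))
}

# The configuration above: 8 GPUs with TP=2, PP=4, CP=1.
data_parallel_size 8 2 4 1   # prints 1
```

With these settings a single 8-GPU node holds exactly one model replica (DP = 1). Running the same configuration on two nodes would give DP = 2.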

Configure Distributed Optimizer: Enable Megatron's distributed optimizer, which shards optimizer state across data-parallel ranks (similar to ZeRO-1). This reduces per-GPU memory use and is recommended for large models.
```
actor_rollout_ref.actor.megatron.use_distributed_optimizer=True
```
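
To get a feel for the savings, the sketch below evaluates the theoretical bytes-per-parameter formula from the Megatron-LM distributed optimizer documentation for bf16 parameters with fp32 gradients: 18 bytes without the distributed optimizer and 6 + 12/d with it, where d is the data-parallel size. The gradient dtype used by siiRL is an assumption here, and `bytes_per_param` is a hypothetical helper, so treat the numbers as a rough guide.

```bash
#!/usr/bin/env bash
# Theoretical bytes per parameter with the distributed optimizer,
# assuming bf16 params and fp32 grads: 6 + 12/d (d = data-parallel size).
# The non-distributed baseline for the same dtypes is 18 bytes per parameter.
bytes_per_param() {
    local dp=$1
    awk -v d="$dp" 'BEGIN { printf "%.1f\n", 6 + 12 / d }'
}

for dp in 1 2 4 8; do
    echo "DP=$dp -> $(bytes_per_param "$dp") bytes/param"
done
# DP=1 -> 18.0 bytes/param
# DP=2 -> 12.0 bytes/param
# DP=4 -> 9.0 bytes/param
# DP=8 -> 7.5 bytes/param
```

Under this formula the savings grow with the data-parallel size. At d = 1 the result equals the non-distributed baseline, so the option pays off mainly once training is scaled out across more data-parallel replicas.
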
Configure Offloading: Enable Megatron-specific offloading of parameters, gradients, and optimizer state to CPU to save GPU memory.
```
actor_rollout_ref.actor.megatron.param_offload=True
actor_rollout_ref.actor.megatron.grad_offload=True
actor_rollout_ref.actor.megatron.optimizer_offload=True
```

Complete Training Script

Below is a complete example script, run_qwen3-8b-megatron.sh, which is adapted from the standard GRPO script to use the Megatron backend. You will need to create this script yourself or adapt an existing one.

#!/usr/bin/env bash
# ===================================================================================
# ===                       USER CONFIGURATION SECTION                            ===
# ===================================================================================

# --- For debugging
export HYDRA_FULL_ERROR=1
export SIIRL_LOG_VERBOSITY=INFO

# --- Experiment and Model Definition ---
export DATASET=deepscaler
export ALG=grpo
export MODEL_NAME=qwen3-8b

# --- Path Definitions ---
export HOME=${HOME:-"/root"} # Set your home path
export TRAIN_DATA_PATH=$HOME/data/datasets/$DATASET/train.parquet
export TEST_DATA_PATH=$HOME/data/datasets/$DATASET/test.parquet
export MODEL_PATH=$HOME/data/models/Qwen3-8B

# Base output paths
export BASE_CKPT_PATH=$HOME/ckpts
export BASE_TENSORBOARD_PATH=$HOME/tensorboard

# --- Key Training Hyperparameters ---
export TRAIN_BATCH_SIZE_PER_NODE=128
export PPO_MINI_BATCH_SIZE_PER_NODE=16
export PPO_MICRO_BATCH_SIZE_PER_GPU=8
export MAX_PROMPT_LENGTH=1024
export MAX_RESPONSE_LENGTH=2048
export ROLLOUT_GPU_MEMORY_UTILIZATION=0.45
export ROLLOUT_N=8
export SAVE_FREQ=30
export TEST_FREQ=10
export TOTAL_EPOCHS=30
export MAX_CKPT_KEEP=5

# ---- Megatron Parallelism Configuration ----
export ACTOR_REF_TP=2
export ACTOR_REF_PP=4
export ACTOR_REF_CP=1
export ACTOR_REF_SP=True

# --- Distributed Training & Infrastructure ---
export N_GPUS_PER_NODE=${N_GPUS_PER_NODE:-8}
export NNODES=${PET_NNODES:-1}
export NODE_RANK=${PET_NODE_RANK:-0}
export MASTER_ADDR=${MASTER_ADDR:-localhost}

# --- Output Paths and Experiment Naming ---
timestamp=$(date +"%Y%m%d_%H%M%S")
export CKPT_PATH=${BASE_CKPT_PATH}/${MODEL_NAME}_${ALG}_${DATASET}_megatron_${NNODES}nodes
export PROJECT_NAME=siirl_${DATASET}_${ALG}
export EXPERIMENT_NAME=siirl_${MODEL_NAME}_${ALG}_${DATASET}_megatron_experiment
export TENSORBOARD_DIR=${BASE_TENSORBOARD_PATH}/${MODEL_NAME}_${ALG}_${DATASET}_megatron_tensorboard/dlc_${NNODES}_$timestamp
export SIIRL_LOGGING_FILENAME=${MODEL_NAME}_${ALG}_${DATASET}_megatron_${NNODES}_$timestamp

# --- Calculated Global Hyperparameters ---
export TRAIN_BATCH_SIZE=$(($TRAIN_BATCH_SIZE_PER_NODE * $NNODES))
export PPO_MINI_BATCH_SIZE=$(($PPO_MINI_BATCH_SIZE_PER_NODE * $NNODES))

# --- Define the Training Command and its Arguments ---
TRAINING_CMD=(
    python3 -m siirl.main_dag
    algorithm.adv_estimator=\$ALG
    data.train_files=\$TRAIN_DATA_PATH
    data.val_files=\$TEST_DATA_PATH
    data.train_batch_size=\$TRAIN_BATCH_SIZE
    data.max_prompt_length=\$MAX_PROMPT_LENGTH
    data.max_response_length=\$MAX_RESPONSE_LENGTH
    actor_rollout_ref.model.path=\$MODEL_PATH
    actor_rollout_ref.model.enable_gradient_checkpointing=True

    # --- Megatron Backend Configuration ---
    actor_rollout_ref.actor.strategy=megatron
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=\$ACTOR_REF_TP
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=\$ACTOR_REF_PP
    actor_rollout_ref.actor.megatron.context_parallel_size=\$ACTOR_REF_CP
    actor_rollout_ref.actor.megatron.sequence_parallel=\$ACTOR_REF_SP
    actor_rollout_ref.actor.megatron.use_distributed_optimizer=True
    actor_rollout_ref.actor.megatron.param_dtype=bfloat16
    actor_rollout_ref.actor.megatron.param_offload=False

    # --- PPO & Other Hyperparameters ---
    actor_rollout_ref.actor.optim.lr=1e-6
    actor_rollout_ref.actor.ppo_mini_batch_size=\$PPO_MINI_BATCH_SIZE
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=\$PPO_MICRO_BATCH_SIZE_PER_GPU
    actor_rollout_ref.actor.grad_clip=1.0

    # --- Rollout (vLLM) Configuration ---
    actor_rollout_ref.rollout.tensor_model_parallel_size=\$ACTOR_REF_TP
    actor_rollout_ref.rollout.name=vllm
    actor_rollout_ref.rollout.gpu_memory_utilization=\$ROLLOUT_GPU_MEMORY_UTILIZATION
    actor_rollout_ref.rollout.n=\$ROLLOUT_N
    actor_rollout_ref.rollout.prompt_length=\$MAX_PROMPT_LENGTH
    actor_rollout_ref.rollout.response_length=\$MAX_RESPONSE_LENGTH

    # --- Trainer Configuration ---
    trainer.logger=['console','tensorboard']
    trainer.project_name=\$PROJECT_NAME
    trainer.experiment_name=\$EXPERIMENT_NAME
    trainer.n_gpus_per_node=\$N_GPUS_PER_NODE
    trainer.nnodes=\$NNODES
    trainer.save_freq=\$SAVE_FREQ
    trainer.test_freq=\$TEST_FREQ
    trainer.total_epochs=\$TOTAL_EPOCHS
    trainer.resume_mode=auto
    trainer.max_actor_ckpt_to_keep=\$MAX_CKPT_KEEP
    trainer.default_local_dir=\$CKPT_PATH
    trainer.val_before_train=True
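)

# --- Launch Training ---
# The \$VAR references above are escaped, so they are expanded only here,
# when eval re-parses the assembled command.
eval "${TRAINING_CMD[@]}"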
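
The batch-size settings interact, so it is worth working out the derived quantities before launching. The sketch below repeats the script's own arithmetic for a single node. It also assumes, as is usual for GRPO, that each prompt is sampled ROLLOUT_N times per step. RESPONSES_PER_STEP and MAX_SEQ_LENGTH are illustrative names, not siiRL settings.

```bash
#!/usr/bin/env bash
# Values copied from the script above (single node).
TRAIN_BATCH_SIZE_PER_NODE=128
PPO_MINI_BATCH_SIZE_PER_NODE=16
ROLLOUT_N=8
MAX_PROMPT_LENGTH=1024
MAX_RESPONSE_LENGTH=2048
NNODES=1

# Same formulas as the "Calculated Global Hyperparameters" section.
TRAIN_BATCH_SIZE=$(( TRAIN_BATCH_SIZE_PER_NODE * NNODES ))
PPO_MINI_BATCH_SIZE=$(( PPO_MINI_BATCH_SIZE_PER_NODE * NNODES ))

# Illustrative derived values (assumption: ROLLOUT_N samples per prompt).
RESPONSES_PER_STEP=$(( TRAIN_BATCH_SIZE * ROLLOUT_N ))
MAX_SEQ_LENGTH=$(( MAX_PROMPT_LENGTH + MAX_RESPONSE_LENGTH ))

echo "prompts per step:        $TRAIN_BATCH_SIZE"
echo "PPO mini-batch size:     $PPO_MINI_BATCH_SIZE"
echo "responses per step:      $RESPONSES_PER_STEP"
echo "max tokens per sequence: $MAX_SEQ_LENGTH"
# prompts per step:        128
# PPO mini-batch size:     16
# responses per step:      1024
# max tokens per sequence: 3072
```

If memory is tight, MAX_RESPONSE_LENGTH and PPO_MICRO_BATCH_SIZE_PER_GPU are the first values to lower, since they bound the number of tokens processed at once on each GPU.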

Step 4: Checking the Results

During training, you can monitor the progress through several means:

Console Logs: The console will output detailed logs. Look for initialization messages from the Megatron backend to confirm it’s being used. You should see logs pertaining to the setup of 5D parallelism.
TensorBoard: If you enabled the tensorboard logger, you can monitor training metrics in real time.
```
tensorboard --logdir $HOME/tensorboard
```
Navigate to the TensorBoard URL in your browser to view metrics such as reward, KL divergence, and loss curves.
Checkpoints: Checkpoints will be saved in the directory specified by CKPT_PATH. You can use these to resume training or for inference later.