Megatron-LM Training Backend
============================

Introduction
------------

This guide explains how to use the Megatron-LM backend in siiRL for RL training. Megatron-LM is a library for training very large transformer models; using it as a backend enables 5D parallelism: data (DP), tensor (TP), expert (EP), pipeline (PP), and context (CP) parallelism.

This example demonstrates how to fine-tune a `Qwen3-8B` model using the GRPO algorithm with Megatron-LM as the training backend.

Step 1: Prepare the Dataset
---------------------------

First, ensure your dataset is in the required Parquet format. If you are using one of the example datasets like `gsm8k` or `deepscaler`, you can use the provided preprocessing scripts. For example, for `deepscaler`:

.. code:: bash

   cd examples/data_preprocess
   python3 deepscaler.py --local_dir ~/data/datasets/deepscaler

This will download and process the dataset, saving `train.parquet` and `test.parquet` in the specified directory.
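
Before moving on, it can help to confirm that both files exist where the training script in Step 3 reads them (`$HOME/data/datasets/deepscaler`). The following is a minimal sketch; the helper name `check_parquet_dir` is hypothetical and not part of siiRL.

.. code:: bash

   # Hypothetical helper: succeed only if both Parquet files exist and are non-empty.
   check_parquet_dir() {
       local dir="$1" name
       for name in train.parquet test.parquet; do
           if [ ! -s "$dir/$name" ]; then
               echo "missing or empty: $dir/$name" >&2
               return 1
           fi
       done
       echo "ok: $dir"
   }

   # Directory assumed from the path definitions in the training script.
   check_parquet_dir "$HOME/data/datasets/deepscaler" \
       || echo "Dataset incomplete; re-run the preprocessing script."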

Step 2: Download the Pre-trained Model
--------------------------------------

You need a base model to start training. For this example, we'll use `Qwen3-8B`. Download it from Hugging Face or ModelScope to a local directory.

.. code:: bash

   # For Hugging Face
   huggingface-cli download Qwen/Qwen3-8B --local-dir ~/data/models/Qwen3-8B --local-dir-use-symlinks False

   # For ModelScope
   modelscope download --model Qwen/Qwen3-8B --local_dir ~/data/models/Qwen3-8B

Step 3: Configure and Run the Training Script
---------------------------------------------

To use the Megatron-LM backend, you need to modify the training configuration in your run script.

Key Configuration Changes
~~~~~~~~~~~~~~~~~~~~~~~~~

The main change is to set the training strategy to `megatron` and configure its parallelism parameters.

1.  **Set the Strategy**: In the `TRAINING_CMD` array, set `actor_rollout_ref.actor.strategy=megatron`.
2.  **Configure Parallelism**: Add the Megatron-specific parallelism settings. For an 8B model on a single node with 8 GPUs, you might use 2-way tensor parallelism and 4-way pipeline parallelism, with sequence parallelism enabled.

    .. code-block:: text

        actor_rollout_ref.actor.megatron.tensor_model_parallel_size=2
        actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=4
        actor_rollout_ref.actor.megatron.context_parallel_size=1
        actor_rollout_ref.actor.megatron.sequence_parallel=True

3.  **Configure Distributed Optimizer**: Enable Megatron's distributed optimizer, which shards optimizer states across data-parallel ranks (similar to ZeRO-1). This reduces per-GPU memory use and is recommended for large models.

    .. code-block:: text

        actor_rollout_ref.actor.megatron.use_distributed_optimizer=True

4.  **Configure Offloading**: Optionally offload parameters, gradients, and optimizer states to CPU. This saves GPU memory at the cost of extra host-device transfers.

    .. code-block:: text

        actor_rollout_ref.actor.megatron.param_offload=True
        actor_rollout_ref.actor.megatron.grad_offload=True
        actor_rollout_ref.actor.megatron.optimizer_offload=True
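
The parallel sizes chosen above must fit the available GPUs. In Megatron-LM, the data-parallel size is what remains after dividing the total GPU count by the product of the tensor, pipeline, and context parallel sizes, so that product has to divide the GPU count evenly. A minimal sketch of this check follows; the helper name `data_parallel_size` is hypothetical and not part of siiRL or Megatron-LM.

.. code-block:: bash

    # Hypothetical helper: print the data-parallel size implied by
    # world_size / (TP * PP * CP), or fail if the sizes do not divide evenly.
    data_parallel_size() {
        local world_size="$1" tp="$2" pp="$3" cp="$4"
        local model_parallel=$((tp * pp * cp))
        if [ $((world_size % model_parallel)) -ne 0 ]; then
            echo "world size $world_size is not divisible by TP*PP*CP=$model_parallel" >&2
            return 1
        fi
        echo $((world_size / model_parallel))
    }

    # 1 node x 8 GPUs with TP=2, PP=4, CP=1, as in the settings above.
    data_parallel_size 8 2 4 1

With these settings the data-parallel size is 1, so all eight GPUs hold a single model replica. Doubling the node count with the same TP, PP, and CP values would give a data-parallel size of 2.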

Complete Training Script
~~~~~~~~~~~~~~~~~~~~~~~~

Below is a complete example script, `run_qwen3-8b-megatron.sh`, which is adapted from the standard GRPO script to use the Megatron backend. You will need to create this script yourself or adapt an existing one.

.. code-block:: bash

    #!/usr/bin/env bash
    # ===================================================================================
    # ===                       USER CONFIGURATION SECTION                            ===
    # ===================================================================================

    # --- For debugging
    export HYDRA_FULL_ERROR=1
    export SIIRL_LOG_VERBOSITY=INFO

    # --- Experiment and Model Definition ---
    export DATASET=deepscaler
    export ALG=grpo
    export MODEL_NAME=qwen3-8b

    # --- Path Definitions ---
    export HOME=${HOME:-"/root"} # Set your home path
    export TRAIN_DATA_PATH=$HOME/data/datasets/$DATASET/train.parquet
    export TEST_DATA_PATH=$HOME/data/datasets/$DATASET/test.parquet
    export MODEL_PATH=$HOME/data/models/Qwen3-8B

    # Base output paths
    export BASE_CKPT_PATH=$HOME/ckpts
    export BASE_TENSORBOARD_PATH=$HOME/tensorboard

    # --- Key Training Hyperparameters ---
    export TRAIN_BATCH_SIZE_PER_NODE=128
    export PPO_MINI_BATCH_SIZE_PER_NODE=16
    export PPO_MICRO_BATCH_SIZE_PER_GPU=8
    export MAX_PROMPT_LENGTH=1024
    export MAX_RESPONSE_LENGTH=2048
    export ROLLOUT_GPU_MEMORY_UTILIZATION=0.45
    export ROLLOUT_N=8
    export SAVE_FREQ=30
    export TEST_FREQ=10
    export TOTAL_EPOCHS=30
    export MAX_CKPT_KEEP=5

    # ---- Megatron Parallelism Configuration ----
    export ACTOR_REF_TP=2
    export ACTOR_REF_PP=4
    export ACTOR_REF_CP=1
    export ACTOR_REF_SP=True

    # --- Distributed Training & Infrastructure ---
    export N_GPUS_PER_NODE=${N_GPUS_PER_NODE:-8}
    export NNODES=${PET_NNODES:-1}
    export NODE_RANK=${PET_NODE_RANK:-0}
    export MASTER_ADDR=${MASTER_ADDR:-localhost}

    # --- Output Paths and Experiment Naming ---
    timestamp=$(date +"%Y%m%d_%H%M%S")
    export CKPT_PATH=${BASE_CKPT_PATH}/${MODEL_NAME}_${ALG}_${DATASET}_megatron_${NNODES}nodes
    export PROJECT_NAME=siirl_${DATASET}_${ALG}
    export EXPERIMENT_NAME=siirl_${MODEL_NAME}_${ALG}_${DATASET}_megatron_experiment
    export TENSORBOARD_DIR=${BASE_TENSORBOARD_PATH}/${MODEL_NAME}_${ALG}_${DATASET}_megatron_tensorboard/dlc_${NNODES}_$timestamp
    export SIIRL_LOGGING_FILENAME=${MODEL_NAME}_${ALG}_${DATASET}_megatron_${NNODES}_$timestamp

    # --- Calculated Global Hyperparameters ---
    export TRAIN_BATCH_SIZE=$(($TRAIN_BATCH_SIZE_PER_NODE * $NNODES))
    export PPO_MINI_BATCH_SIZE=$(($PPO_MINI_BATCH_SIZE_PER_NODE * $NNODES))

    # --- Define the Training Command and its Arguments ---
    TRAINING_CMD=(
        python3 -m siirl.main_dag
        algorithm.adv_estimator=\$ALG
        data.train_files=\$TRAIN_DATA_PATH
        data.val_files=\$TEST_DATA_PATH
        data.train_batch_size=\$TRAIN_BATCH_SIZE
        data.max_prompt_length=\$MAX_PROMPT_LENGTH
        data.max_response_length=\$MAX_RESPONSE_LENGTH
        actor_rollout_ref.model.path=\$MODEL_PATH
        actor_rollout_ref.model.enable_gradient_checkpointing=True
        
        # --- Megatron Backend Configuration ---
        actor_rollout_ref.actor.strategy=megatron
        actor_rollout_ref.actor.megatron.tensor_model_parallel_size=\$ACTOR_REF_TP
        actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=\$ACTOR_REF_PP
        actor_rollout_ref.actor.megatron.context_parallel_size=\$ACTOR_REF_CP
        actor_rollout_ref.actor.megatron.sequence_parallel=\$ACTOR_REF_SP
        actor_rollout_ref.actor.megatron.use_distributed_optimizer=True
        actor_rollout_ref.actor.megatron.param_dtype=bfloat16
        actor_rollout_ref.actor.megatron.param_offload=False
        
        # --- PPO & Other Hyperparameters ---
        actor_rollout_ref.actor.optim.lr=1e-6
        actor_rollout_ref.actor.ppo_mini_batch_size=\$PPO_MINI_BATCH_SIZE
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=\$PPO_MICRO_BATCH_SIZE_PER_GPU
        actor_rollout_ref.actor.grad_clip=1.0
        
        # --- Rollout (vLLM) Configuration ---
        actor_rollout_ref.rollout.tensor_model_parallel_size=\$ACTOR_REF_TP
        actor_rollout_ref.rollout.name=vllm
        actor_rollout_ref.rollout.gpu_memory_utilization=\$ROLLOUT_GPU_MEMORY_UTILIZATION
        actor_rollout_ref.rollout.n=\$ROLLOUT_N
        actor_rollout_ref.rollout.prompt_length=\$MAX_PROMPT_LENGTH  
        actor_rollout_ref.rollout.response_length=\$MAX_RESPONSE_LENGTH
        
        # --- Trainer Configuration ---
        trainer.logger=['console','tensorboard']
        trainer.project_name=\$PROJECT_NAME
        trainer.experiment_name=\$EXPERIMENT_NAME
        trainer.n_gpus_per_node=\$N_GPUS_PER_NODE
        trainer.nnodes=\$NNODES
        trainer.save_freq=\$SAVE_FREQ
        trainer.test_freq=\$TEST_FREQ
        trainer.total_epochs=\$TOTAL_EPOCHS
        trainer.resume_mode=auto
        trainer.max_actor_ckpt_to_keep=\$MAX_CKPT_KEEP
        trainer.default_local_dir=\$CKPT_PATH
        trainer.val_before_train=True
    )
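
The listing stops after defining `TRAINING_CMD`; nothing is launched yet. Every variable reference in the array is backslash-escaped, so the elements hold the literal text `$NAME` rather than its value. That suggests the command is meant to be expanded at launch time, for example with `eval`. This is an assumption about the intended launch step, not something the listing shows. The pattern can be illustrated with a harmless `echo` in place of the real command (`DEMO_CMD` is a made-up name):

.. code-block:: bash

    export ALG=grpo

    # Same quoting style as TRAINING_CMD: \$ALG is stored as the literal text $ALG.
    DEMO_CMD=(
        echo
        algorithm.adv_estimator=\$ALG
    )

    printf '%s\n' "${DEMO_CMD[1]}"   # prints: algorithm.adv_estimator=$ALG
    eval "${DEMO_CMD[@]}"            # prints: algorithm.adv_estimator=grpo

Under that assumption, a line such as `eval "${TRAINING_CMD[@]}"` at the end of the script would run the training command with all variables substituted. Multi-node runs additionally need whatever cluster setup your environment uses.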

Step 4: Checking the Results
----------------------------

During training, you can monitor the progress through several means:

1.  **Console Logs**: The console will output detailed logs. Look for initialization messages from the Megatron backend to confirm it's being used. You should see logs pertaining to the setup of 5D parallelism.

2.  **TensorBoard**: If you enabled the `tensorboard` logger, you can monitor training metrics in real time.
    
    .. code:: bash

       tensorboard --logdir $HOME/tensorboard

    Navigate to the TensorBoard URL in your browser to view metrics such as reward, KL divergence, and loss curves.

3.  **Checkpoints**: Checkpoints will be saved in the directory specified by `CKPT_PATH`. You can use these to resume training or for inference later.
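
    To see what has been saved so far, you can list the checkpoint directory by modification time. This is a minimal sketch: the helper name `latest_entry` is hypothetical, and no particular layout inside the checkpoint directory is assumed.

    .. code:: bash

       # Hypothetical helper: print the most recently modified entry in a directory.
       latest_entry() {
           ls -1t "$1" 2>/dev/null | head -n 1
       }

       # Falls back to the base checkpoint path from the training script if CKPT_PATH is unset.
       latest_entry "${CKPT_PATH:-$HOME/ckpts}"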