MetaX(沐曦) GPU
===============

siiRL also supports MetaX GPU devices. This guide has been tested with the following hardware:

- 曦云 (Xiyun) series GPUs

Installation Process
--------------------

Recommended Base Image
^^^^^^^^^^^^^^^^^^^^^^

For a smoother setup, we strongly recommend using the pre-built Docker image, which includes all necessary dependencies. For more information on MetaX images, see the MetaX developer website: https://developer.metax-tech.com/softnova/docker. After pulling the image, you only need to install the siiRL framework from source.

.. code-block:: bash

    docker pull siiai/siirl-metax:maca.ai3.1.0.1-torch2.6-py310-ubuntu22.04-amd64

Start the Docker Container
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash
    
    docker run -d -t --net=host --uts=host --ipc=host --privileged=true --group-add video \
    --shm-size 100gb --ulimit memlock=-1 --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined --device=/dev/dri --device=/dev/mxcd --device=/dev/infiniband \
    -v /data/:/data/ \
    --name siirl \
    siiai/siirl-metax:maca.ai3.1.0.1-torch2.6-py310-ubuntu22.04-amd64 bash
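
If the container fails to start, one thing worth ruling out is a missing device node on the host. The following is a minimal sketch that checks whether the three paths passed via ``--device`` in the command above exist. The helper name ``check_device_nodes`` is hypothetical and is not part of siiRL or the MetaX toolchain.

.. code-block:: bash

    # Hypothetical helper: print each path that does not exist and
    # return non-zero if at least one path is missing.
    check_device_nodes() {
        local missing=0
        local path
        for path in "$@"; do
            if [ ! -e "$path" ]; then
                echo "missing: $path"
                missing=1
            fi
        done
        return "$missing"
    }

    # The device paths are taken from the docker run command above.
    if check_device_nodes /dev/dri /dev/mxcd /dev/infiniband; then
        echo "All device nodes found."
    else
        echo "Some device nodes are missing; check the driver installation." >&2
    fi

On a host without InfiniBand hardware, ``/dev/infiniband`` will be reported as missing; in that case the corresponding ``--device`` flag likely needs to be dropped from the ``docker run`` command.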

SiiRL Installation
^^^^^^^^^^^^^^^^^^

Finally, install the siiRL framework itself. Do not install a packaged siiRL release with ``pip install``, because it causes dependency conflicts. Instead, clone the repository and install from source:

.. code-block:: bash

    git clone https://github.com/sii-research/siiRL.git
    cd siiRL
    # Before installing, comment out the packages in requirements.txt that the
    # image already provides in MetaX-adapted builds (such as ray and vllm),
    # so that pip does not overwrite them. For example:
    # vllm>=0.8.5.post1
    # ray[default]>=2.47.1
    pip install -r requirements.txt
    pip install -e .
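
The manual edit of ``requirements.txt`` can also be scripted. The sketch below assumes GNU ``sed`` (as shipped in the Ubuntu-based image) and that each package name starts at the beginning of a line; it prefixes the matching requirement lines with ``#``. The function name ``comment_out_requirements`` is hypothetical.

.. code-block:: bash

    # Hypothetical helper: comment out the given packages in a requirements file.
    # Usage: comment_out_requirements <file> <package>...
    comment_out_requirements() {
        local file="$1"
        shift
        local pkg
        for pkg in "$@"; do
            # Match the package name at the start of a line, followed either by
            # the end of the line or by a character that cannot be part of a
            # package name (for example "[", ">" or "=").
            sed -i -E "s/^(${pkg}([^A-Za-z0-9_.-].*)?)\$/# \\1/" "$file"
        done
    }

    # Run inside the cloned repository before `pip install -r requirements.txt`.
    if [ -f requirements.txt ]; then
        comment_out_requirements requirements.txt vllm ray
    fi

Lines that are already commented out, and packages whose names merely begin with ``ray`` or ``vllm``, are left untouched.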

Add Environment Variables for MetaX
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # mx gpu env
    export MACA_PATH=/opt/maca
    export CUCC_PATH=${MACA_PATH}/tools/cu-bridge
    export CUDA_PATH=${CUCC_PATH}
    export MACA_CLANG_PATH=$MACA_PATH/mxgpu_llvm/bin
    export PATH=${CUDA_PATH}/bin:${MACA_CLANG_PATH}:${PATH}
    export LD_LIBRARY_PATH=${MACA_PATH}/tools/cu-bridge/lib/:${MACA_PATH}/lib:${MACA_PATH}/ompi/lib:${MACA_PATH}/mxgpu_llvm/lib:${LD_LIBRARY_PATH}
    export PYTORCH_ENABLE_SAME_RAND_A100=1
    export MCPYTORCH_DISABLE_PRINT=1
    export MAX_JOBS=20
    export VLLM_USE_V1=0
    export MCCL_ENABLE_FC=0
    export MCCL_MAX_NCHANNELS=8
    export PYTHONUNBUFFERED=1

    export MCCL_IB_HCA=mlx5
    export MCCL_SOCKET_IFNAME=ens1
    export GLOO_SOCKET_IFNAME=ens1
    export SOCKET_NIC=ens1
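
The interface name ``ens1`` and the HCA prefix ``mlx5`` are presumably specific to the machines this guide was tested on, so they may need adjusting. As a rough sanity check after exporting the variables, the sketch below lists any that are still unset and verifies that the configured interface exists on a Linux host. The helper name ``report_unset_vars`` is hypothetical.

.. code-block:: bash

    # Hypothetical helper: print the names of variables that are unset or empty.
    report_unset_vars() {
        local name
        for name in "$@"; do
            # ${!name} is bash indirect expansion: the value of the variable
            # whose name is stored in $name.
            if [ -z "${!name:-}" ]; then
                echo "$name"
            fi
        done
    }

    unset_vars=$(report_unset_vars MACA_PATH CUDA_PATH MCCL_IB_HCA \
        MCCL_SOCKET_IFNAME GLOO_SOCKET_IFNAME)
    if [ -n "$unset_vars" ]; then
        # $unset_vars is left unquoted on purpose so the names print on one line.
        echo "Unset variables:" $unset_vars >&2
    fi

    # On Linux, each network interface appears as a directory under /sys/class/net.
    if [ -n "${MCCL_SOCKET_IFNAME:-}" ] && [ ! -d "/sys/class/net/${MCCL_SOCKET_IFNAME}" ]; then
        echo "Interface ${MCCL_SOCKET_IFNAME} not found; adjust the *_IFNAME variables." >&2
    fi

Running ``ls /sys/class/net`` shows the interface names that are actually available on the host.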

Verification with a Quick Start Example
---------------------------------------

To ensure your setup is correct, we recommend performing a quick test run. The following example trains a Qwen2.5-0.5B model on the GSM8k dataset using the GRPO algorithm.

1.  **Prepare the Dataset**
    First, download and preprocess the GSM8k dataset. The provided script will convert it to the Parquet format required by the framework.

.. code-block:: bash

    python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k

2.  **Run the Training Job**
    Next, run the training script below. Ensure you have set the ``VLLM_ATTENTION_BACKEND`` environment variable.

.. code-block:: bash

    #!/usr/bin/env bash

    # --- Experiment and Model Definition ---
    export DATASET=gsm8k
    export ALG=grpo
    export MODEL_NAME=qwen2.5-05b

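    # --- Path Definitions ---
    # NOTE: HOME is overridden so that data and model paths resolve under /data,
    # the directory mounted into the container. The dataset must therefore be at
    # /data/gsm8k; if you preprocessed it into ~/data/gsm8k, move it there or
    # change TRAIN_DATA_PATH and TEST_DATA_PATH accordingly.
    export HOME=/data/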
    export TRAIN_DATA_PATH=$HOME/$DATASET/train.parquet
    export TEST_DATA_PATH=$HOME/$DATASET/test.parquet
    export MODEL_PATH=$HOME/Qwen2.5-0.5B-Instruct

    # Base output paths
    export BASE_CKPT_PATH=ckpts
    export BASE_TENSORBOARD_PATH=tensorboard

    # --- Key Training Hyperparameters ---
    export TRAIN_BATCH_SIZE_PER_NODE=512
    export PPO_MINI_BATCH_SIZE_PER_NODE=256
    export PPO_MICRO_BATCH_SIZE_PER_GPU=8
    export MAX_PROMPT_LENGTH=1024
    export MAX_RESPONSE_LENGTH=2048
    export ROLLOUT_GPU_MEMORY_UTILIZATION=0.4
    export ROLLOUT_TP=2
    export ROLLOUT_N=8
    export SAVE_FREQ=30
    export TEST_FREQ=10
    export TOTAL_EPOCHS=30
    export MAX_CKPT_KEEP=5

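    # --- Multi-node (Multi-machine) distributed training environments ---
    # The network interfaces used by the distributed backends (MCCL_SOCKET_IFNAME,
    # GLOO_SOCKET_IFNAME, SOCKET_NIC) are set in the "mx gpu env" block below;
    # adjust them to match your hosts. start_ray_cluster also reads
    # INTERFACE_NAME, which this script does not set; export it beforehand if
    # Ray needs to bind to a specific interface.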

    # --- Distributed Training & Infrastructure ---
    export N_GPUS_PER_NODE=${N_GPUS_PER_NODE:-8}
    export NNODES=${PET_NNODES:-1}
    export NODE_RANK=${PET_NODE_RANK:-0}
    export MASTER_ADDR=${MASTER_ADDR:-localhost}

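    # --- Output Paths and Experiment Naming ---
    # Define the timestamp here, before it is used in the paths below.
    timestamp=$(date +"%Y%m%d_%H%M%S")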
    export CKPT_PATH=${BASE_CKPT_PATH}/${MODEL_NAME}_${ALG}_${DATASET}_hybrid_${NNODES}nodes
    export PROJECT_NAME=siirl_${DATASET}_${ALG}
    export EXPERIMENT_NAME=siirl_${MODEL_NAME}_${ALG}_${DATASET}_experiment
    export TENSORBOARD_DIR=${BASE_TENSORBOARD_PATH}/${MODEL_NAME}_${ALG}_${DATASET}_hybrid_tensorboard/dlc_${NNODES}_$timestamp
    export SIIRL_LOGGING_FILENAME=${MODEL_NAME}_${ALG}_${DATASET}_hybrid_${NNODES}_$timestamp

    # --- Calculated Global Hyperparameters ---
    export TRAIN_BATCH_SIZE=$(($TRAIN_BATCH_SIZE_PER_NODE * $NNODES))
    export PPO_MINI_BATCH_SIZE=$(($PPO_MINI_BATCH_SIZE_PER_NODE * $NNODES))

    # mx gpu env
    export MACA_PATH=/opt/maca
    export CUCC_PATH=${MACA_PATH}/tools/cu-bridge
    export CUDA_PATH=${CUCC_PATH}
    export MACA_CLANG_PATH=$MACA_PATH/mxgpu_llvm/bin
    export PATH=${CUDA_PATH}/bin:${MACA_CLANG_PATH}:${PATH}
    export LD_LIBRARY_PATH=${MACA_PATH}/tools/cu-bridge/lib/:${MACA_PATH}/lib:${MACA_PATH}/ompi/lib:${MACA_PATH}/mxgpu_llvm/lib:${LD_LIBRARY_PATH}
    export PYTORCH_ENABLE_SAME_RAND_A100=1
    export MCPYTORCH_DISABLE_PRINT=1
    export MAX_JOBS=20
    export VLLM_USE_V1=0
    export MCCL_ENABLE_FC=0

    export MCCL_MAX_NCHANNELS=8
    export PYTHONUNBUFFERED=1
    export MCCL_IB_HCA=mlx5
    export MCCL_SOCKET_IFNAME=ens1
    export GLOO_SOCKET_IFNAME=ens1
    export SOCKET_NIC=ens1

    # --- Define the Training Command and its Arguments ---
    TRAINING_CMD=(
        python3 -m siirl.main_dag
        algorithm.adv_estimator=\$ALG
        data.train_files=\$TRAIN_DATA_PATH
        data.val_files=\$TEST_DATA_PATH
        data.train_batch_size=\$TRAIN_BATCH_SIZE
        data.max_prompt_length=\$MAX_PROMPT_LENGTH
        data.max_response_length=\$MAX_RESPONSE_LENGTH
        data.filter_overlong_prompts=True
        data.truncation='error'
        data.shuffle=False
        actor_rollout_ref.model.path=\$MODEL_PATH
        actor_rollout_ref.actor.optim.lr=1e-6
        actor_rollout_ref.model.use_remove_padding=True
        actor_rollout_ref.model.use_fused_kernels=False
        actor_rollout_ref.actor.policy_drift_coeff=0.001
        actor_rollout_ref.actor.ppo_mini_batch_size=\$PPO_MINI_BATCH_SIZE
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=\$PPO_MICRO_BATCH_SIZE_PER_GPU
        actor_rollout_ref.actor.use_kl_loss=True
        actor_rollout_ref.actor.grad_clip=0.5
        actor_rollout_ref.actor.clip_ratio=0.2
        actor_rollout_ref.actor.kl_loss_coef=0.01
        actor_rollout_ref.actor.kl_loss_type=low_var_kl
        actor_rollout_ref.model.enable_gradient_checkpointing=True
        actor_rollout_ref.actor.fsdp_config.param_offload=True
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=\$PPO_MICRO_BATCH_SIZE_PER_GPU
        actor_rollout_ref.rollout.tensor_model_parallel_size=\$ROLLOUT_TP
        actor_rollout_ref.rollout.name=vllm
        actor_rollout_ref.rollout.gpu_memory_utilization=\$ROLLOUT_GPU_MEMORY_UTILIZATION
        actor_rollout_ref.rollout.max_model_len=\$MAX_RESPONSE_LENGTH
        actor_rollout_ref.rollout.enable_chunked_prefill=False
        actor_rollout_ref.rollout.enforce_eager=False
        actor_rollout_ref.rollout.free_cache_engine=False
        actor_rollout_ref.rollout.n=\$ROLLOUT_N
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=\$PPO_MICRO_BATCH_SIZE_PER_GPU
        actor_rollout_ref.ref.fsdp_config.param_offload=True
        algorithm.weight_factor_in_cpgd='STD_weight'
        algorithm.kl_ctrl.kl_coef=0.001
        trainer.critic_warmup=0
        trainer.logger=['console','tensorboard']
        trainer.project_name=\$PROJECT_NAME
        trainer.experiment_name=\$EXPERIMENT_NAME
        trainer.n_gpus_per_node=\$N_GPUS_PER_NODE
        trainer.nnodes=\$NNODES
        trainer.save_freq=\$SAVE_FREQ
        trainer.test_freq=\$TEST_FREQ
        trainer.total_epochs=\$TOTAL_EPOCHS
        trainer.resume_mode=auto
        trainer.max_actor_ckpt_to_keep=\$MAX_CKPT_KEEP
        trainer.default_local_dir=\$CKPT_PATH
        trainer.val_before_train=False
    )

    # ===================================================================================
    # ===                  MAIN EXECUTION LOGIC & INFRASTRUCTURE                      ===
    # ===================================================================================

    # --- Boilerplate Setup ---
    set -e
    set -o pipefail
    set -x

    # --- Infrastructure & Boilerplate Functions ---
    start_ray_cluster() {
        local RAY_HEAD_WAIT_TIMEOUT=600
        export RAY_RAYLET_NODE_MANAGER_CONFIG_NIC_NAME=${INTERFACE_NAME}
        export RAY_GCS_SERVER_CONFIG_NIC_NAME=${INTERFACE_NAME}
        export RAY_RUNTIME_ENV_AGENT_CREATION_TIMEOUT_S=1200
        export RAY_GCS_RPC_CLIENT_CONNECT_TIMEOUT_S=120

        local ray_start_common_opts=(
            --num-gpus "$N_GPUS_PER_NODE"
            --object-store-memory 100000000000
            --memory 100000000000
        )

        if [ "$NNODES" -gt 1 ]; then
            if [ "$NODE_RANK" = "0" ]; then
                echo "INFO: Starting Ray head node on $(hostname)..."
                export RAY_ADDRESS="$RAY_MASTER_ADDR:$RAY_MASTER_PORT"
                ray start --head --port="$RAY_MASTER_PORT" --dashboard-port="$RAY_DASHBOARD_PORT" "${ray_start_common_opts[@]}" --system-config='{"gcs_server_request_timeout_seconds": 60, "gcs_rpc_server_reconnect_timeout_s": 60}'
                local start_time=$(date +%s)
                while ! ray health-check --address "$RAY_ADDRESS" &>/dev/null; do
                    if [ "$(( $(date +%s) - start_time ))" -ge "$RAY_HEAD_WAIT_TIMEOUT" ]; then echo "ERROR: Timed out waiting for head node. Exiting." >&2; ray stop --force; exit 1; fi
                    echo "Head node not healthy yet. Retrying in 5s..."
                    sleep 5
                done
                echo "INFO: Head node is healthy."
            else
                local head_node_address="$MASTER_ADDR:$RAY_MASTER_PORT"
                echo "INFO: Worker node $(hostname) waiting for head at $head_node_address..."
                local start_time=$(date +%s)
                while ! ray health-check --address "$head_node_address" &>/dev/null; do
                    if [ "$(( $(date +%s) - start_time ))" -ge "$RAY_HEAD_WAIT_TIMEOUT" ]; then echo "ERROR: Timed out waiting for head. Exiting." >&2; exit 1; fi
                    echo "Head not healthy yet. Retrying in 5s..."
                    sleep 5
                done
                echo "INFO: Head is healthy. Worker starting..."
                ray start --address="$head_node_address" "${ray_start_common_opts[@]}"
            fi
        else
            echo "INFO: Starting Ray in single-node mode..."
            ray start --head "${ray_start_common_opts[@]}"
        fi
    }

    # --- Main Execution Function ---
    main() {
        local timestamp=$(date +"%Y%m%d_%H%M%S")
        ray stop --force

        # export VLLM_USE_V1=0
        export GLOO_SOCKET_TIMEOUT=600
        export GLOO_TCP_TIMEOUT=600
        export GLOO_LOG_LEVEL=DEBUG
        export RAY_MASTER_PORT=${RAY_MASTER_PORT:-6379}
        export RAY_DASHBOARD_PORT=${RAY_DASHBOARD_PORT:-8265}
        export RAY_MASTER_ADDR=$MASTER_ADDR
        
        start_ray_cluster

        if [ "$NNODES" -gt 1 ] && [ "$NODE_RANK" = "0" ]; then
            echo "Waiting for all $NNODES nodes to join..."
            local TIMEOUT=600; local start_time=$(date +%s)
            while true; do
                if [ "$(( $(date +%s) - start_time ))" -ge "$TIMEOUT" ]; then echo "Error: Timeout waiting for nodes." >&2; exit 1; fi
                local ready_nodes=$(ray list nodes --format=json | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
                if [ "$ready_nodes" -ge "$NNODES" ]; then break; fi
                echo "Waiting... ($ready_nodes / $NNODES nodes ready)"
                sleep 5
            done
            echo "All $NNODES nodes have joined."
        fi

        if [ "$NODE_RANK" = "0" ]; then
            echo "INFO [RANK 0]: Starting main training command."
            eval "${TRAINING_CMD[@]}" "$@"
            echo "INFO [RANK 0]: Training finished."
            sleep 30; ray stop --force >/dev/null 2>&1
        elif [ "$NNODES" -gt 1 ]; then
            local head_node_address="$MASTER_ADDR:$RAY_MASTER_PORT"
            echo "INFO [RANK $NODE_RANK]: Worker active. Monitoring head node at $head_node_address."
            while ray health-check --address "$head_node_address" &>/dev/null; do sleep 15; done
            echo "INFO [RANK $NODE_RANK]: Head node down. Exiting."
        fi

        echo "INFO: Script finished on rank $NODE_RANK."
    }

    # --- Script Entrypoint ---
    main "$@"
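
To make the batch-size scaling in the script concrete, here is the arithmetic for a hypothetical two-node run. ``NNODES=2`` is an example value, and ``RESPONSES_PER_STEP`` is an illustrative name that does not appear in the script; it assumes ``rollout.n`` is the number of responses sampled per prompt.

.. code-block:: bash

    # Per-node values copied from the script above; NNODES=2 is an example.
    TRAIN_BATCH_SIZE_PER_NODE=512
    PPO_MINI_BATCH_SIZE_PER_NODE=256
    ROLLOUT_N=8
    NNODES=2

    # Same formulas as in the "Calculated Global Hyperparameters" section.
    TRAIN_BATCH_SIZE=$((TRAIN_BATCH_SIZE_PER_NODE * NNODES))
    PPO_MINI_BATCH_SIZE=$((PPO_MINI_BATCH_SIZE_PER_NODE * NNODES))

    # Assumption: every prompt in the batch is sampled ROLLOUT_N times.
    RESPONSES_PER_STEP=$((TRAIN_BATCH_SIZE * ROLLOUT_N))

    echo "train_batch_size=$TRAIN_BATCH_SIZE"        # 1024
    echo "ppo_mini_batch_size=$PPO_MINI_BATCH_SIZE"  # 512
    echo "responses_per_step=$RESPONSES_PER_STEP"    # 8192

With the default single-node setting (``NNODES=1``), the global values equal the per-node values, that is 512 and 256.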