Ascend NPU

SiiRL is also supports for Huawei’s Ascend NPU devices. This guide has been tested with the following hardware: - Atlas 200T A2 Box16

Installation Process

Core Environment Requirements

Ensure your environment meets these core software version requirements:

Software	Version
Python	== 3.10
CANN	== 8.1.RC1
PyTorch	== 2.5.1
torch_npu	== 2.5.1.RC1

Recommended Base Image

For a smoother setup, we strongly recommend using our pre-built Docker image, which includes all necessary dependencies. Please note this pre-built docker image contains torch, torch-npu, vLLM and vLLM-Ascend packages, after pulling it you only need to install siiRL framework from source.

docker pull crispig/verl_npu:cann8.1rc1-py3.10-torch2.5.1-vllm-ascend0.7.3.post1-250616

Compiling vLLM and vllm-ascend [Optional]

Proper integration of vLLM within siiRL requires compiling both vllm and vllm-ascend from source. Follow the steps below, paying close attention to the instructions specific to your hardware.

# vllm
git clone -b v0.7.3 --depth 1 https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt

# For Atlas 200T A2 Box16
VLLM_TARGET_DEVICE=empty pip install -e . --extra-index https://download.pytorch.org/whl/cpu/

# vllm-ascend
git clone -b v0.7.3.post1 --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export COMPILE_CUSTOM_KERNELS=1
python setup.py install

SiiRL Installation

Finally, install the siiRL framework itself. DO NOT use the pip install command to install siiRL, it will cause dependency conflicts.

git clone https://github.com/sii-research/siiRL.git
cd siirl
pip install -e .

Third-Party Library Considerations

Please be aware of the following specific requirements and limitations for certain libraries on Ascend hardware:

Software	Description
transformers	v4.52.4
flash_attn	not supported
liger-kernel	not supported
tensordict	0.8.3 (ARM)

Using –flash_attention_2 through transformers is supported (requires transformers version >= 4.52.0).
Flash Attention acceleration via the flash_attn package is not supported.
liger-kernel is not supported.
For ARM servers, tensordict version 0.8.3 is required. You can manually install it after the main dependencies are installed.
For x86 servers, the CPU version of torchvision must be installed.

pip install torchvision==0.20.1+cpu --index-url https://download.pytorch.org/whl/cpu

Verification with a Quick Start Example

To ensure your setup is correct, we recommend performing a quick test run. The following example trains a Qwen2.5-0.5B model on the GSM8k dataset using the GRPO algorithm.

Prepare the Dataset First, download and preprocess the GSM8k dataset. The provided script will convert it to the Parquet format required by the framework.

python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k

Run the Training Job Next, execute the training command below. Ensure you have set the VLLM_ATTENTION_BACKEND environment variable.

set -x

python3 -m siirl.client.main_dag \
    algorithm.adv_estimator=grpo \
    data.train_files=/datasets/gsm8k/train.parquet\
    data.val_files=/datasets/gsm8k/teset.parquet \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=/models/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=5e-8 \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console'] \
    trainer.project_name='siirl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen2_7b_function_rm' \
    trainer.n_gpus_per_node=16 \
    trainer.nnodes=$NNODES \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=300 \
    trainer.device=npu $@