Ascend NPU
siiRL also supports Huawei’s Ascend NPU devices. This guide has been tested on the following hardware:
- Atlas 200T A2 Box16
Installation Process
Core Environment Requirements
Ensure your environment meets these core software version requirements:
| Software | Version |
|---|---|
| Python | == 3.10 |
| CANN | == 8.1.RC1 |
| PyTorch | == 2.5.1 |
| torch_npu | == 2.5.1.RC1 |
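The Python-side requirements in the table can be checked with a short script. The following is a minimal sketch, not part of siiRL: the helper names (version_matches, dist_version, report) are hypothetical, and it assumes python3 is on the PATH and that the packages are installed under the distribution names torch and torch-npu. CANN is left out because its install location varies between systems.

```shell
# Sketch: report installed versions next to the versions required by the table above.

# Succeeds when $1 equals $2, or extends it with a ".x" or "+local" suffix
# (so 3.10.14 satisfies 3.10, and 2.5.1+cpu satisfies 2.5.1).
version_matches() {
    case "$1" in
        "$2" | "$2".* | "$2"+*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the installed version of a Python distribution, or "not installed".
dist_version() {
    python3 -c 'import sys; from importlib.metadata import version; print(version(sys.argv[1]))' "$1" 2>/dev/null \
        || echo "not installed"
}

# Compare one component ($1 name, $2 actual, $3 expected) and print a one-line verdict.
report() {
    if version_matches "$2" "$3"; then
        echo "OK        $1 $2"
    else
        echo "MISMATCH  $1 $2 (expected $3)"
    fi
}

python_version="$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null || echo "not installed")"
report "Python" "$python_version" "3.10"
report "torch" "$(dist_version torch)" "2.5.1"
# The torch-npu pre-release tag may be spelled differently from the table
# (for example rc1 instead of .RC1), so it is printed rather than compared.
echo "INFO      torch-npu $(dist_version torch-npu)"
```

Any MISMATCH line points at a component to reinstall before continuing; the torch-npu line needs to be compared with the table by eye.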
Recommended Base Image
For a smoother setup, we strongly recommend using our pre-built Docker image, which includes all necessary dependencies. This image already contains the torch, torch-npu, vLLM, and vLLM-Ascend packages, so after pulling it you only need to install the siiRL framework from source.
docker pull crispig/verl_npu:cann8.1rc1-py3.10-torch2.5.1-vllm-ascend0.7.3.post1-250616
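Once the image is pulled, the container needs access to the host's NPU devices and driver files. The sketch below only prints a candidate docker run command so it can be reviewed before use. The device nodes and driver paths are assumptions based on a default Ascend driver installation on the host, not something this guide specifies, and docker_run_command and NUM_NPUS are hypothetical names; adjust all of them to your system.

```shell
# Sketch: print a `docker run` command that exposes the host's Ascend devices.
# All /dev and /usr/local paths below are ASSUMPTIONS (default driver install).

IMAGE="crispig/verl_npu:cann8.1rc1-py3.10-torch2.5.1-vllm-ascend0.7.3.post1-250616"

# $1 = image name, $2 = number of NPUs to expose (/dev/davinci0 .. /dev/davinciN-1)
docker_run_command() {
    cmd="docker run -it --rm"
    i=0
    while [ "$i" -lt "$2" ]; do
        cmd="$cmd --device /dev/davinci$i"
        i=$((i + 1))
    done
    # Management devices shared by all NPUs.
    for dev in davinci_manager devmm_svm hisi_hdc; do
        cmd="$cmd --device /dev/$dev"
    done
    # Driver libraries and tools mounted from the host at the same paths.
    for path in /usr/local/dcmi /usr/local/bin/npu-smi \
                /usr/local/Ascend/driver/lib64 \
                /usr/local/Ascend/driver/version.info \
                /etc/ascend_install.info; do
        cmd="$cmd -v $path:$path"
    done
    echo "$cmd $1 bash"
}

# The quick start example later in this guide uses 16 NPUs per node.
docker_run_command "$IMAGE" "${NUM_NPUS:-16}"
```

Printing instead of executing keeps the sketch harmless on machines without NPUs; copy the printed command into a shell once the paths have been checked against the host.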
Compiling vLLM and vllm-ascend [Optional]
If you are not using the pre-built image, integrating vLLM with siiRL requires compiling both vllm and vllm-ascend from source. Follow the steps below, paying close attention to the instructions specific to your hardware.
# vllm
git clone -b v0.7.3 --depth 1 https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt
# For Atlas 200T A2 Box16
VLLM_TARGET_DEVICE=empty pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu/
# Return to the parent directory so vllm-ascend is cloned next to vllm, not inside it
cd ..
# vllm-ascend
git clone -b v0.7.3.post1 --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export COMPILE_CUSTOM_KERNELS=1
python setup.py install
siiRL Installation
Finally, install the siiRL framework itself. Do not install a released siiRL package directly with pip, because that causes dependency conflicts; install it from a source checkout instead, as shown here.
git clone https://github.com/sii-research/siiRL.git
cd siiRL
pip install -e .
Third-Party Library Considerations
Please be aware of the following specific requirements and limitations for certain libraries on Ascend hardware:
| Software | Description |
|---|---|
| transformers | v4.52.4 |
| flash_attn | not supported |
| liger-kernel | not supported |
| tensordict | 0.8.3 (ARM) |
Using --flash_attention_2 through transformers is supported (requires transformers version >= 4.52.0).
Flash Attention acceleration via the flash_attn package is not supported.
liger-kernel is not supported.
For ARM servers, tensordict version 0.8.3 is required. You can manually install it after the main dependencies are installed.
For x86 servers, the CPU version of torchvision must be installed.
pip install torchvision==0.20.1+cpu --index-url https://download.pytorch.org/whl/cpu
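The two architecture-specific notes above can be folded into one helper. This is a minimal sketch with a hypothetical function name (extra_package_command); it only prints the command suggested for the current machine, based on the output of uname -m, and leaves running it to you.

```shell
# Sketch: choose the architecture-specific extra package described above.
# $1 = machine architecture as reported by `uname -m`
extra_package_command() {
    case "$1" in
        aarch64 | arm64)
            echo "pip install tensordict==0.8.3" ;;
        x86_64 | amd64)
            echo "pip install torchvision==0.20.1+cpu --index-url https://download.pytorch.org/whl/cpu" ;;
        *)
            echo "unsupported architecture: $1" >&2
            return 1 ;;
    esac
}

if cmd="$(extra_package_command "$(uname -m)")"; then
    echo "Run after the main dependencies are installed: $cmd"
fi
```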
Verification with a Quick Start Example
To ensure your setup is correct, we recommend performing a quick test run. The following example trains the Qwen2.5-7B-Instruct model on the GSM8k dataset using the GRPO algorithm.
Prepare the Dataset
First, download and preprocess the GSM8k dataset. The provided script converts it to the Parquet format required by the framework.
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
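Run the Training Job
Next, execute the training command below. Ensure you have set the VLLM_ATTENTION_BACKEND environment variable, and that NNODES holds the number of nodes, since the command reads it. The command expects the dataset under /datasets/gsm8k and the model under /models/Qwen2.5-7B-Instruct; if your files live elsewhere (the preprocessing command above writes to ~/data/gsm8k), adjust data.train_files, data.val_files, and actor_rollout_ref.model.path accordingly.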
set -x
python3 -m siirl.client.main_dag \
algorithm.adv_estimator=grpo \
data.train_files=/datasets/gsm8k/train.parquet \
data.val_files=/datasets/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=/models/Qwen2.5-7B-Instruct \
actor_rollout_ref.actor.optim.lr=5e-8 \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='siirl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=$NNODES \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=300 \
trainer.device=npu "$@"
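Because the command above depends on two environment variables and on the dataset files being in place, a pre-flight check at the top of the launch script can catch a missing piece before the job starts. The following is a sketch under stated assumptions: require_env and require_files are hypothetical helpers, and the dataset paths are the ones used in the command, with the validation file assumed to be named test.parquet.

```shell
# Sketch: pre-flight checks to run before launching the training command.

# Fail if any of the named environment variables is unset or empty.
require_env() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"   # indirect lookup of the variable called $name
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Fail if any of the given files does not exist.
require_files() {
    missing=0
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing file: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

if require_env VLLM_ATTENTION_BACKEND NNODES \
    && require_files /datasets/gsm8k/train.parquet /datasets/gsm8k/test.parquet; then
    echo "pre-flight checks passed"
else
    echo "pre-flight checks failed; fix the items above before launching" >&2
fi
```

The sketch only reports problems; in a real launch script you would likely exit on failure so the training command never runs with incomplete settings.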