Configuration Guide

siiRL uses Hydra-based configuration management with dataclass parameters. All configuration parameters are defined in the siirl/params/ directory and can be set via command-line arguments.

Configuration Structure

Parameters are organized into the following modules:

  • DataArguments: Data-related parameters (siirl/params/data_args.py)

  • ActorRolloutRefArguments: Actor, Rollout, and Reference model parameters (siirl/params/model_args.py)

  • CriticArguments: Critic model parameters (siirl/params/model_args.py)

  • RewardModelArguments: Reward model parameters (siirl/params/model_args.py)

  • AlgorithmArguments: RL algorithm parameters (siirl/params/model_args.py)

  • TrainingArguments: Training configuration (siirl/params/training_args.py)

  • DAGArguments: DAG workflow parameters (siirl/params/dag_args.py)

  • ProfilerArguments: Profiling parameters (siirl/params/profiler_args.py)

All parameters are combined into the SiiRLArguments class.

Usage

Parameters are set via command-line arguments using dot notation:

python -m siirl.main_dag \
  data.train_files=/path/to/train.parquet \
  data.train_batch_size=512 \
  actor_rollout_ref.model.path=/path/to/model \
  algorithm.adv_estimator=grpo \
  trainer.total_epochs=30

Data Parameters

Location: siirl/params/data_args.py

data.tokenizer=null
data.train_files=/path/to/train.parquet
data.val_files=/path/to/val.parquet
data.prompt_key=prompt
data.max_prompt_length=512
data.max_response_length=512
data.train_batch_size=1024
data.return_raw_input_ids=False
data.return_raw_chat=False
data.return_full_prompt=False
data.shuffle=True
data.filter_overlong_prompts=False
data.filter_overlong_prompts_workers=1
data.truncation=error
data.image_key=images
data.trust_remote_code=True

Key Parameters:

  • data.train_files: Training data file path (Parquet format, can be list or single file)

  • data.val_files: Validation data file path

  • data.prompt_key: Field name for prompt in dataset (default: “prompt”)

  • data.max_prompt_length: Maximum prompt length (left-padded)

  • data.max_response_length: Maximum response length for rollout generation

  • data.train_batch_size: Training batch size per iteration

  • data.return_raw_input_ids: Return original input_ids without chat template (for different RM chat templates)

  • data.shuffle: Whether to shuffle data

  • data.truncation: Truncation strategy (“error”, “left”, “right”, “middle”)

  • data.trust_remote_code: Allow remote code execution for tokenizers

Custom Dataset

data.custom_cls.path=/path/to/custom_dataset.py
data.custom_cls.name=MyDatasetClass
  • data.custom_cls.path: Path to custom dataset class file

  • data.custom_cls.name: Name of the dataset class

Actor/Rollout/Reference Model

Location: siirl/params/model_args.py

Model Configuration

actor_rollout_ref.hybrid_engine=True
actor_rollout_ref.model.path=/path/to/model
actor_rollout_ref.model.external_lib=null
actor_rollout_ref.model.enable_gradient_checkpointing=False
actor_rollout_ref.model.enable_activation_offload=False
actor_rollout_ref.model.trust_remote_code=False
actor_rollout_ref.model.use_remove_padding=False
  • actor_rollout_ref.model.path: Huggingface model path (local or HDFS)

  • actor_rollout_ref.model.external_lib: Additional Python packages to import

  • actor_rollout_ref.model.enable_gradient_checkpointing: Enable gradient checkpointing

  • actor_rollout_ref.model.enable_activation_offload: Enable activation offloading

  • actor_rollout_ref.model.trust_remote_code: Allow remote code model loading

  • actor_rollout_ref.model.use_remove_padding: Remove padding tokens for efficiency

Actor Configuration

actor_rollout_ref.actor.strategy=fsdp
actor_rollout_ref.actor.ppo_mini_batch_size=256
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
actor_rollout_ref.actor.grad_clip=1.0
actor_rollout_ref.actor.clip_ratio=0.2
actor_rollout_ref.actor.entropy_coeff=0.0
actor_rollout_ref.actor.use_kl_loss=False
actor_rollout_ref.actor.kl_loss_coef=0.001
actor_rollout_ref.actor.ppo_epochs=1
actor_rollout_ref.actor.optim.lr=1e-6
  • actor.strategy: Backend strategy (“fsdp” or “megatron”)

  • actor.ppo_mini_batch_size: Mini-batch size for PPO updates (global across GPUs)

  • actor.ppo_micro_batch_size_per_gpu: Micro-batch size per GPU (gradient accumulation)

  • actor.grad_clip: Gradient clipping threshold

  • actor.clip_ratio: PPO clip ratio

  • actor.use_kl_loss: Enable KL loss in actor

  • actor.kl_loss_coef: KL loss coefficient (for GRPO)

  • actor.optim.lr: Learning rate

Reference Model

actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16
actor_rollout_ref.ref.fsdp_config.param_offload=False
  • ref.log_prob_micro_batch_size_per_gpu: Micro-batch size for reference log prob computation

  • ref.fsdp_config.param_offload: Enable parameter offloading (recommended for models > 7B)

Rollout Configuration

actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.temperature=1.0
actor_rollout_ref.rollout.top_k=-1
actor_rollout_ref.rollout.top_p=1.0
actor_rollout_ref.rollout.tensor_model_parallel_size=2
actor_rollout_ref.rollout.gpu_memory_utilization=0.5
actor_rollout_ref.rollout.n=8
  • rollout.name: Rollout backend (“vllm”, “sglang”, “hf”)

  • rollout.temperature: Sampling temperature

  • rollout.top_k: Top-k sampling (-1 for vLLM, 0 for HF)

  • rollout.top_p: Top-p sampling

  • rollout.tensor_model_parallel_size: Tensor parallelism size (vLLM only)

  • rollout.gpu_memory_utilization: GPU memory fraction for vLLM

  • rollout.n: Number of responses per prompt (>1 for GRPO/RLOO)

Critic Model

Location: siirl/params/model_args.py

critic.enable=True
critic.model.path=/path/to/critic_model
critic.ppo_mini_batch_size=256
critic.ppo_micro_batch_size_per_gpu=8
critic.optim.lr=1e-5

Most parameters are similar to Actor configuration.

Reward Model

Location: siirl/params/model_args.py

reward_model.enable=False
reward_model.model.path=/path/to/reward_model
reward_model.model.input_tokenizer=null
reward_model.micro_batch_size_per_gpu=16
reward_model.reward_manager=naive
  • reward_model.enable: Enable reward model (False = use only custom reward functions)

  • reward_model.model.input_tokenizer: Input tokenizer path (if different from policy)

  • reward_model.reward_manager: Reward manager type (“naive”, “batch”, “parallel”, “dapo”, “embodied”)

Custom Reward Function

custom_reward_function.path=/path/to/my_reward.py
custom_reward_function.name=compute_score
  • custom_reward_function.path: Path to custom reward function file

  • custom_reward_function.name: Function name (default: “compute_score”)

See Reward Interface for details.

Algorithm Parameters

Location: siirl/params/model_args.py

algorithm.gamma=1.0
algorithm.lam=1.0
algorithm.adv_estimator=grpo
algorithm.use_kl_in_reward=False
algorithm.kl_penalty=kl
algorithm.kl_ctrl.type=fixed
algorithm.kl_ctrl.kl_coef=0.005
algorithm.workflow_type=DEFAULT
  • algorithm.gamma: Discount factor

  • algorithm.lam: GAE lambda (bias-variance tradeoff)

  • algorithm.adv_estimator: Advantage estimator (“gae”, “grpo”, “cpgd”, “gspo”, “rloo”)

  • algorithm.use_kl_in_reward: Enable KL penalty in reward

  • algorithm.kl_penalty: KL divergence calculation method (“kl”, “abs”, “mse”, “low_var_kl”, “full”)

  • algorithm.workflow_type: Workflow type (“DEFAULT”, “DAPO”, “EMBODIED”)

Training Parameters

Location: siirl/params/training_args.py

trainer.total_epochs=30
trainer.project_name=siirl_examples
trainer.experiment_name=gsm8k
trainer.logger=['console', 'wandb']
trainer.nnodes=1
trainer.n_gpus_per_node=8
trainer.save_freq=10
trainer.val_before_train=True
trainer.test_freq=2
  • trainer.total_epochs: Number of training epochs

  • trainer.project_name: Project name (for logging)

  • trainer.experiment_name: Experiment name (for logging)

  • trainer.logger: Logger types ([“console”, “wandb”, “tensorboard”, “mlflow”])

  • trainer.nnodes: Number of nodes

  • trainer.n_gpus_per_node: Number of GPUs per node

  • trainer.save_freq: Checkpoint saving frequency (by iteration)

  • trainer.val_before_train: Run validation before training

  • trainer.test_freq: Validation frequency (by iteration)

DAG Parameters

Location: siirl/params/dag_args.py

dag.custom_pipeline_fn=null
  • dag.custom_pipeline_fn: Custom pipeline function path (e.g., “module:function”)

See Pipeline API for custom pipeline details.

Complete Example

GRPO Training

python -m siirl.main_dag \
  algorithm.adv_estimator=grpo \
  algorithm.workflow_type=DEFAULT \
  data.train_files=/path/to/gsm8k/train.parquet \
  data.train_batch_size=512 \
  data.max_prompt_length=2048 \
  data.max_response_length=4096 \
  actor_rollout_ref.model.path=/path/to/model \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.actor.ppo_mini_batch_size=256 \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
  actor_rollout_ref.rollout.n=8 \
  custom_reward_function.path=siirl/user_interface/rewards_interface/custom_gsm8k_reward.py \
  custom_reward_function.name=compute_score \
  trainer.total_epochs=30 \
  trainer.n_gpus_per_node=8 \
  trainer.save_freq=10

PPO Training

python -m siirl.main_dag \
  algorithm.adv_estimator=gae \
  critic.enable=True \
  data.train_files=/path/to/data.parquet \
  actor_rollout_ref.model.path=/path/to/model \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.rollout.name=vllm \
  critic.optim.lr=1e-5 \
  trainer.total_epochs=30

DAPO Training

python -m siirl.main_dag \
  algorithm.workflow_type=DAPO \
  algorithm.adv_estimator=grpo \
  algorithm.filter_groups.enable=True \
  algorithm.filter_groups.metric=seq_final_reward \
  data.train_files=/path/to/data.parquet \
  actor_rollout_ref.model.path=/path/to/model \
  trainer.total_epochs=30

Parameter Reference

For the complete parameter definitions, see:

  • siirl/params/data_args.py - Data parameters

  • siirl/params/model_args.py - Model, algorithm parameters

  • siirl/params/training_args.py - Training parameters

  • siirl/params/dag_args.py - DAG workflow parameters

  • siirl/params/profiler_args.py - Profiler parameters