Configuration Guide
siiRL uses Hydra-based configuration management with dataclass parameters. All configuration parameters are defined in the siirl/params/ directory and can be set via command-line arguments.
Configuration Structure
Parameters are organized into the following modules:
DataArguments: Data-related parameters (siirl/params/data_args.py)ActorRolloutRefArguments: Actor, Rollout, and Reference model parameters (siirl/params/model_args.py)CriticArguments: Critic model parameters (siirl/params/model_args.py)RewardModelArguments: Reward model parameters (siirl/params/model_args.py)AlgorithmArguments: RL algorithm parameters (siirl/params/model_args.py)TrainingArguments: Training configuration (siirl/params/training_args.py)DAGArguments: DAG workflow parameters (siirl/params/dag_args.py)ProfilerArguments: Profiling parameters (siirl/params/profiler_args.py)
All parameters are combined into the SiiRLArguments class.
Usage
Parameters are set via command-line arguments using dot notation:
python -m siirl.main_dag \
data.train_files=/path/to/train.parquet \
data.train_batch_size=512 \
actor_rollout_ref.model.path=/path/to/model \
algorithm.adv_estimator=grpo \
trainer.total_epochs=30
Data Parameters
Location: siirl/params/data_args.py
data.tokenizer=null
data.train_files=/path/to/train.parquet
data.val_files=/path/to/val.parquet
data.prompt_key=prompt
data.max_prompt_length=512
data.max_response_length=512
data.train_batch_size=1024
data.return_raw_input_ids=False
data.return_raw_chat=False
data.return_full_prompt=False
data.shuffle=True
data.filter_overlong_prompts=False
data.filter_overlong_prompts_workers=1
data.truncation=error
data.image_key=images
data.trust_remote_code=True
Key Parameters:
data.train_files: Training data file path (Parquet format, can be list or single file)data.val_files: Validation data file pathdata.prompt_key: Field name for prompt in dataset (default: “prompt”)data.max_prompt_length: Maximum prompt length (left-padded)data.max_response_length: Maximum response length for rollout generationdata.train_batch_size: Training batch size per iterationdata.return_raw_input_ids: Return original input_ids without chat template (for different RM chat templates)data.shuffle: Whether to shuffle datadata.truncation: Truncation strategy (“error”, “left”, “right”, “middle”)data.trust_remote_code: Allow remote code execution for tokenizers
Custom Dataset
data.custom_cls.path=/path/to/custom_dataset.py
data.custom_cls.name=MyDatasetClass
data.custom_cls.path: Path to custom dataset class filedata.custom_cls.name: Name of the dataset class
Actor/Rollout/Reference Model
Location: siirl/params/model_args.py
Model Configuration
actor_rollout_ref.hybrid_engine=True
actor_rollout_ref.model.path=/path/to/model
actor_rollout_ref.model.external_lib=null
actor_rollout_ref.model.enable_gradient_checkpointing=False
actor_rollout_ref.model.enable_activation_offload=False
actor_rollout_ref.model.trust_remote_code=False
actor_rollout_ref.model.use_remove_padding=False
actor_rollout_ref.model.path: Huggingface model path (local or HDFS)actor_rollout_ref.model.external_lib: Additional Python packages to importactor_rollout_ref.model.enable_gradient_checkpointing: Enable gradient checkpointingactor_rollout_ref.model.enable_activation_offload: Enable activation offloadingactor_rollout_ref.model.trust_remote_code: Allow remote code model loadingactor_rollout_ref.model.use_remove_padding: Remove padding tokens for efficiency
Actor Configuration
actor_rollout_ref.actor.strategy=fsdp
actor_rollout_ref.actor.ppo_mini_batch_size=256
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
actor_rollout_ref.actor.grad_clip=1.0
actor_rollout_ref.actor.clip_ratio=0.2
actor_rollout_ref.actor.entropy_coeff=0.0
actor_rollout_ref.actor.use_kl_loss=False
actor_rollout_ref.actor.kl_loss_coef=0.001
actor_rollout_ref.actor.ppo_epochs=1
actor_rollout_ref.actor.optim.lr=1e-6
actor.strategy: Backend strategy (“fsdp” or “megatron”)actor.ppo_mini_batch_size: Mini-batch size for PPO updates (global across GPUs)actor.ppo_micro_batch_size_per_gpu: Micro-batch size per GPU (gradient accumulation)actor.grad_clip: Gradient clipping thresholdactor.clip_ratio: PPO clip ratioactor.use_kl_loss: Enable KL loss in actoractor.kl_loss_coef: KL loss coefficient (for GRPO)actor.optim.lr: Learning rate
Reference Model
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16
actor_rollout_ref.ref.fsdp_config.param_offload=False
ref.log_prob_micro_batch_size_per_gpu: Micro-batch size for reference log prob computationref.fsdp_config.param_offload: Enable parameter offloading (recommended for models > 7B)
Rollout Configuration
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.temperature=1.0
actor_rollout_ref.rollout.top_k=-1
actor_rollout_ref.rollout.top_p=1.0
actor_rollout_ref.rollout.tensor_model_parallel_size=2
actor_rollout_ref.rollout.gpu_memory_utilization=0.5
actor_rollout_ref.rollout.n=8
rollout.name: Rollout backend (“vllm”, “sglang”, “hf”)rollout.temperature: Sampling temperaturerollout.top_k: Top-k sampling (-1 for vLLM, 0 for HF)rollout.top_p: Top-p samplingrollout.tensor_model_parallel_size: Tensor parallelism size (vLLM only)rollout.gpu_memory_utilization: GPU memory fraction for vLLMrollout.n: Number of responses per prompt (>1 for GRPO/RLOO)
Critic Model
Location: siirl/params/model_args.py
critic.enable=True
critic.model.path=/path/to/critic_model
critic.ppo_mini_batch_size=256
critic.ppo_micro_batch_size_per_gpu=8
critic.optim.lr=1e-5
Most parameters are similar to Actor configuration.
Reward Model
Location: siirl/params/model_args.py
reward_model.enable=False
reward_model.model.path=/path/to/reward_model
reward_model.model.input_tokenizer=null
reward_model.micro_batch_size_per_gpu=16
reward_model.reward_manager=naive
reward_model.enable: Enable reward model (False = use only custom reward functions)reward_model.model.input_tokenizer: Input tokenizer path (if different from policy)reward_model.reward_manager: Reward manager type (“naive”, “batch”, “parallel”, “dapo”, “embodied”)
Custom Reward Function
custom_reward_function.path=/path/to/my_reward.py
custom_reward_function.name=compute_score
custom_reward_function.path: Path to custom reward function filecustom_reward_function.name: Function name (default: “compute_score”)
See Reward Interface for details.
Algorithm Parameters
Location: siirl/params/model_args.py
algorithm.gamma=1.0
algorithm.lam=1.0
algorithm.adv_estimator=grpo
algorithm.use_kl_in_reward=False
algorithm.kl_penalty=kl
algorithm.kl_ctrl.type=fixed
algorithm.kl_ctrl.kl_coef=0.005
algorithm.workflow_type=DEFAULT
algorithm.gamma: Discount factoralgorithm.lam: GAE lambda (bias-variance tradeoff)algorithm.adv_estimator: Advantage estimator (“gae”, “grpo”, “cpgd”, “gspo”, “rloo”)algorithm.use_kl_in_reward: Enable KL penalty in rewardalgorithm.kl_penalty: KL divergence calculation method (“kl”, “abs”, “mse”, “low_var_kl”, “full”)algorithm.workflow_type: Workflow type (“DEFAULT”, “DAPO”, “EMBODIED”)
Training Parameters
Location: siirl/params/training_args.py
trainer.total_epochs=30
trainer.project_name=siirl_examples
trainer.experiment_name=gsm8k
trainer.logger=['console', 'wandb']
trainer.nnodes=1
trainer.n_gpus_per_node=8
trainer.save_freq=10
trainer.val_before_train=True
trainer.test_freq=2
trainer.total_epochs: Number of training epochstrainer.project_name: Project name (for logging)trainer.experiment_name: Experiment name (for logging)trainer.logger: Logger types ([“console”, “wandb”, “tensorboard”, “mlflow”])trainer.nnodes: Number of nodestrainer.n_gpus_per_node: Number of GPUs per nodetrainer.save_freq: Checkpoint saving frequency (by iteration)trainer.val_before_train: Run validation before trainingtrainer.test_freq: Validation frequency (by iteration)
DAG Parameters
Location: siirl/params/dag_args.py
dag.custom_pipeline_fn=null
dag.custom_pipeline_fn: Custom pipeline function path (e.g., “module:function”)
See Pipeline API for custom pipeline details.
Complete Example
GRPO Training
python -m siirl.main_dag \
algorithm.adv_estimator=grpo \
algorithm.workflow_type=DEFAULT \
data.train_files=/path/to/gsm8k/train.parquet \
data.train_batch_size=512 \
data.max_prompt_length=2048 \
data.max_response_length=4096 \
actor_rollout_ref.model.path=/path/to/model \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.n=8 \
custom_reward_function.path=siirl/user_interface/rewards_interface/custom_gsm8k_reward.py \
custom_reward_function.name=compute_score \
trainer.total_epochs=30 \
trainer.n_gpus_per_node=8 \
trainer.save_freq=10
PPO Training
python -m siirl.main_dag \
algorithm.adv_estimator=gae \
critic.enable=True \
data.train_files=/path/to/data.parquet \
actor_rollout_ref.model.path=/path/to/model \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.rollout.name=vllm \
critic.optim.lr=1e-5 \
trainer.total_epochs=30
DAPO Training
python -m siirl.main_dag \
algorithm.workflow_type=DAPO \
algorithm.adv_estimator=grpo \
algorithm.filter_groups.enable=True \
algorithm.filter_groups.metric=seq_final_reward \
data.train_files=/path/to/data.parquet \
actor_rollout_ref.model.path=/path/to/model \
trainer.total_epochs=30
Parameter Reference
For the complete parameter definitions, see:
siirl/params/data_args.py- Data parameterssiirl/params/model_args.py- Model, algorithm parameterssiirl/params/training_args.py- Training parameterssiirl/params/dag_args.py- DAG workflow parameterssiirl/params/profiler_args.py- Profiler parameters