.. _config-explain-page:

===================
Configuration Guide
===================

siiRL manages its configuration with Hydra, and every parameter is declared as a dataclass field. The parameter definitions live in the ``siirl/params/`` directory, and any of them can be set from the command line.

Configuration Structure
-----------------------

Parameters are organized into the following modules:

- ``DataArguments``: Data-related parameters (``siirl/params/data_args.py``)
- ``ActorRolloutRefArguments``: Actor, Rollout, and Reference model parameters (``siirl/params/model_args.py``)
- ``CriticArguments``: Critic model parameters (``siirl/params/model_args.py``)
- ``RewardModelArguments``: Reward model parameters (``siirl/params/model_args.py``)
- ``AlgorithmArguments``: RL algorithm parameters (``siirl/params/model_args.py``)
- ``TrainingArguments``: Training configuration (``siirl/params/training_args.py``)
- ``DAGArguments``: DAG workflow parameters (``siirl/params/dag_args.py``)
- ``ProfilerArguments``: Profiling parameters (``siirl/params/profiler_args.py``)

All parameters are combined into the ``SiiRLArguments`` class.
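
The grouping above can be pictured as nested dataclasses. The following is a minimal sketch, not the actual siiRL definitions: ``DataArgs``, ``TrainerArgs``, and ``TopLevelArgs`` are hypothetical stand-ins that carry only a few fields, with values copied from the listings in this guide.

.. code-block:: python

   from dataclasses import asdict, dataclass, field


   @dataclass
   class DataArgs:
       # Hypothetical subset of the data parameters, for illustration only.
       train_batch_size: int = 1024
       max_prompt_length: int = 512


   @dataclass
   class TrainerArgs:
       # Hypothetical subset of the trainer parameters.
       total_epochs: int = 30
       n_gpus_per_node: int = 8


   @dataclass
   class TopLevelArgs:
       # Each parameter group becomes one attribute, which is what makes
       # dotted keys such as "data.train_batch_size" meaningful.
       data: DataArgs = field(default_factory=DataArgs)
       trainer: TrainerArgs = field(default_factory=TrainerArgs)


   args = TopLevelArgs()
   print(args.data.train_batch_size)
   print(asdict(args)["trainer"])

The real field names and defaults should be taken from the files under ``siirl/params/``.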

Usage
-----

Parameters are set via command-line arguments using dot notation:

.. code-block:: bash

   python -m siirl.main_dag \
     data.train_files=/path/to/train.parquet \
     data.train_batch_size=512 \
     actor_rollout_ref.model.path=/path/to/model \
     algorithm.adv_estimator=grpo \
     trainer.total_epochs=30
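
To make the dot notation concrete, the sketch below shows how ``key.subkey=value`` strings map onto a nested structure. It is a simplified stand-in and not Hydra's implementation: the helper ``apply_overrides`` is hypothetical, and unlike Hydra it keeps every value as a string and performs no validation.

.. code-block:: python

   def apply_overrides(config: dict, overrides: list) -> dict:
       """Apply "a.b.c=value" strings to a nested dict (values stay strings)."""
       for item in overrides:
           key, _, value = item.partition("=")
           *parents, leaf = key.split(".")
           node = config
           for name in parents:
               # Create intermediate levels on demand.
               node = node.setdefault(name, {})
           node[leaf] = value
       return config


   cfg = apply_overrides({}, [
       "data.train_batch_size=512",
       "actor_rollout_ref.model.path=/path/to/model",
       "algorithm.adv_estimator=grpo",
   ])
   print(cfg["data"]["train_batch_size"])
   print(cfg["actor_rollout_ref"]["model"]["path"])

Each dot descends one level, so ``actor_rollout_ref.model.path`` addresses the ``path`` field of the ``model`` group inside ``actor_rollout_ref``.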

Data Parameters
---------------

Location: ``siirl/params/data_args.py``

.. code-block:: bash

   data.tokenizer=null
   data.train_files=/path/to/train.parquet
   data.val_files=/path/to/val.parquet
   data.prompt_key=prompt
   data.max_prompt_length=512
   data.max_response_length=512
   data.train_batch_size=1024
   data.return_raw_input_ids=False
   data.return_raw_chat=False
   data.return_full_prompt=False
   data.shuffle=True
   data.filter_overlong_prompts=False
   data.filter_overlong_prompts_workers=1
   data.truncation=error
   data.image_key=images
   data.trust_remote_code=True

**Key Parameters:**

- ``data.train_files``: Path to the training data in Parquet format; accepts a single file or a list of files
- ``data.val_files``: Path to the validation data file
- ``data.prompt_key``: Field name for prompt in dataset (default: "prompt")
- ``data.max_prompt_length``: Maximum prompt length; shorter prompts are left-padded to this length
- ``data.max_response_length``: Maximum response length for rollout generation
- ``data.train_batch_size``: Training batch size per iteration
- ``data.return_raw_input_ids``: Return the original ``input_ids`` without the chat template applied (needed when the reward model uses a different chat template)
- ``data.shuffle``: Whether to shuffle data
- ``data.truncation``: Truncation strategy ("error", "left", "right", "middle")
- ``data.trust_remote_code``: Allow remote code execution for tokenizers
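
The four ``data.truncation`` strategies can be sketched as follows. This is an illustration under stated assumptions, not the siiRL implementation: it assumes that ``left`` drops tokens from the start of the prompt, ``right`` drops them from the end, and ``middle`` keeps the head and the tail. The ``truncate`` helper is hypothetical.

.. code-block:: python

   def truncate(tokens: list, max_length: int, mode: str = "error") -> list:
       """Shorten a token list to max_length using the given strategy."""
       if max_length <= 0:
           raise ValueError("max_length must be positive")
       if len(tokens) <= max_length:
           return tokens
       if mode == "error":
           raise ValueError(
               f"prompt has {len(tokens)} tokens, limit is {max_length}"
           )
       if mode == "left":
           return tokens[-max_length:]  # assumed: drop tokens from the start
       if mode == "right":
           return tokens[:max_length]  # assumed: drop tokens from the end
       if mode == "middle":
           head = max_length // 2
           tail = max_length - head
           return tokens[:head] + tokens[-tail:]  # assumed: drop the middle
       raise ValueError(f"unknown truncation mode: {mode}")


   tokens = list(range(10))
   print(truncate(tokens, 4, "left"))    # [6, 7, 8, 9]
   print(truncate(tokens, 4, "right"))   # [0, 1, 2, 3]
   print(truncate(tokens, 4, "middle"))  # [0, 1, 8, 9]

With the ``error`` setting shown in the listing, an overlong prompt stops the run instead of being shortened silently.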

Custom Dataset
~~~~~~~~~~~~~~

.. code-block:: bash

   data.custom_cls.path=/path/to/custom_dataset.py
   data.custom_cls.name=MyDatasetClass

- ``data.custom_cls.path``: Path to custom dataset class file
- ``data.custom_cls.name``: Name of the dataset class

Actor/Rollout/Reference Model
------------------------------

Location: ``siirl/params/model_args.py``

Model Configuration
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   actor_rollout_ref.hybrid_engine=True
   actor_rollout_ref.model.path=/path/to/model
   actor_rollout_ref.model.external_lib=null
   actor_rollout_ref.model.enable_gradient_checkpointing=False
   actor_rollout_ref.model.enable_activation_offload=False
   actor_rollout_ref.model.trust_remote_code=False
   actor_rollout_ref.model.use_remove_padding=False

- ``actor_rollout_ref.model.path``: Hugging Face model path (local or HDFS)
- ``actor_rollout_ref.model.external_lib``: Additional Python packages to import
- ``actor_rollout_ref.model.enable_gradient_checkpointing``: Enable gradient checkpointing
- ``actor_rollout_ref.model.enable_activation_offload``: Enable activation offloading
- ``actor_rollout_ref.model.trust_remote_code``: Allow execution of remote code when loading the model
- ``actor_rollout_ref.model.use_remove_padding``: Remove padding tokens from the input to avoid wasted computation

Actor Configuration
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   actor_rollout_ref.actor.strategy=fsdp
   actor_rollout_ref.actor.ppo_mini_batch_size=256
   actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
   actor_rollout_ref.actor.grad_clip=1.0
   actor_rollout_ref.actor.clip_ratio=0.2
   actor_rollout_ref.actor.entropy_coeff=0.0
   actor_rollout_ref.actor.use_kl_loss=False
   actor_rollout_ref.actor.kl_loss_coef=0.001
   actor_rollout_ref.actor.ppo_epochs=1
   actor_rollout_ref.actor.optim.lr=1e-6

- ``actor.strategy``: Backend strategy ("fsdp" or "megatron")
- ``actor.ppo_mini_batch_size``: Mini-batch size for PPO updates (global across GPUs)
- ``actor.ppo_micro_batch_size_per_gpu``: Micro-batch size per GPU; each mini-batch is processed in micro-batches with gradient accumulation
- ``actor.grad_clip``: Gradient clipping threshold
- ``actor.clip_ratio``: PPO clip ratio
- ``actor.use_kl_loss``: Add a KL loss term to the actor objective
- ``actor.kl_loss_coef``: Coefficient of the KL loss term (used with GRPO)
- ``actor.optim.lr``: Learning rate
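
The relationship between the global mini-batch size and the per-GPU micro-batch size can be worked through with a little arithmetic. The sketch below is a simplification: it assumes the mini-batch is split evenly across GPUs and ignores any additional scaling the framework may apply (for example by the number of responses per prompt). The ``grad_accum_steps`` helper is hypothetical.

.. code-block:: python

   def grad_accum_steps(mini_batch_size: int, n_gpus: int,
                        micro_batch_size_per_gpu: int) -> int:
       """Number of micro-batches each GPU accumulates per optimizer step."""
       per_gpu, remainder = divmod(mini_batch_size, n_gpus)
       if remainder:
           raise ValueError("mini-batch size must be divisible by GPU count")
       steps, remainder = divmod(per_gpu, micro_batch_size_per_gpu)
       if remainder:
           raise ValueError("per-GPU share must be divisible by micro-batch size")
       return steps


   # Values from the listing above on a single 8-GPU node:
   # 256 / 8 = 32 samples per GPU, 32 / 8 = 4 accumulation steps.
   print(grad_accum_steps(256, 8, 8))  # 4

Lowering ``ppo_micro_batch_size_per_gpu`` reduces peak memory at the cost of more accumulation steps; under these assumptions it does not change the effective mini-batch.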

Reference Model
~~~~~~~~~~~~~~~

.. code-block:: bash

   actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16
   actor_rollout_ref.ref.fsdp_config.param_offload=False

- ``ref.log_prob_micro_batch_size_per_gpu``: Micro-batch size for reference log prob computation
- ``ref.fsdp_config.param_offload``: Enable parameter offloading (recommended for models larger than 7B parameters)

Rollout Configuration
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   actor_rollout_ref.rollout.name=vllm
   actor_rollout_ref.rollout.temperature=1.0
   actor_rollout_ref.rollout.top_k=-1
   actor_rollout_ref.rollout.top_p=1.0
   actor_rollout_ref.rollout.tensor_model_parallel_size=2
   actor_rollout_ref.rollout.gpu_memory_utilization=0.5
   actor_rollout_ref.rollout.n=8

- ``rollout.name``: Rollout backend ("vllm", "sglang", "hf")
- ``rollout.temperature``: Sampling temperature
- ``rollout.top_k``: Top-k sampling cutoff; use -1 (vLLM) or 0 (HF) to disable it
- ``rollout.top_p``: Top-p sampling
- ``rollout.tensor_model_parallel_size``: Tensor parallelism size (vLLM only)
- ``rollout.gpu_memory_utilization``: GPU memory fraction for vLLM
- ``rollout.n``: Number of responses sampled per prompt (set greater than 1 for GRPO/RLOO)
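
For readers unfamiliar with the sampling parameters, the sketch below shows the generic effect of ``temperature``, ``top_k``, and ``top_p`` on a token distribution. It is a textbook illustration in plain Python, not how vLLM, SGLang, or HF implement sampling; the ``filter_probs`` helper is hypothetical.

.. code-block:: python

   import math


   def filter_probs(logits, temperature=1.0, top_k=-1, top_p=1.0):
       """Return the renormalized distribution over the tokens that survive."""
       # Temperature (must be > 0) rescales logits before the softmax.
       scaled = [x / temperature for x in logits]
       peak = max(scaled)
       exps = [math.exp(x - peak) for x in scaled]
       total = sum(exps)
       probs = [e / total for e in exps]

       # Rank token ids by probability, highest first.
       order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
       if top_k > 0:  # -1 and 0 both leave the candidate set untouched
           order = order[:top_k]

       # Keep the smallest prefix whose cumulative probability reaches top_p.
       kept, cumulative = [], 0.0
       for i in order:
           kept.append(i)
           cumulative += probs[i]
           if cumulative >= top_p:
               break

       norm = sum(probs[i] for i in kept)
       return {i: probs[i] / norm for i in kept}


   print(filter_probs([2.0, 1.0, 0.0], top_k=1))  # {0: 1.0}
   print(sorted(filter_probs([2.0, 1.0, 0.0])))   # [0, 1, 2]

With the values in the listing (``temperature=1.0``, ``top_k=-1``, ``top_p=1.0``) no token is filtered out, so responses are sampled from the unmodified model distribution.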

Critic Model
------------

Location: ``siirl/params/model_args.py``

.. code-block:: bash

   critic.enable=True
   critic.model.path=/path/to/critic_model
   critic.ppo_mini_batch_size=256
   critic.ppo_micro_batch_size_per_gpu=8
   critic.optim.lr=1e-5

Most critic parameters mirror their counterparts in the Actor configuration.

Reward Model
------------

Location: ``siirl/params/model_args.py``

.. code-block:: bash

   reward_model.enable=False
   reward_model.model.path=/path/to/reward_model
   reward_model.model.input_tokenizer=null
   reward_model.micro_batch_size_per_gpu=16
   reward_model.reward_manager=naive

- ``reward_model.enable``: Enable the reward model; when ``False``, only custom reward functions are used
- ``reward_model.model.input_tokenizer``: Path to the reward model's input tokenizer, if it differs from the policy's
- ``reward_model.reward_manager``: Reward manager type ("naive", "batch", "parallel", "dapo", "embodied")

Custom Reward Function
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   custom_reward_function.path=/path/to/my_reward.py
   custom_reward_function.name=compute_score

- ``custom_reward_function.path``: Path to custom reward function file
- ``custom_reward_function.name``: Function name (default: "compute_score")

See :doc:`/user_interface/reward_interface` for details.
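
As a rough idea of what such a file might contain, here is a minimal sketch of a ``compute_score`` function. The signature ``(data_source, solution_str, ground_truth, extra_info=None)`` is an assumption borrowed from a convention common in similar RL frameworks and is not confirmed by this guide; check the reward interface documentation for the arguments siiRL actually passes.

.. code-block:: python

   import re


   def compute_score(data_source, solution_str, ground_truth, extra_info=None):
       """Return 1.0 if the last number in the response matches the answer.

       NOTE: the argument list is a hypothetical example, not the confirmed
       siiRL interface.
       """
       # Strip thousands separators, then collect every number in the text.
       text = solution_str.replace(",", "")
       numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
       if not numbers:
           return 0.0
       return 1.0 if numbers[-1] == str(ground_truth).strip() else 0.0


   print(compute_score("gsm8k", "So the answer is 42", "42"))  # 1.0
   print(compute_score("gsm8k", "I am not sure", "42"))        # 0.0

A rule-based function like this pairs with ``reward_model.enable=False``, in which case no learned reward model is loaded.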

Algorithm Parameters
--------------------

Location: ``siirl/params/model_args.py``

.. code-block:: bash

   algorithm.gamma=1.0
   algorithm.lam=1.0
   algorithm.adv_estimator=grpo
   algorithm.use_kl_in_reward=False
   algorithm.kl_penalty=kl
   algorithm.kl_ctrl.type=fixed
   algorithm.kl_ctrl.kl_coef=0.005
   algorithm.workflow_type=DEFAULT

- ``algorithm.gamma``: Discount factor
- ``algorithm.lam``: GAE lambda (bias-variance tradeoff)
- ``algorithm.adv_estimator``: Advantage estimator ("gae", "grpo", "cpgd", "gspo", "rloo")
- ``algorithm.use_kl_in_reward``: Enable KL penalty in reward
- ``algorithm.kl_penalty``: KL divergence calculation method ("kl", "abs", "mse", "low_var_kl", "full")
- ``algorithm.workflow_type``: Workflow type ("DEFAULT", "DAPO", "EMBODIED")
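
The ``algorithm.kl_penalty`` options correspond to different per-token estimators of the divergence between the policy and the reference model. The sketch below uses the formulas commonly associated with these names; it is an assumption that siiRL follows the same definitions, and the ``kl_penalty`` helper is hypothetical. The ``full`` option is omitted because it needs the complete token distributions rather than two log-probabilities.

.. code-block:: python

   import math


   def kl_penalty(logprob: float, ref_logprob: float, kind: str = "kl") -> float:
       """Per-token penalty from policy and reference log-probabilities."""
       diff = logprob - ref_logprob
       if kind == "kl":
           return diff  # plain log-ratio, can be negative
       if kind == "abs":
           return abs(diff)
       if kind == "mse":
           return 0.5 * diff * diff
       if kind == "low_var_kl":
           # Low-variance estimator exp(-d) - 1 + d, non-negative by
           # construction; clipped here for numerical safety.
           value = math.exp(-diff) - 1.0 + diff
           return min(max(value, -10.0), 10.0)
       raise ValueError(f"unsupported kl_penalty: {kind}")


   print(kl_penalty(-1.0, -1.5, "kl"))          # 0.5
   print(kl_penalty(-1.0, -1.5, "mse"))         # 0.125
   print(kl_penalty(-1.0, -1.0, "low_var_kl"))  # 0.0

The chosen estimator is then weighted by ``algorithm.kl_ctrl.kl_coef`` when ``algorithm.use_kl_in_reward`` is enabled.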

Training Parameters
-------------------

Location: ``siirl/params/training_args.py``

.. code-block:: bash

   trainer.total_epochs=30
   trainer.project_name=siirl_examples
   trainer.experiment_name=gsm8k
   trainer.logger=['console','wandb']
   trainer.nnodes=1
   trainer.n_gpus_per_node=8
   trainer.save_freq=10
   trainer.val_before_train=True
   trainer.test_freq=2

- ``trainer.total_epochs``: Number of training epochs
- ``trainer.project_name``: Project name (for logging)
- ``trainer.experiment_name``: Experiment name (for logging)
- ``trainer.logger``: Logger types (["console", "wandb", "tensorboard", "mlflow"])
- ``trainer.nnodes``: Number of nodes
- ``trainer.n_gpus_per_node``: Number of GPUs per node
- ``trainer.save_freq``: Checkpoint saving frequency, in training iterations
- ``trainer.val_before_train``: Run validation before training
- ``trainer.test_freq``: Validation frequency, in training iterations
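
The interplay of the frequency settings can be sketched as a simple schedule. This assumes that an action fires whenever the iteration number is a multiple of its frequency, which is a common convention but not confirmed here; the ``schedule`` helper is hypothetical.

.. code-block:: python

   def schedule(total_steps, save_freq, test_freq, val_before_train=True):
       """List (step, action) pairs under a modulo-based trigger rule."""
       events = []
       if val_before_train:
           events.append((0, "validate"))
       for step in range(1, total_steps + 1):
           if test_freq > 0 and step % test_freq == 0:
               events.append((step, "validate"))
           if save_freq > 0 and step % save_freq == 0:
               events.append((step, "save"))
       return events


   # With save_freq=10 and test_freq=2 over the first 10 iterations:
   for step, action in schedule(10, save_freq=10, test_freq=2):
       print(step, action)

   # Total number of GPUs for nnodes=1 and n_gpus_per_node=8.
   print(1 * 8)  # 8

Under this assumption the listing above validates once before training, then every second iteration, and writes a checkpoint every tenth iteration.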

DAG Parameters
--------------

Location: ``siirl/params/dag_args.py``

.. code-block:: bash

   dag.custom_pipeline_fn=null

- ``dag.custom_pipeline_fn``: Custom pipeline function path (e.g., "module:function")

See :doc:`/user_interface/pipeline_interface` for custom pipeline details.
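
A ``module:function`` string is typically resolved by importing the module and looking up the attribute. The sketch below shows that general technique with the standard library; it is not siiRL's loader, and the ``resolve`` helper is hypothetical. A standard-library function stands in for a real pipeline function.

.. code-block:: python

   import importlib


   def resolve(spec: str):
       """Turn "package.module:function" into the function object."""
       module_name, separator, attribute = spec.partition(":")
       if not separator or not module_name or not attribute:
           raise ValueError(f"expected 'module:function', got {spec!r}")
       module = importlib.import_module(module_name)
       return getattr(module, attribute)


   # "math:sqrt" is only a placeholder for a user-defined pipeline function.
   fn = resolve("math:sqrt")
   print(fn(16.0))  # 4.0

For this kind of lookup to work, the module must be importable from the environment in which training is launched.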

Complete Example
----------------

GRPO Training
~~~~~~~~~~~~~

.. code-block:: bash

   python -m siirl.main_dag \
     algorithm.adv_estimator=grpo \
     algorithm.workflow_type=DEFAULT \
     data.train_files=/path/to/gsm8k/train.parquet \
     data.train_batch_size=512 \
     data.max_prompt_length=2048 \
     data.max_response_length=4096 \
     actor_rollout_ref.model.path=/path/to/model \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.actor.ppo_mini_batch_size=256 \
     actor_rollout_ref.rollout.name=vllm \
     actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
     actor_rollout_ref.rollout.n=8 \
     custom_reward_function.path=siirl/user_interface/rewards_interface/custom_gsm8k_reward.py \
     custom_reward_function.name=compute_score \
     trainer.total_epochs=30 \
     trainer.n_gpus_per_node=8 \
     trainer.save_freq=10
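
For context on what ``algorithm.adv_estimator=grpo`` computes, the sketch below shows the group-relative advantage in its usual form: the ``rollout.n`` responses to one prompt are scored, and each score is normalized by the group's mean and standard deviation. This is an illustration of the general idea, not the siiRL code; the ``grpo_advantages`` helper is hypothetical, and the use of the population standard deviation and the ``eps`` value are assumptions.

.. code-block:: python

   from statistics import mean, pstdev


   def grpo_advantages(rewards, eps=1e-6):
       """Normalize the rewards of one prompt's response group."""
       mu = mean(rewards)
       sigma = pstdev(rewards)  # assumption: population standard deviation
       return [(r - mu) / (sigma + eps) for r in rewards]


   # Four responses to the same prompt, two correct and two incorrect.
   print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))

   # If every response gets the same reward, all advantages are zero.
   print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

Because the baseline comes from the group itself, no critic model is needed, which is why this example does not set ``critic.enable``.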

PPO Training
~~~~~~~~~~~~

.. code-block:: bash

   python -m siirl.main_dag \
     algorithm.adv_estimator=gae \
     critic.enable=True \
     data.train_files=/path/to/data.parquet \
     actor_rollout_ref.model.path=/path/to/model \
     actor_rollout_ref.actor.optim.lr=1e-6 \
     actor_rollout_ref.rollout.name=vllm \
     critic.optim.lr=1e-5 \
     trainer.total_epochs=30
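
With ``algorithm.adv_estimator=gae``, the critic's value predictions are combined with rewards using ``algorithm.gamma`` and ``algorithm.lam``. The sketch below is the textbook recursion for generalized advantage estimation over a single sequence, written in plain Python; it is not the siiRL implementation, and the ``gae`` helper is hypothetical.

.. code-block:: python

   def gae(rewards, values, gamma=1.0, lam=1.0):
       """Generalized advantage estimation for one sequence.

       rewards[t] and values[t] refer to step t; the value after the final
       step is taken to be 0.
       """
       advantages = [0.0] * len(rewards)
       running = 0.0
       for t in reversed(range(len(rewards))):
           next_value = values[t + 1] if t + 1 < len(values) else 0.0
           # One-step temporal-difference error.
           delta = rewards[t] + gamma * next_value - values[t]
           # Exponentially weighted sum of future TD errors.
           running = delta + gamma * lam * running
           advantages[t] = running
       return advantages


   rewards = [0.0, 0.0, 1.0]  # reward only at the final step
   values = [0.5, 0.5, 0.5]   # critic predictions
   print(gae(rewards, values, gamma=1.0, lam=1.0))  # [0.5, 0.5, 0.5]
   print(gae(rewards, values, gamma=1.0, lam=0.0))  # [0.0, 0.0, 0.5]

With ``lam=1.0`` the advantage is the full return minus the value estimate (low bias, higher variance); with ``lam=0.0`` it reduces to the one-step TD error.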

DAPO Training
~~~~~~~~~~~~~

.. code-block:: bash

   python -m siirl.main_dag \
     algorithm.workflow_type=DAPO \
     algorithm.adv_estimator=grpo \
     algorithm.filter_groups.enable=True \
     algorithm.filter_groups.metric=seq_final_reward \
     data.train_files=/path/to/data.parquet \
     actor_rollout_ref.model.path=/path/to/model \
     trainer.total_epochs=30
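
The group filtering enabled above reflects DAPO's dynamic sampling idea: a prompt whose responses all receive the same score contributes no learning signal under group normalization, so it is dropped. The sketch below illustrates that idea only; how siiRL applies ``filter_groups.metric`` internally is an assumption, and the ``keep_informative_groups`` helper is hypothetical.

.. code-block:: python

   from collections import defaultdict


   def keep_informative_groups(prompt_ids, scores):
       """Return indices of samples whose prompt group has mixed scores."""
       groups = defaultdict(list)
       for prompt_id, score in zip(prompt_ids, scores):
           groups[prompt_id].append(score)
       # A group is informative only if its scores are not all identical.
       kept = {pid for pid, vals in groups.items() if max(vals) != min(vals)}
       return [i for i, pid in enumerate(prompt_ids) if pid in kept]


   prompt_ids = ["a", "a", "b", "b"]
   scores = [1.0, 1.0, 0.0, 1.0]  # e.g. the final reward of each sequence
   print(keep_informative_groups(prompt_ids, scores))  # [2, 3]

Here prompt ``a`` is discarded because both of its responses scored the same, while prompt ``b`` is kept.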

Parameter Reference
-------------------

For the complete parameter definitions, see:

- ``siirl/params/data_args.py`` - Data parameters
- ``siirl/params/model_args.py`` - Actor/rollout/reference, critic, reward model, and algorithm parameters
- ``siirl/params/training_args.py`` - Training parameters
- ``siirl/params/dag_args.py`` - DAG workflow parameters
- ``siirl/params/profiler_args.py`` - Profiler parameters