DeepScaleR Example with PPO
=============================

Introduction
------------

This example demonstrates how to fine-tune a Large Language Model for advanced mathematical reasoning using the **DeepScaleR** dataset.

**Paper:** https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2.

**Dataset:** https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset

The core idea is to leverage Reinforcement Learning (RL), specifically Proximal Policy Optimization (PPO), to teach the model not just to find the correct answer, but to follow a logical, step-by-step reasoning process. This is achieved by rewarding the model based on the correctness of its final answer, which is extracted from a structured output.

Dataset Overview
----------------

The DeepScaleR dataset consists of challenging mathematical problems. Each sample includes a question (`problem`), a detailed reasoning path (`solution`), and a final answer enclosed in a `\\boxed{}` block (`answer`).

**An example from DeepScaleR:**

**Prompt:**
   "Let $a_n=6^{n}+8^{n}$. Determine the remainder upon dividing $a_ {83}$ by $49$."

**Solution:**
   "$6^{83} + 8^{83} = (6+8)(6^{82}-6^{81}8+\\ldots-8^{81}6+8^{82})$\n Becuase $7|(6+8)$, we only consider $6^{82}-6^{81}8+\\ldots-8^{81}6+8^{82} \\pmod{7}$\n$6^{82}-6^{81}8+\\ldots-8^{81}6+8^{82} \\equiv (-1)^{82} - (-1)^{81}+ \\ldots - (-1)^1 + 1 = 83 \\equiv 6 \\pmod{7}$\n$6^{83} + 8^{83} \\equiv 14 \\cdot 6 \\equiv \\boxed{035} \\pmod{49}$"

**Answer:**
   `35`

Step 1: Prepare the Dataset
---------------------------

First, preprocess the DeepScaleR dataset into the required Parquet format. Our framework includes a script for this purpose.

.. code:: bash

   cd examples/data_preprocess
   python3 deepscaler.py --local_dir ~/data/deepscaler

This will download the dataset from Hugging Face, process it, and save `train.parquet` and `test.parquet` files in the `~/data/deepscaler` directory.

Step 2: Download the Pre-trained Model
--------------------------------------

You need a base model to start the PPO training. In this example, we use `Qwen2.5-7B-Instruct`. There are several ways to make the model available to the trainer:

- **Recommended: Download via CLI:** Use tools like `huggingface-cli` or `modelscope` to download the model to a local directory. This gives you more control.

  .. code:: bash

     # For Hugging Face
     huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ~/data/models/Qwen2.5-7B-Instruct --local-dir-use-symlinks False
     
     # For ModelScope
     modelscope download Qwen/Qwen2.5-7B-Instruct --local_dir ~/data/models/Qwen2.5-7B-Instruct

- **Automatic Download:** You can also specify the Hugging Face model name (e.g., `Qwen/Qwen2.5-7B-Instruct`) directly in the `actor_rollout_ref.model.path` and `critic.model.path` fields of your run script. The framework will attempt to download it automatically on the first run.

Step 3: Perform PPO Training
----------------------------

With the data and model ready, you can now launch the PPO training job.

**Reward Function**

For this task, we use a simple but effective rule-based reward function. The framework's default reward mechanism will be used, which performs an exact match between the model's generated answer and the `ground_truth` from the dataset.
- The model is prompted to provide its final answer inside a `\\boxed{...}` block.
- The reward function checks if the content inside the generated `\\boxed{}` matches the ground truth answer.
- A correct match receives a positive reward (e.g., 1.0), while an incorrect match or a malformed response receives zero reward.

**Training Script**

Below is a complete training script based on `examples/ppo_trainer/run_qwen3-8b.sh`. It is configured for a single-node, multi-GPU setup. You should adapt paths like `HOME` to your environment.

.. literalinclude:: ../../examples/ppo_trainer/run_qwen3-8b.sh
   :language: bash
   :caption: examples/ppo_trainer/run_qwen2_5-7b.sh