Prepare Data for Post-Training

Before starting the post-training job, we need to prepare the data for policy training. The data should be preprocessed and stored in Parquet format, which facilitates efficient distributed data loading and processing.

We provide several data preprocessing scripts for popular datasets under the examples/data_preprocess/ directory, such as gsm8k.py, math_dataset.py, and deepscaler.py. To support a new custom dataset, you will need to create a similar script.

This document uses the DeepScaleR dataset as an example to detail the data preparation process and its specifications.

General Data Preprocessing Workflow

A typical data preprocessing script involves the following steps:

1. Load Raw Data: Use a library like Hugging Face’s datasets to load the original dataset from the Hub or local files.
2. Define Processing Logic: Implement a core mapping function (which we often name make_map_fn) that converts each sample from the original dataset into the format required by our framework.
3. Apply Transformation and Save: Use the dataset’s map() method to apply this function to every sample. Then save the processed result locally in Parquet format, optionally uploading it to a distributed file system such as HDFS.

Here is a simplified framework of the process:

import argparse
import os
import datasets
from siirl.utils.extras.hdfs_io import copy, makedirs

def make_map_fn(split_name):
    # ... Define your data processing logic here ...
    def process_fn(example, idx):
        # ... Transform each data sample ...
        return transformed_data
    return process_fn

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # ... Define arguments ...
    args = parser.parse_args()

    # 1. Load data
    raw_dataset = datasets.load_dataset(...)

    # 2. Apply transformation
    processed_dataset = raw_dataset.map(function=make_map_fn('train'), with_indices=True)

    # 3. Save as Parquet
    local_dir = args.local_dir
    processed_dataset.to_parquet(os.path.join(local_dir, "train.parquet"))

    # (Optional) Upload to HDFS
    if args.hdfs_dir:
        makedirs(args.hdfs_dir)
        copy(src=local_dir, dst=args.hdfs_dir)
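
The argument placeholder in the skeleton can be filled in many ways. As a minimal sketch, assuming only the two arguments the skeleton actually reads (args.local_dir and args.hdfs_dir), the parser might look like the following. The build_parser helper name and the default path are illustrative assumptions, not values taken from the provided scripts.

```python
import argparse
import os


def build_parser():
    # Hypothetical helper: only the two argument names come from the skeleton.
    parser = argparse.ArgumentParser(description="Preprocess a dataset into Parquet.")
    # Local output directory; this default path is an assumption for illustration.
    parser.add_argument("--local_dir", default="~/data/my_dataset")
    # Optional HDFS target; leaving it as None skips the upload step.
    parser.add_argument("--hdfs_dir", default=None)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--local_dir", "~/data/deepscaler"])

# argparse does not expand "~", so expand it before building file paths.
local_dir = os.path.expanduser(args.local_dir)
print(local_dir, args.hdfs_dir)
```

Defaulting hdfs_dir to None keeps the upload step opt-in, which matches the `if args.hdfs_dir:` check in the skeleton.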

DeepScaleR Dataset Processing in Practice

Let’s take examples/data_preprocess/deepscaler.py as a concrete example to demonstrate how to process the agentica-org/DeepScaleR-Preview-Dataset.

The core task is to implement the make_map_fn function, which maps original fields (like problem, answer, and solution) to the standard format required by the training framework.

data_source = "agentica-org/DeepScaleR-Preview-Dataset"
instruction_following = 'Let\'s think step by step and output the final answer within \\boxed{}.'

def make_map_fn(split_name):

    def process_fn(example, idx):
        question_raw = example.pop("problem")
        answer_raw = example.pop("answer")

        question = question_raw + " " + instruction_following
        solution = example.pop("solution")
        data = {
            "data_source": data_source,
            "prompt": [
                {
                    "role": "user",
                    "content": question,
                }
            ],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": answer_raw},
            "extra_info": {
                "split": split_name,
                "index": idx,
                "answer": solution,
                "question": question_raw,
            },
        }

        return data

    return process_fn
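
To see what this mapping produces, the sketch below applies it to a single hand-written sample. The sample values are made up for illustration, and the function is restated compactly (with uppercase constant names chosen here) so the snippet runs on its own. With the real dataset, the map() call with with_indices=True supplies example and idx.

```python
DATA_SOURCE = "agentica-org/DeepScaleR-Preview-Dataset"
INSTRUCTION = "Let's think step by step and output the final answer within \\boxed{}."


def make_map_fn(split_name):
    def process_fn(example, idx):
        # pop() removes the raw fields so only the framework fields remain.
        question_raw = example.pop("problem")
        answer_raw = example.pop("answer")
        solution = example.pop("solution")
        return {
            "data_source": DATA_SOURCE,
            "prompt": [{"role": "user", "content": question_raw + " " + INSTRUCTION}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": answer_raw},
            "extra_info": {
                "split": split_name,
                "index": idx,
                "answer": solution,
                "question": question_raw,
            },
        }

    return process_fn


# Made-up sample containing the three fields the script reads.
sample = {"problem": "What is 2+2?", "answer": "4", "solution": "2 + 2 = 4."}

# Pass a copy because process_fn pops keys from the dict it receives.
record = make_map_fn("train")(dict(sample), 0)

print(record["prompt"][0]["content"])
print(record["reward_model"])
```

Note that the instruction suffix is appended only to the prompt content; extra_info keeps the unmodified question for traceability.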

Data Format Specification

To ensure the framework can correctly parse and utilize the data, each sample processed by make_map_fn must contain the following five key fields:

data_source: A string indicating the source or name of the dataset. This field is used to dynamically select the corresponding reward function during training.
  - Example: "agentica-org/DeepScaleR-Preview-Dataset"
prompt: A list used to construct the model’s input, formatted to be compatible with Hugging Face’s Chat Template. The data loader will automatically apply the template and tokenize the input.
  - Example: [{"role": "user", "content": "What is 2+2? Let's think step by step..."}]
ability: A string defining the domain or capability of the current task, such as "math", "coding", or "general".
reward_model: A dictionary containing information needed to calculate the reward. Currently, the ground_truth field is primarily used during evaluation.
  - Note: The ground_truth you provide must align with the logic of the corresponding reward function you implement. For a math problem, it might be the standard answer; for code generation, it could be a set of unit tests.
  - Example: {"style": "rule", "ground_truth": "\\boxed{4}"}
extra_info: A dictionary for storing additional metadata, such as the original dataset split (train/test) or sample index. This information is not used directly in training but is useful for debugging and data traceability.
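
To illustrate how data_source and ground_truth interact, here is a minimal sketch of a rule-based reward lookup. The registry (REWARD_FNS), the function names (extract_boxed, boxed_match_reward, compute_reward), and the exact-match scoring are all hypothetical and are not SiiRL's actual reward implementation. The point is only that the stored ground_truth must be in the form that the selected function compares against.

```python
import re


def extract_boxed(text):
    # Return the content of the last \boxed{...} in text, or None.
    # Simplification: nested braces inside the box are not handled.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def boxed_match_reward(response, ground_truth):
    # Hypothetical rule: 1.0 if the boxed answer equals ground_truth exactly.
    predicted = extract_boxed(response)
    if predicted is not None and predicted == ground_truth.strip():
        return 1.0
    return 0.0


# Hypothetical registry keyed by the data_source field.
REWARD_FNS = {"agentica-org/DeepScaleR-Preview-Dataset": boxed_match_reward}


def compute_reward(sample, response):
    # Select the reward function by data_source, then score the response.
    reward_fn = REWARD_FNS[sample["data_source"]]
    return reward_fn(response, sample["reward_model"]["ground_truth"])


sample = {
    "data_source": "agentica-org/DeepScaleR-Preview-Dataset",
    "reward_model": {"style": "rule", "ground_truth": "4"},
}
print(compute_reward(sample, "2 + 2 = 4, so the answer is \\boxed{4}."))
print(compute_reward(sample, "The answer is \\boxed{5}."))
```

Under this particular rule, a ground_truth stored with the \boxed{} wrapper included would never match, because the function compares against the extracted content only. Whichever convention you choose, the stored value and the reward logic have to agree.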

By following these specifications, you can prepare your dataset to be used smoothly within the SiiRL post-training pipeline.
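
As a quick sanity check before training, a processed sample can be tested against the five fields listed above. The validate_sample helper below is a hypothetical sketch, not part of SiiRL, and its type checks are inferred from the field descriptions in this document.

```python
# Expected Python type for each required field, per the descriptions above.
REQUIRED_FIELDS = {
    "data_source": str,
    "prompt": list,
    "ability": str,
    "reward_model": dict,
    "extra_info": dict,
}


def validate_sample(sample):
    """Return a list of problems found; an empty list means the sample passes."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in sample:
            problems.append(f"missing field: {name}")
        elif not isinstance(sample[name], expected_type):
            problems.append(f"{name} should be of type {expected_type.__name__}")

    # Each prompt entry should be a chat message with a role and content.
    prompt = sample.get("prompt")
    if isinstance(prompt, list):
        for i, message in enumerate(prompt):
            if not (isinstance(message, dict) and {"role", "content"} <= message.keys()):
                problems.append(f"prompt[{i}] needs 'role' and 'content'")

    # The reward function relies on ground_truth being present.
    reward_model = sample.get("reward_model")
    if isinstance(reward_model, dict) and "ground_truth" not in reward_model:
        problems.append("reward_model is missing ground_truth")
    return problems


good = {
    "data_source": "agentica-org/DeepScaleR-Preview-Dataset",
    "prompt": [{"role": "user", "content": "What is 2+2?"}],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "4"},
    "extra_info": {"split": "train", "index": 0},
}
bad = {"data_source": "my-dataset", "prompt": "What is 2+2?", "reward_model": {}}

print(validate_sample(good))
print(validate_sample(bad))
```

Running a check like this on a handful of samples inside the preprocessing script catches schema mistakes before they surface in the data loader.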