MM-Eureka Example with GRPO
===========================
Introduction
------------
This guide details how to fine-tune a multi-modal Large Language Model using the **Group Relative Policy Optimization (GRPO)** algorithm on the **MM-Eureka** dataset. MM-Eureka is a challenging dataset designed to test mathematical reasoning that requires interpreting both text and images.
**Paper:** https://arxiv.org/pdf/2503.07365
**Dataset:** https://huggingface.co/datasets/FanqingM/MM-Eureka-Dataset
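The goal is to improve a model's ability to reason over visual and textual information together. To optimize the model's policy we use GRPO, a reinforcement learning (RL) algorithm that samples a group of responses for each prompt and scores every response relative to the others in its group, which removes the need for a separate value (critic) model.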
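
The group-relative scoring at the heart of GRPO can be illustrated with a minimal sketch. It is not the siiRL implementation: the function name, the ``eps`` constant, and the use of the population standard deviation are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of responses sampled for the same prompt.

    Each advantage is the reward's distance from the group mean, scaled by
    the group's standard deviation. `eps` (an assumed constant) avoids a
    division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real trainers may use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for one prompt, each with a scalar reward.
rewards = [1.0, 0.1, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive a positive advantage and are reinforced; those below the average receive a negative one.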
Dataset Overview
----------------
MM-Eureka problems consist of a text-based question paired with one or more images. The model must understand the content of the image to solve the problem correctly.
**An example from MM-Eureka:**
**Prompt:**

.. image:: https://github.com/sii-research/siiRL/raw/main/docs/_static/cube.jpg
   :width: 50%

Question: A cube loses one vertex after a 'corner' is removed. This geometric shape is ___ (fill in the number).
**Answer:**
3
Step 1: Data Preprocessing
--------------------------
The raw MM-Eureka dataset, typically in `.jsonl` format, must be converted to Parquet. This involves not only structuring the text but also processing the associated images.
The script `examples/data_preprocess/mm_eureka.py` handles this. It performs the following actions:
- Parses each line of the input JSONL file.
- Reads the image file specified in `image_urls` and embeds its byte content directly into the Parquet file.
- Formats the user prompts to include instructions for the desired output structure (`......`).
- Splits the data into training and testing sets.
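
The steps above can be sketched roughly as follows. This is a simplified illustration rather than the contents of `mm_eureka.py`: only `image_urls` comes from the description above, while the `question` and `answer` field names, the output keys, and both helper functions are hypothetical.

```python
import json
import random
from pathlib import Path


def build_record(line, image_root, instruction):
    """Turn one JSONL line into a flat record with embedded image bytes."""
    item = json.loads(line)
    # Read every referenced image and keep its raw bytes in the record.
    images = [
        {"bytes": (Path(image_root) / rel_path).read_bytes()}
        for rel_path in item["image_urls"]
    ]
    return {
        # "question" and "answer" are assumed field names.
        "prompt": f'{item["question"]}\n{instruction}',
        "images": images,
        "ground_truth": item["answer"],
    }


def split_records(records, test_fraction=0.1, seed=0):
    """Shuffle deterministically and split into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Each split could then be written out with a Parquet library, for example by building a pandas ``DataFrame`` from the records and calling ``to_parquet``.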
Run the script with your dataset file:

.. code:: bash

   cd examples/data_preprocess
   python3 mm_eureka.py --jsonl_file /path/to/your/mm_eureka_data.jsonl --output_dir ~/data/mm_eureka/
Step 2: Defining the Reward Score
---------------------------------
A custom reward function is crucial for multi-modal reasoning. For MM-Eureka, we use a composite score defined in `siirl/utils/reward_score/mm_eureka.py`. This function evaluates two aspects of the model's response:
1. **Accuracy Reward**: This is the primary component. It parses the mathematical expression from the model's output (often in LaTeX) and compares it against the ground truth using the `math_verify` utility. This provides a robust check for mathematical correctness.
2. **Format Reward**: A smaller, secondary reward is given if the model correctly follows the required `......` structure. This encourages the model to generate well-formed, interpretable reasoning chains.
The final reward is a weighted sum of these two components (e.g., `0.9 * accuracy_reward + 0.1 * format_reward`), balancing correctness with style.
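
A minimal sketch of such a composite score is shown below. It is not the code in `siirl/utils/reward_score/mm_eureka.py`: the `<think>`/`<answer>` tag template is a hypothetical stand-in for the required output structure, all function names are illustrative, and a plain string comparison replaces `math_verify`.

```python
import re

# Hypothetical output template; the real structure is set by the preprocessing prompt.
DEFAULT_FORMAT = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def format_reward(response, pattern=DEFAULT_FORMAT):
    """Return 1.0 if the whole response follows the expected template."""
    return 1.0 if pattern.fullmatch(response) else 0.0


def accuracy_reward(response, ground_truth, pattern=DEFAULT_FORMAT):
    """Return 1.0 if the extracted answer equals the ground truth.

    Stand-in for math_verify: compares stripped strings only, so
    mathematically equivalent forms such as "1/2" and "0.5" do not match.
    """
    match = pattern.fullmatch(response)
    predicted = match.group(1) if match else response
    return 1.0 if predicted.strip() == ground_truth.strip() else 0.0


def compute_score(response, ground_truth, w_acc=0.9, w_fmt=0.1):
    """Weighted sum of the accuracy and format components."""
    return (
        w_acc * accuracy_reward(response, ground_truth)
        + w_fmt * format_reward(response)
    )
```

With these weights, a correct but unformatted answer still earns most of the reward, while a well-formed but wrong answer earns only the small format bonus.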
Step 3: Download the Pre-trained Model
--------------------------------------
This example uses the vision-language model `Qwen2.5-VL-7B-Instruct`. Make sure the model weights are available locally before launching the training script.
- **Recommended: Download via CLI:**

  .. code:: bash

     # For Hugging Face
     huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ~/data/models/Qwen2.5-VL-7B-Instruct

     # For ModelScope
     modelscope download Qwen/Qwen2.5-VL-7B-Instruct --local_dir ~/data/models/Qwen2.5-VL-7B-Instruct

- **Automatic Download:** Alternatively, specify the model identifier directly in the run script's `actor_rollout_ref.model.path` field.
Step 4: Perform GRPO Training
-----------------------------
With the data and model prepared, you can launch the training job using the GRPO algorithm.
**Training Script**
The script `examples/grpo_trainer/run_qwen2_5_vl-7b.sh` provides a complete configuration for this task. It sets up the environment, Ray cluster, and all necessary hyperparameters for GRPO training on the MM-Eureka dataset. Adapt the `HOME` path and other variables as needed for your environment.

.. literalinclude:: ../../examples/grpo_trainer/run_qwen2_5_vl-7b.sh
   :language: bash
   :caption: examples/grpo_trainer/run_qwen2_5_vl-7b.sh