We fine-tune LVLMs for fine-grained image recognition by turning the model's own top-k confusions into multiple-choice questions and training with RL — forcing differential reasoning among visually similar categories without brittle string-match rewards. DiVE-k improves over the base model by over 10% and over prior fine-tuning methods by over 6% on the Harmonic Mean metric.
Large Vision Language Models (LVLMs) possess extensive textual knowledge but struggle to apply it to fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact string-match rewards are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes.
To address this, we propose DiVE-k (Differential Visual rEasoning using top-k generations), a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization.
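The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`build_mcq`, `mcq_reward`) and the prompt wording are hypothetical, and the RL training loop that consumes the reward is omitted.

```python
import random


def build_mcq(topk_preds, ground_truth, rng=None):
    """Turn a model's top-k predictions into a multiple-choice question.

    If the ground-truth category is absent from the top-k outputs, it
    replaces the last option so the question always has a correct answer.
    Returns the question text and the correct option letter.
    """
    rng = rng or random.Random(0)
    options = list(dict.fromkeys(topk_preds))  # dedupe, keep order
    if ground_truth not in options:
        options[-1] = ground_truth
    rng.shuffle(options)  # avoid positional bias
    letters = [chr(ord("A") + i) for i in range(len(options))]
    prompt = "Which category best matches the image?\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip(letters, options)
    )
    answer = letters[options.index(ground_truth)]
    return prompt, answer


def mcq_reward(model_choice, answer):
    """Verifiable reward: 1.0 for selecting the correct letter, else 0.0."""
    return 1.0 if model_choice.strip().upper() == answer else 0.0
```

Because the reward checks a single option letter rather than a free-form category string, it stays verifiable even when category names admit many surface forms, which is the property the abstract attributes to DiVE-k's reward design.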
Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.