¹Michigan State University  ²University of North Carolina at Chapel Hill
Multi-modal large language models (MLLMs) still struggle on Fine-Grained Visual Classification (FGVC), where subtle visual discrimination is essential. Chain-of-Thought (CoT) reasoning is widely adopted to boost performance on challenging tasks, yet prior works have shown it can harm visual perception accuracy. We systematically re-examine the role of textual reasoning in FGVC across zero-shot and reinforcement fine-tuning (RFT) settings, uncovering a central paradox: reasoning length is the key factor. Longer CoT consistently lowers classification accuracy, a phenomenon we term the "Cost of Thinking".
Building on this, we propose ReFine-RFT, which combines an ensemble reward with Multi-Reward Normalization (MRN) to constrain reasoning length while providing dense, accuracy-centric feedback. Extensive experiments show that ReFine-RFT achieves state-of-the-art results across FGVC benchmarks.
- **Cost of Thinking:** Verbose CoT systematically degrades MLLM performance on fine-grained perception; thinking length, not CoT quality, is the key factor.
- **Multi-Reward Normalization (MRN):** a plug-and-play normalization module that independently normalizes heterogeneous reward signals before aggregation, stabilizing multi-objective RFT training.
- **ReFine-RFT:** integrates MRN with an ensemble reward (format, accuracy, thinking length, MLLM-based, embedding similarity) to optimize visual accuracy while controlling reasoning length.
- **State-of-the-art results:** ReFine-RFT achieves 86.5% average accuracy (+3.6% over Visual-RFT) on four FGVC benchmarks with only a 2B backbone in 4-shot settings.
CoT prompting consistently degrades accuracy across all FGVC datasets: non-reasoning models drop by 3–6% on average when switching from Answer-only to CoT prompts. More strikingly, during RFT, models exhibit Reasoning Collapse, gradually suppressing verbose reasoning while optimizing for accuracy.
For fine-grained visual tasks, thinking length is the key factor: excessive reasoning hurts performance, and MLLMs benefit more from concise responses than from elaborate reasoning.
Inspired by our Cost of Thinking analysis, we propose ReFine-RFT: a reinforcement fine-tuning framework that jointly constrains reasoning length and provides dense accuracy-centric feedback. It features two key components: an Ensemble Reward for richer supervision, and Multi-Reward Normalization (MRN) for stable multi-objective optimization.
Standard binary accuracy rewards offer sparse supervision and fail to capture semantic similarity (e.g., "thorn apple" and "Datura stramonium" denote the same species). We design an ensemble of five complementary reward signals to provide richer, more accuracy-centric feedback (see the sketch after this list):
- **Format reward:** binary reward enforcing adherence to the structured <think>...</think><answer>...</answer> output template.
- **Accuracy reward:** binary reward that yields 1 when the predicted class extracted from the answer tag matches the ground truth, 0 otherwise.
- **Thinking-length reward:** assigns 1 when reasoning length falls within [L_min, L_max] = [0, 10], explicitly constraining verbosity during training.
- **MLLM-based reward:** a teacher MLLM grades each prediction on [0, 1] based on semantic alignment with the ground-truth label, capturing subcategory-level nuances.
- **Embedding-similarity reward:** cosine similarity between the predicted and ground-truth answer embeddings (via E5), providing a smooth, continuous semantic supervision signal.
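To make the ensemble concrete, here is a minimal Python sketch of the rule-based components. The regex, the word-level length measure, and the `intfloat/e5-base-v2` checkpoint are illustrative assumptions (the text only specifies the template, the [0, 10] window, and "E5"); the MLLM-based reward is stubbed out since it requires a teacher model.

```python
import re
from sentence_transformers import SentenceTransformer

# Assumptions: exact template regex, "length" measured in whitespace words,
# and the specific E5 checkpoint are illustrative, not the authors' code.
TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)
L_MIN, L_MAX = 0, 10                                   # length window from the paper
embedder = SentenceTransformer("intfloat/e5-base-v2")  # assumed E5 variant

def parse(response: str):
    """Extract (thinking, answer); empty strings if the template is violated."""
    m = TEMPLATE.fullmatch(response.strip())
    return (m.group(1).strip(), m.group(2).strip()) if m else ("", "")

def format_reward(response: str) -> float:
    """1 if the response follows <think>...</think><answer>...</answer>, else 0."""
    return 1.0 if TEMPLATE.fullmatch(response.strip()) else 0.0

def accuracy_reward(pred: str, label: str) -> float:
    """Binary exact match on the extracted answer."""
    return 1.0 if pred.strip().lower() == label.strip().lower() else 0.0

def length_reward(thinking: str) -> float:
    """1 if reasoning length lies in [L_MIN, L_MAX], explicitly penalizing verbosity."""
    return 1.0 if L_MIN <= len(thinking.split()) <= L_MAX else 0.0

def embedding_reward(pred: str, label: str) -> float:
    """Cosine similarity between answer embeddings: a smooth semantic signal."""
    e = embedder.encode([pred, label], normalize_embeddings=True)
    return float(e[0] @ e[1])

def mllm_reward(pred: str, label: str) -> float:
    """Stub: a teacher MLLM would score semantic alignment on [0, 1]."""
    raise NotImplementedError("query a judge MLLM here")
```

How these per-rollout signals are combined is exactly what MRN, described next, addresses.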
When multiple reward signals are combined in standard Group Relative Policy Optimization (GRPO), they are first summed and then normalized as a single scalar. This causes fast-saturating rewards (e.g., the format reward) to dominate early training and dilute the influence of slower, more informative rewards (e.g., accuracy). MRN addresses this by normalizing each reward independently before aggregation, placing all rewards on a comparable scale throughout training.
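A minimal sketch of the contrast, assuming GRPO's usual within-group z-score normalization and an unweighted sum for aggregation (the actual weighting is an assumption):

```python
import numpy as np

def z(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Within-group z-score, as in GRPO advantage estimation."""
    return (x - x.mean()) / (x.std() + eps)

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO: sum the K signals per rollout, then normalize the
    single scalar across the group; `rewards` has shape (group_size, K)."""
    return z(rewards.sum(axis=1))

def mrn_advantages(rewards: np.ndarray) -> np.ndarray:
    """MRN: z-score each reward column independently across the group,
    then aggregate, so every signal contributes on a comparable scale."""
    cols = [z(rewards[:, k]) for k in range(rewards.shape[1])]
    return np.column_stack(cols).sum(axis=1)

# Toy group of 4 rollouts with 2 rewards: a binary format reward (large raw
# variance) and an embedding-similarity reward confined to a narrow band.
group = np.array([[1.0, 0.90],
                  [0.0, 0.92],
                  [1.0, 0.88],
                  [0.0, 0.95]])
print(grpo_advantages(group))  # tracks the binary format column almost exactly
print(mrn_advantages(group))   # both signals contribute on a comparable scale
```

In the toy group, sum-then-normalize yields advantages driven almost entirely by the binary format reward, while MRN lets the narrow-band similarity reward reshape the ranking.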
For fine-grained visual perception, jointly using multi-perspective, accuracy-centric rewards and explicit reasoning length control leads to stronger visual perception capabilities.
ReFine-RFT is evaluated in a 4-shot setting on four FGVC benchmarks: Aircrafts-102, Flowers-102, Cars-196, and Oxford-Pets-37. Using Qwen2-VL-2B-Instruct as the base, ReFine-RFT with LoRA achieves 86.5% average accuracy, a +3.6% improvement over Visual-RFT.
| Methods | FT Method | FT Type | Aircrafts-102 | Flowers-102 | Cars-196 | Pets-37 | Average |
|---|---|---|---|---|---|---|---|
| Qwen2-VL-2B | Zero-shot | — | 45.9 | 54.8 | 56.8 | 66.4 | 56.0 |
| Finedefics-8B | SFT | Fully-FT | 63.8 | 89.9 | 84.7 | 92.2 | 82.7 |
| SFT-AO | SFT | Fully-FT | 67.9 | 58.5 | 40.5 | 55.5 | 55.6 |
| SFT-AO | SFT | LoRA | 78.3 | 74.8 | 80.0 | 87.6 | 80.2 |
| SFT-CoT | SFT | LoRA | 73.9 | 74.4 | 52.3 | 87.5 | 72.0 |
| Visual-RFT | RFT | Fully-FT | 74.8 | 71.4 | 95.3 | 86.1 | 81.9 |
| Visual-RFT | RFT | LoRA | 75.6 | 74.1 | 95.7 | 86.0 | 82.9 |
| No-Thinking-RFT | RFT | Fully-FT | — | 71.2 | — | 86.1 | — |
| ReFine-RFT-AO (Ours) | RFT | LoRA | 78.7 | 81.4 | 93.1 | 87.6 | 85.2 |
| ReFine-RFT-CoT (Ours) | RFT | LoRA | 79.3 (+3.7%) | 81.0 (+6.9%) | 97.1 (+1.4%) | 88.6 (+2.6%) | 86.5 (+3.6%) |
ReFine-RFT, with only a 2B backbone and 4-shot training, significantly surpasses Finedefics-8B trained on the full FGVC datasets, underscoring the efficiency and scalability of our approach. The gain comes not from more data or parameters, but from better reward shaping and reasoning control.
@inproceedings{zhu2026refinerft,
title = {Can Textual Reasoning Improve the Performance of MLLMs
on Fine-grained Visual Classification?},
author = {Zhu, Jie and Su, Yiyang and Liu, Xiaoming},
booktitle = {CVPR Findings},
year = {2026},
}