CVPR Findings 2026

Can Textual Reasoning Improve the
Performance of MLLMs on
Fine-grained Visual Classification?

Jie Zhu1,  Yiyang Su1,  Xiaoming Liu1,2

1Michigan State University  2University of North Carolina at Chapel Hill

TL;DR We uncover the "Cost of Thinking" in fine-grained visual classification — longer CoT reasoning consistently hurts MLLM accuracy. We propose ReFine-RFT, a reinforcement fine-tuning framework with ensemble rewards and Multi-Reward Normalization (MRN) that constrains reasoning length while boosting accuracy, achieving state-of-the-art on four FGVC benchmarks.

Abstract

Multi-modal large language models (MLLMs) still struggle on Fine-Grained Visual Classification (FGVC), where subtle visual discrimination is essential. Chain-of-Thought (CoT) reasoning is widely adopted to boost performance on challenging tasks, yet prior work has shown it can harm visual perception accuracy. We systematically re-examine the role of textual reasoning in FGVC across zero-shot and RFT settings, uncovering a central paradox: reasoning length is the key factor. Longer CoT consistently lowers classification accuracy, a phenomenon we term the "Cost of Thinking".

Building on this finding, we propose ReFine-RFT, which combines an ensemble reward with Multi-Reward Normalization (MRN) to constrain reasoning length while providing dense, accuracy-centric feedback. Extensive experiments show that ReFine-RFT achieves state-of-the-art results across four FGVC benchmarks.

1. "Cost of Thinking"

Verbose CoT systematically degrades MLLM performance on fine-grained perception — thinking length, not CoT quality, is the key factor.

2. Multi-Reward Normalization (MRN)

A plug-and-play normalization module that independently normalizes heterogeneous reward signals before aggregation, stabilizing multi-objective RFT training.

3. ReFine-RFT Framework

Integrates MRN with an ensemble reward (format, accuracy, thinking length, MLLM-based, embedding similarity) to optimize visual accuracy while controlling reasoning length.

4. State-of-the-Art on FGVC

ReFine-RFT achieves 86.5% average accuracy (+3.6% vs Visual-RFT) on four FGVC benchmarks with only a 2B backbone in 4-shot settings.

Motivation: The Cost of Thinking

CoT prompting consistently degrades accuracy across all FGVC datasets: non-reasoning models drop 3–6% on average when switching from Answer-only to CoT prompts. More strikingly, during RFT, models exhibit Reasoning Collapse: they gradually suppress verbose reasoning while optimizing for accuracy.

Fig 1 Performance degradation with CoT and Reasoning Collapse. Zero-shot CoT leads to wrong answers despite correct direct prediction (top). During RFT, reasoning length steadily shrinks while accuracy improves — an emergent Reasoning Collapse (bottom).
Fig 2 Reasoning length dynamics during RFT. Completion length rapidly decreases and stabilizes at a compact range across all four FGVC datasets, suggesting RFT implicitly discourages excessive reasoning.
Fig 3 Reasoning length vs. accuracy. As average thinking length increases, classification accuracy consistently declines across all FGVC benchmarks — a clear negative correlation.
🔍 Finding 1

For fine-grained visual tasks, thinking length is the key factor: excessive reasoning hurts performance, and MLLMs benefit more from concise responses than from elaborate reasoning.

Method: ReFine-RFT

Inspired by our Cost of Thinking analysis, we propose ReFine-RFT: a reinforcement fine-tuning framework that jointly constrains reasoning length and provides dense accuracy-centric feedback. It features two key components: an Ensemble Reward for richer supervision, and Multi-Reward Normalization (MRN) for stable multi-objective optimization.

Fig 4 Overview of ReFine-RFT. For each question, the MLLM generates G candidate responses. Each response is scored by a hybrid ensemble reward (rule-based + model-based). MRN then independently normalizes each reward component before aggregation, producing the final advantage used in GRPO policy updates.

Ensemble Reward

Standard binary accuracy rewards offer sparse supervision and fail to capture semantic similarity (e.g., "thorn apple" and "Datura stramonium" denote the same species). We design an ensemble of five complementary reward signals to provide richer, accuracy-centric feedback; a toy implementation sketch follows the list below:

📐

Format Reward Rf

Binary reward enforcing adherence to the structured <think>...</think><answer>...</answer> output template.

🎯

Classification Reward Rcls

Binary accuracy reward: yields 1 when the predicted class extracted from the answer tag matches ground truth, 0 otherwise.

📏

Thinking Length Reward Rlen

Assigns 1 when reasoning length falls within [Lmin, Lmax] = [0, 10], explicitly constraining verbosity during training.

🤖

MLLM Accuracy Reward Rmllm

A teacher MLLM grades each prediction on [0, 1] based on semantic alignment with the ground-truth label, capturing subcategory-level nuances.

🔗

Embedding Similarity Reward Remb

Cosine similarity between the predicted and ground-truth answer embeddings (via E5), providing a smooth, continuous semantic supervision signal.
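
To make the ensemble concrete, here is a minimal sketch of how the five signals could be scored for one candidate response. The rule-based terms follow the definitions above; the teacher grader is left as an abstract callable, the E5 call uses the sentence-transformers API, and the names (ensemble_reward, teacher_score) and the word-level length count are our illustrative assumptions, not the authors' released code.

```python
import re

from sentence_transformers import SentenceTransformer

# Structured output template enforced by the format reward R_f.
TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)
L_MIN, L_MAX = 0, 10  # thinking-length window from the paper
encoder = SentenceTransformer("intfloat/e5-base-v2")  # E5 embedding model

def ensemble_reward(response: str, gt_label: str, teacher_score) -> dict:
    """Score one candidate response with the five ensemble signals.
    `teacher_score(pred, gt) -> float in [0, 1]` stands in for the MLLM grader."""
    m = TEMPLATE.search(response)
    r_fmt = 1.0 if m else 0.0                                    # R_f: format
    think = m.group(1).strip() if m else response
    answer = m.group(2).strip() if m else ""

    r_cls = 1.0 if answer.lower() == gt_label.lower() else 0.0   # R_cls: binary accuracy
    # Measuring length in words is our assumption; the paper only gives the window.
    r_len = 1.0 if L_MIN <= len(think.split()) <= L_MAX else 0.0 # R_len

    r_mllm = teacher_score(answer, gt_label)                     # R_mllm: teacher grade

    # R_emb: cosine similarity of E5 embeddings (E5 expects a "query: " prefix).
    emb = encoder.encode([f"query: {answer}", f"query: {gt_label}"],
                         normalize_embeddings=True)
    r_emb = float(emb[0] @ emb[1])

    return {"format": r_fmt, "cls": r_cls, "len": r_len,
            "mllm": r_mllm, "emb": r_emb}
```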

Multi-Reward Normalization (MRN)

When multiple reward signals are combined in standard GRPO, they are first summed and then normalized as a single scalar. This causes fast-saturating rewards (e.g., format reward) to dominate early training and dilute the influence of slower, more informative rewards (e.g., accuracy). MRN addresses this by normalizing each reward independently before aggregation, placing all rewards on a comparable scale throughout training.

Algorithm 1: Advantage Normalization with MRN. MRN normalizes each reward independently before aggregation, in contrast to standard GRPO which aggregates first and normalizes once. This prevents fast-saturating rewards from dominating the gradient signal.
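
The contrast in Algorithm 1 fits in a few lines. Below is a minimal numpy sketch, assuming GRPO-style group statistics over G candidate responses and equal weighting across the K reward components (the weighting and epsilon are our assumptions):

```python
import numpy as np

def grpo_advantage(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO: sum the K components per response, then normalize
    the single scalar within the group of G candidates."""
    total = rewards.sum(axis=1)                              # (G,)
    return (total - total.mean()) / (total.std() + 1e-8)

def mrn_advantage(rewards: np.ndarray) -> np.ndarray:
    """MRN: normalize each component independently across the group,
    then aggregate, so no single high-variance component dominates."""
    norm = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return norm.sum(axis=1)                                  # (G,)

# Toy group of G=4 responses with K=2 rewards: a binary format reward and a
# low-variance embedding-similarity reward.
R = np.array([[1.0, 0.55],
              [0.0, 0.60],
              [1.0, 0.50],
              [0.0, 0.62]])
print(grpo_advantage(R))  # ordering driven almost entirely by the format column
print(mrn_advantage(R))   # both columns rescaled to unit variance before summing
```

In the toy group, sum-then-normalize ranks candidates almost entirely by the high-variance format column; after MRN, the semantic column contributes on equal footing, which is exactly the rebalancing Algorithm 1 describes.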
💡 Conclusion 1

For fine-grained visual tasks, combining multi-perspective, accuracy-centric rewards with explicit reasoning-length control yields stronger visual perception capabilities.

Experimental Results

ReFine-RFT is evaluated in a 4-shot setting on four FGVC benchmarks: Aircrafts-102, Flowers-102, Cars-196, and Oxford-Pets-37. Using Qwen2-VL-2B-Instruct as the base, ReFine-RFT with LoRA achieves 86.5% average accuracy, a +3.6% improvement over Visual-RFT.
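
For readers who want to experiment, this setup maps naturally onto an off-the-shelf GRPO trainer. The sketch below is hypothetical wiring, assuming a recent TRL release with GRPO, LoRA, and vision-language support: the data file, reward stub, and hyperparameters are placeholders, and stock TRL sums rewards before normalizing, so reproducing MRN would require overriding the trainer's advantage computation.

```python
# Hypothetical wiring into TRL's GRPOTrainer; not the authors' code.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # One scalar per sampled completion (assumes string completions).
    return [1.0 if "<answer>" in c else 0.0 for c in completions]

args = GRPOConfig(
    output_dir="refine-rft-sketch",
    num_generations=8,          # G candidate responses per question
    max_completion_length=128,  # keep completions short, per the paper's findings
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-VL-2B-Instruct",
    reward_funcs=[format_reward],  # extend with the other ensemble rewards
    args=args,
    # Placeholder file: a 4-shot FGVC split with "prompt" (and "image") columns.
    train_dataset=load_dataset("json", data_files="fgvc_4shot.json")["train"],
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```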

Performance Comparison on FGVC Benchmarks (4-shot, Top-1 Accuracy %)
Methods               | FT Method | FT Type  | Aircrafts-102 | Flowers-102  | Cars-196     | Pets-37      | Average
----------------------|-----------|----------|---------------|--------------|--------------|--------------|------------
Qwen2-VL-2B           | Zero-shot | –        | 45.9          | 54.8         | 56.8         | 66.4         | 56.0
Finedelics-8B         | SFT       | Fully-FT | 63.8          | 89.9         | 84.7         | 92.2         | 82.7
SFT-AO                | SFT       | Fully-FT | 67.9          | 58.5         | 40.5         | 55.5         | 55.6
SFT-AO                | SFT       | LoRA     | 78.3          | 74.8         | 80.0         | 87.6         | 80.2
SFT-CoT               | SFT       | LoRA     | 73.9          | 74.4         | 52.3         | 87.5         | 72.0
Visual-RFT            | RFT       | Fully-FT | 74.8          | 71.4         | 95.3         | 86.1         | 81.9
Visual-RFT            | RFT       | LoRA     | 75.6          | 74.1         | 95.7         | 86.0         | 82.9
No-Thinking-RFT       | RFT       | Fully-FT | 71.2          | –            | –            | 86.1         | –
ReFine-RFT-AO (Ours)  | RFT       | LoRA     | 78.7          | 81.4         | 93.1         | 87.6         | 85.2
ReFine-RFT-CoT (Ours) | RFT       | LoRA     | 79.3 (+3.7%)  | 81.0 (+6.9%) | 97.1 (+1.4%) | 88.6 (+2.6%) | 86.5 (+3.6%)

Analysis

Fig 6 Reward curves on Flowers-102. All reward components and validation accuracy consistently increase during training, demonstrating the effectiveness of the ensemble reward design.
Fig 7 MRN vs GRPO reward stability. MRN+GRPO achieves consistently higher reward values and significantly lower standard deviation throughout training, indicating improved optimization stability.
Fig 8 Qualitative comparison. SFT-CoT and Visual-RFT produce long reasoning chains with incorrect answers. ReFine-RFT achieves concise reasoning and higher accuracy — correctly identifying the aircraft model and cat breed that others miss.
🔑 Key Takeaway

ReFine-RFT with only a 2B backbone and 4-shot training significantly surpasses Finedelics-8B trained on the full FGVC datasets, underscoring the efficiency and scalability of our approach. The gain comes not from using more data or parameters, but from better reward shaping and reasoning control.

BibTeX

@inproceedings{zhu2026refinerft,
  title     = {Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?},
  author    = {Zhu, Jie and Su, Yiyang and Liu, Xiaoming},
  booktitle = {CVPR Findings},
  year      = {2026},
}