Flatness Preserves Instruction Following in Vision-Language-Action Models
Abstract
Vision-language-action (VLA) models have the potential for open-world generalization because they build on pretrained vision-language representations. Downstream finetuning on limited robot data, however, often degrades these representations and yields brittle policies that ignore language instructions in favor of visual shortcuts, a failure mode we term instruction blindness. We hypothesize that standard finetuning with limited data applies gradients at only a sparse set of points, which manifests as a sharp loss landscape with high-curvature minima. We propose to address this directly through flatness-preserving optimization while finetuning on the exact same data: a flatter loss landscape yields a model that is more robust to perturbations in weight space. Specifically, we demonstrate that simply applying sharpness-aware minimization (SAM) during VLA finetuning improves instruction following by over 60% across multiple simulation and real-world benchmarks, without additional data, architectural modification, or retraining. We further analyze selective sharpness, quantify its effects, and show that our approach is complementary to existing guidance techniques.
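As a rough illustration of the SAM update mentioned above, here is a minimal sketch of one step in its standard form: take the gradient at the current weights, move a distance rho along the normalized gradient to a nearby higher-loss point, then update the original weights using the gradient measured at that point. The toy linear model, the data, and the names `loss`, `loss_grad`, `sam_step`, `lr`, and `rho` are all assumptions made for illustration. This is not the training code or the hyperparameters used for π₀.₅.

```python
import math

def loss(w, data):
    """Mean squared error of a toy 1-D linear model y = w[0] * x + w[1]."""
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in data) / len(data)

def loss_grad(w, data):
    """Analytic gradient of the mean squared error with respect to w."""
    g0 = g1 = 0.0
    for x, y in data:
        err = w[0] * x + w[1] - y
        g0 += 2.0 * err * x
        g1 += 2.0 * err
    n = len(data)
    return [g0 / n, g1 / n]

def sam_step(w, data, lr=0.1, rho=0.05):
    """One SAM update on the toy model (lr and rho are illustrative values)."""
    g = loss_grad(w, data)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # zero gradient: no ascent direction, leave w unchanged
    # Step 1: move to a nearby point of higher loss, within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: measure the gradient at the perturbed weights.
    g_adv = loss_grad(w_adv, data)
    # Step 3: apply that gradient to the ORIGINAL weights.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy data generated from y = 2x + 1 (assumed for this example only).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w = [0.0, 0.0]
for _ in range(200):
    w = sam_step(w, data)
```

Setting `rho=0` recovers plain gradient descent, which makes the extra cost visible: each SAM step needs two gradient evaluations instead of one.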
Representation Analysis
Object-token representation probe: standard finetuning disrupts semantic clusters; SAM finetuning preserves pretrained VLM structure.
Simulation Results
LIBERO counterfactual rollouts: π₀.₅ vs. π₀.₅ + SAM on the same instruction.
Pick the butter and place it in the basket
π₀.₅
π₀.₅ + SAM
Pick the cream cheese and place it in the basket
π₀.₅
π₀.₅ + SAM
Pick the orange juice and place it in the basket
π₀.₅
π₀.₅ + SAM
Pick the salad dressing and place it in the basket
π₀.₅
π₀.₅ + SAM
Real-World Results
DROID pick-and-place: π₀.₅ + SAM improves grounding and task success on counterfactual instructions, including with background perturbations.
Counterfactual pick with towel background
π₀.₅
π₀.₅ + SAM
Counterfactual pick-and-place
π₀.₅
π₀.₅ + SAM
BibTeX
@inproceedings{zhang2026flatness,
title={Flatness Preserves Instruction Following in Vision-Language-Action Models},
author={Zhang, Haochen and Bisk, Yonatan},
booktitle={Conference on Robot Learning (CoRL)},
year={2026},
url={https://HaochenZ11.github.io/papers/flatness-vla/}
}