Flatness Preserves Instruction Following in
Vision-Language-Action Models

Carnegie Mellon University · CoRL 2026
Flatness-preserving finetuning reduces the sharp loss peaks that arise from finetuning on limited data and improves instruction following on counterfactual tasks.

Abstract

Vision-language-action (VLA) models have the potential for open-world generalization because they build on pretrained vision-language representations. Downstream finetuning on limited robot data, however, often degrades these representations and yields brittle policies that ignore language instructions in favor of visual shortcuts, a failure mode we term instruction blindness. We hypothesize that standard finetuning with limited data applies gradients at only a sparse set of points, which manifests as a sharp loss landscape with high-curvature minima. We propose to address this directly through flatness-preserving optimization while finetuning on the exact same data: a flatter landscape yields a model that is more robust to perturbations in weight space. Specifically, we demonstrate that simply applying sharpness-aware minimization (SAM) during VLA finetuning improves instruction following by over 60% across multiple simulation and real-world benchmarks, without additional data, architectural modification, or retraining. We further analyze and quantify the effect of selective sharpness, and show that our approach is complementary to existing guidance techniques.
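For readers unfamiliar with SAM, the sketch below illustrates the generic two-step update it is usually described with: first move the weights a distance rho along the normalized gradient toward a nearby higher-loss point, then update the original weights using the gradient measured at that perturbed point. This is not the authors' training code. The toy quadratic loss, the names `sam_step`, `grad_fn`, and `CURVATURES`, and the values of `lr` and `rho` are all assumptions chosen for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]

# Toy anisotropic quadratic (an assumption for illustration only):
# L(w) = 0.5 * sum(c_i * w_i^2), steep along w[0] and flat along w[1].
CURVATURES: Vector = [10.0, 0.1]


def loss(w: Vector) -> float:
    """Toy loss standing in for the finetuning objective."""
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURES, w))


def grad(w: Vector) -> Vector:
    """Analytic gradient of the toy loss."""
    return [c * wi for c, wi in zip(CURVATURES, w)]


def sam_step(
    w: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float = 0.05,   # hypothetical learning rate
    rho: float = 0.05,  # hypothetical neighborhood radius
) -> Vector:
    """One SAM-style update with plain gradient descent as the base step."""
    g = grad_fn(w)
    # Small constant avoids division by zero at a critical point.
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascent step: perturb the weights by rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to the
    # original weights, which penalizes high-curvature directions.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


# Usage: run a few SAM steps from an arbitrary starting point.
w = [1.0, 1.0]
for _ in range(20):
    w = sam_step(w, grad)
```

In an actual finetuning run, `grad_fn` would be the minibatch gradient of the policy loss and the descent step would typically go through the base optimizer. Each SAM step needs two gradient evaluations, so it costs roughly twice as much as a standard step.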

Representation Analysis

Object-token representation probe (t-SNE): standard finetuning disrupts semantic clusters, whereas SAM finetuning preserves the pretrained VLM structure.

Video Presentation

Simulation Results

LIBERO counterfactual rollouts: π₀.₅ vs. π₀.₅ + SAM on the same instruction.

Pick the butter and place it in the basket

π₀.₅

π₀.₅ on butter task

π₀.₅ + SAM

π₀.₅ + SAM on butter task

Pick the cream cheese and place it in the basket

π₀.₅

π₀.₅ on cream cheese task

π₀.₅ + SAM

π₀.₅ + SAM on cream cheese task

Pick the orange juice and place it in the basket

π₀.₅

π₀.₅ on orange juice task

π₀.₅ + SAM

π₀.₅ + SAM on orange juice task

Pick the salad dressing and place it in the basket

π₀.₅

π₀.₅ on salad dressing task

π₀.₅ + SAM

π₀.₅ + SAM on salad dressing task

Real-World Results

Real-world DROID pick-and-place: π₀.₅ + SAM improves grounding and task success on counterfactual instructions, including under background perturbations.

Counterfactual pick with towel background

π₀.₅

π₀.₅ real-world towel background task

π₀.₅ + SAM

π₀.₅ + SAM real-world towel background task

Counterfactual pick-and-place

π₀.₅

π₀.₅ real-world pick task

π₀.₅ + SAM

π₀.₅ + SAM real-world pick task

Paper

BibTeX

@inproceedings{zhang2026flatness,
  title={Flatness Preserves Instruction Following in Vision-Language-Action Models},
  author={Zhang, Haochen and Bisk, Yonatan},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2026},
  url={https://HaochenZ11.github.io/papers/flatness-vla/}
}