Stop paying twice to see the same pixels.
Most of what a video VLM is told to ingest is evidence the stack already paid for. We reuse the model’s working state across turns and only refresh what actually changed — training-free, no measurable accuracy drift, on frozen open-weights models.
What it actually does
- Across follow-up turns, almost everything the model is asked to ingest is unchanged. The factory wall did not move; the cache says so.
- Snapshot the model's working state right after it has seen the video, then reuse that snapshot across follow-up questions instead of re-ingesting the video each time.
- Tested on Gemma 4 26B-A4B and Qwen 2.5-VL-7B-4bit, frozen weights, M5-class hardware. Every run is paired against a cold-dense baseline on identical inputs.
- Training-free. No new weights, no fine-tune, no distillation. The mechanism is in how the cache is borrowed across turns.
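The snapshot-and-reuse mechanism can be sketched with a toy model. Everything here is illustrative: `ToyKVCache`, `ToyVLM`, and the token counts are hypothetical stand-ins, not the actual runtime's API. The point is the accounting: the first turn pays full prefill once, and each follow-up forks the snapshot and pays only for its own new tokens.

```python
from copy import deepcopy

class ToyKVCache:
    """Toy stand-in for a transformer KV cache: the tokens already attended to."""
    def __init__(self):
        self.entries = []

class ToyVLM:
    """Hypothetical model wrapper that counts how many tokens each call pays to prefill."""
    def __init__(self):
        self.tokens_paid = 0

    def prefill(self, tokens, cache=None):
        cache = cache if cache is not None else ToyKVCache()
        new = tokens[len(cache.entries):]  # only tokens the cache has not seen yet
        self.tokens_paid += len(new)
        cache.entries.extend(new)
        return cache

model = ToyVLM()
video_tokens = [f"frame_{i}" for i in range(1024)]  # dense video evidence

# Turn 1: pay full prefill once, then keep the working state as a snapshot.
snapshot = model.prefill(video_tokens)

# Follow-up turns: fork the snapshot and pay only for the new question tokens.
for question in ["what changed?", "where is the forklift?"]:
    cache = deepcopy(snapshot)
    model.prefill(video_tokens + [question], cache=cache)

# Two follow-ups cost 2 extra tokens, not 2 more full video prefills.
print(model.tokens_paid)
```

With re-ingestion the same three turns would have paid 3 × 1025 tokens; with reuse they pay 1024 + 2.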
Numbers
| Model | Frames / turn | Median speedup | Median fps | n |
|---|---|---|---|---|
| Gemma 4 26B-A4B | 8 | 9.11× | 27.0 | 21 |
| Gemma 4 26B-A4B | 32 | 18.7× | 54.7 | 21 |
| Qwen 2.5-VL-7B-4bit | 20 (short/med/long clips) | 14.9–35.9× | — | 93 |
All rows are paired against cold-dense baselines on the same frames, prompt, and decoding seed. Engineering caveats — including the cache-correctness boundary on mixed-attention models and the upstream mlx-vlm fix that closes it — are written up in the paper.
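The paired methodology reduces to a simple per-clip computation. The latencies below are made-up placeholders, not measured numbers; what matters is that each speedup is a ratio of a cold-dense run and a cache-reuse run over the same clip, and the reported figure is the median across clips.

```python
import statistics

# Hypothetical per-clip wall-clock latencies (seconds). Each clip is run
# cold-dense and cache-reuse on the same frames, prompt, and decoding seed.
paired = [
    {"clip": "clip_a", "cold_s": 12.4, "warm_s": 1.31},
    {"clip": "clip_b", "cold_s": 11.9, "warm_s": 1.42},
    {"clip": "clip_c", "cold_s": 13.1, "warm_s": 1.18},
]

# Per-clip speedup is a within-pair ratio, so clip difficulty cancels out.
speedups = [row["cold_s"] / row["warm_s"] for row in paired]
print(f"median speedup: {statistics.median(speedups):.2f}x")
```

Taking the median rather than the mean keeps one pathological clip from dominating the headline number.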
See it
Per-clip routing-budget overlays. Orange highlights what the runtime actually paid to re-ingest on each frame; everything else is reused state.
Why it matters
Continuous-video agents — computer-use, screen recording, robotics — need to observe at 30 fps or higher. If the model re-ingests the entire scene on every decision, you cannot keep up. Reusing what already happened lets a 26B-class open-weights VLM perceive at 24–134 fps per follow-up turn, which is the throughput regime where 30 fps observation actually becomes tractable on a laptop.
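The fps column in the table doubles as a latency budget. A small sanity check, using only the two Gemma rows from the table above (the helper function is illustrative):

```python
# fps = frames ingested per follow-up turn / wall-clock seconds per turn,
# so the table's fps figures imply a per-turn latency.
def turn_latency_s(frames_per_turn, fps):
    return frames_per_turn / fps

for frames, fps in [(8, 27.0), (32, 54.7)]:
    print(f"{frames} frames/turn at {fps} fps -> "
          f"{turn_latency_s(frames, fps):.3f} s/turn")
```

A 30 fps stream stays tractable when the model's perception fps meets or exceeds it; by that bar the 32-frame row clears the threshold while the 8-frame row falls just short.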
The bigger picture: most video pipelines hand the model dense pixels every frame and ask it to rediscover what didn't move. Almost everything in a real scene didn't move. Anti-recomputation is the cache-side answer to that waste, and the waste compounds every time the input rate goes up.