Move over, overpriced GPU clusters: this 112B-parameter beast runs on just two GPUs. While OpenAI and Meta are busy flexing their trillion-parameter egos, Cohere just dropped Command A Vision, a vision model that punches above its weight class while sipping electricity like a fine scotch. Need to parse a PDF riddled with cryptic corporate diagrams? It’s got you. Want to extract text from a scanned manual that looks like it survived a coffee spill? No problem.
Why This Isn’t Just Another Multimodal Gimmick
Most “multimodal” models treat images like an afterthought—like adding a salad to a burger menu to seem healthy. Cohere’s approach? Bake vision into the damn architecture. By converting images into “soft vision tokens” and feeding them through a text-based LLM, they’ve sidestepped the usual GPU-guzzling circus. Two GPUs. That’s it. Meanwhile, GPT-4 is out here demanding a server farm just to read a bar chart.
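For the curious, here is a minimal sketch of what the "soft vision token" trick amounts to: an image encoder chops the picture into patch embeddings, a small connector projects them into the LLM's token-embedding space, and the decoder consumes them exactly like text tokens. The module names and dimensions below are illustrative assumptions, not Cohere's actual architecture.

```python
# Illustrative sketch of soft vision tokens (assumed ViT-style encoder + MLP connector;
# not Cohere's real implementation, sizes are made up).
import torch
import torch.nn as nn

class VisionTokenConnector(nn.Module):
    """Projects patch embeddings from a vision encoder into the LLM's
    token-embedding space so they can be prepended to text embeddings."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 8192):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, vision_dim) from the image encoder
        # returns:      (batch, num_patches, llm_dim) "soft vision tokens"
        return self.proj(patch_embeds)

connector = VisionTokenConnector()
patches = torch.randn(1, 256, 1024)      # stand-in for one image's encoder output
text_embeds = torch.randn(1, 32, 8192)   # stand-in for a 32-token text prompt
soft_tokens = connector(patches)
# The decoder sees one unbroken sequence: image "tokens" followed by text tokens.
llm_input = torch.cat([soft_tokens, text_embeds], dim=1)  # shape (1, 288, 8192)
```

The payoff of this design is that the expensive part stays a plain text decoder: no cross-attention towers bolted on the side, just extra tokens in the sequence, which is a big part of why the serving footprint stays small.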
The Benchmark Smackdown
Cohere didn’t just benchmark Command A Vision—they threw it into a gladiator pit with GPT-4.1, Llama 4 Maverick, and Mistral’s finest. The result? 83.1% average accuracy across OCR, chart analysis, and text extraction. GPT-4.1 trailed at 78.6%, probably still recovering from the compute hangover. The kicker? It’s open-weight. No more begging proprietary models for API crumbs. The bottom line: If your “enterprise AI” still can’t read a flowchart without melting down, maybe it’s time to switch teams. 🚀👀
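And "open-weight" means you can actually kick the tires yourself. Below is a hedged sketch of what a local run might look like with Hugging Face transformers; the repo id is a placeholder (not a confirmed identifier), it assumes a recent transformers release that ships AutoModelForImageTextToText, and the exact processor call can differ from model to model.

```python
# Hypothetical local-inference sketch for an open-weight vision-language model.
# MODEL_ID is a placeholder; swap in whatever repo id the weights are published under.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "your-org/command-a-vision"  # placeholder, not the confirmed repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision keeps the memory bill sane
    device_map="auto",           # shard layers across whatever GPUs you have
)

image = Image.open("quarterly_flowchart.png")
prompt = "Summarize the process shown in this diagram."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```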