
The rise of deep research features and other AI-powered analysis has given rise to more models and services looking to simplify that process and read more of the documents businesses actually use.
Canadian AI company Cohere is making that bet with its models, including a newly released vision model, aiming to take on deep research functions, and says they are optimized for enterprise use cases.
The company released Command A Vision, a visual model aimed specifically at enterprise use cases and built on the back of its Command A model. The 112-billion-parameter model can "unlock valuable insights from visual data and make highly accurate, data-driven decisions through optical character recognition (OCR) and image analysis," the company said.
"Whether it's interpreting complex product diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges," the company said in a blog post.
This means Command A Vision can read and analyze the most common types of images businesses need: graphs, charts, diagrams, scanned documents and PDFs.
Since it is built on Command A's architecture, Command A Vision requires two or fewer GPUs, just as the text model does. The vision model also retains Command A's text capabilities, letting it read words on images, and understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces enterprises' total cost of ownership and is fully optimized for enterprise retrieval use cases.
Cohere said it followed a LLaVA architecture to build its Command models, including the vision model. This architecture turns visual features into soft vision tokens, which can be split into different tiles.
These tiles are passed to the Command A text tower, "a dense, 111B-parameter textual LLM," the company said. "In this manner, a single image consumes up to 3,328 tokens."
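Cohere has not published the details of its tiling scheme, but the tile-based token budget described above can be sketched as follows. The tile size and per-tile token count here are illustrative assumptions; only the 3,328-token-per-image ceiling comes from the company's post.

```python
# Illustrative sketch of tile-based image tokenization (not Cohere's actual code).
# Assumptions: each tile covers 512x512 pixels and encodes to 256 soft vision
# tokens; the 3,328-token-per-image cap is the figure Cohere states.
import math

TILE_PX = 512            # assumed tile edge length in pixels
TOKENS_PER_TILE = 256    # assumed soft vision tokens per tile
MAX_IMAGE_TOKENS = 3328  # stated per-image ceiling

def image_token_budget(width: int, height: int) -> int:
    """Estimate how many soft vision tokens an image would consume."""
    tiles = math.ceil(width / TILE_PX) * math.ceil(height / TILE_PX)
    return min(tiles * TOKENS_PER_TILE, MAX_IMAGE_TOKENS)

print(image_token_budget(1024, 768))   # 4 tiles -> 1024 tokens
print(image_token_budget(4000, 3000))  # 48 tiles, capped at 3328 tokens
```

Under these assumptions, small images cost far fewer tokens than the ceiling, while large scans hit the cap rather than growing without bound.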
Cohere said it trained the vision model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning from human feedback (RLHF).
"This approach enables the mapping of image encoder features to the language model embedding space," the company said. "In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of instruction-following multimodal tasks."
Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.
Command A Vision outperformed OpenAI's GPT-4.1, Meta's Llama 4 Maverick, and Mistral's Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral's OCR-focused API, Mistral OCR.
Command A Vision beat the other models on tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision averaged a score of 83.1%, compared to 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
Most large language models (LLMs) these days are multimodal, meaning they can generate or understand visual media such as photos or videos. However, enterprises generally work with more graphical documents, such as charts and PDFs, so extracting information from these unstructured data sources is often difficult.
With deep research on the rise, the importance of models capable of reading, analyzing and even retrieving unstructured data has grown.
Cohere also said it is offering Command A Vision with open weights, in hopes that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some interest from developers.