Redmond, WA, USA, May 21, 2024 – At the Build 2024 conference, Microsoft Corporation introduced a significant addition to its Phi-3 family of small language models (SLMs): the new multimodal model, Phi-3-vision. The announcement underscores the company's commitment to making powerful artificial intelligence capabilities more accessible and efficient across a wide range of tasks, including those that require analyzing visual information.
Phi-3-vision retains the compactness and efficiency of previous Phi-3 models while adding the ability to understand and interpret both textual and visual data. The model can analyze images, answer questions about them, generate text descriptions, and perform other tasks that require joint processing of text and images. For example, Phi-3-vision can extract information from charts and graphs, create image captions, or even assist with robot navigation based on visual surroundings. At 4.2 billion parameters, the model is lightweight enough to deploy on a variety of devices, including mobile platforms and resource-constrained PCs, without significant performance loss for its class.
Microsoft positions Phi-3-vision as an optimal solution for developers who need fast, cost-effective AI models with multimodal capabilities. The company emphasizes that Phi-3-vision was trained on high-quality, filtered data to ensure reliability and reduce the risk of generating undesirable content. The model will be available through the model catalog in Azure AI Studio, simplifying its integration into applications and services. This step is part of Microsoft's broader strategy to democratize AI and give developers flexible tools for building the next generation of intelligent applications.
The announcement of Phi-3-vision follows the recent release of the Phi-3-mini, Phi-3-small, and Phi-3-medium text models, which have already drawn attention for their performance relative to their small size. The addition of visual capabilities significantly expands the potential use cases for the Phi-3 family, opening the door to innovation in areas such as education, accessibility, and retail, where combining textual and visual AI understanding can bring tangible benefits.