LLaVA 1.5: An open source alternative to GPT-4 Vision
The rapid development of large multimodal models (LMMs) marks a turning point in the history of generative artificial intelligence. This evolution, embodied by OpenAI’s GPT-4 Vision, takes on a new dimension with the arrival of LLaVA 1.5, a promising open source alternative. Let’s dive into this space where innovation and accessibility go hand in hand.
The mechanics of LMMs
LMMs are built from a modular architecture. They combine a pre-trained encoder for processing visual input, a large language model (LLM) for understanding and responding to user instructions, and a multimodal connector that links vision and language, as sketched below.
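To make that layout concrete, here is a minimal, purely illustrative PyTorch sketch. Every module in it (the linear patch embedder, the tiny transformer, the dimensions) is a toy stand-in chosen for this example, not LLaVA’s actual components:

```python
import torch
import torch.nn as nn

class ToyLMM(nn.Module):
    """Toy stand-in for the three-part LMM layout described above: a vision
    encoder, a multimodal connector, and a language model. Dimensions are
    deliberately small; real LMMs use pre-trained weights for parts 1 and 3."""

    def __init__(self, patch_dim=588, vision_dim=256, lm_dim=512, vocab_size=1000):
        super().__init__()
        # 1. Vision encoder (pre-trained in practice, e.g. a ViT; a linear
        #    patch embedder here just to keep the sketch runnable).
        self.vision_encoder = nn.Linear(patch_dim, vision_dim)
        # 2. Multimodal connector: maps visual features into the LLM's
        #    token-embedding space so both modalities share one sequence.
        self.connector = nn.Linear(vision_dim, lm_dim)
        # 3. Language model (pre-trained in practice, e.g. Vicuna).
        self.token_embed = nn.Embedding(vocab_size, lm_dim)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        vis = self.connector(self.vision_encoder(image_patches))  # (B, P, D)
        txt = self.token_embed(text_ids)                          # (B, T, D)
        # Visual tokens are prepended to the text and decoded jointly.
        hidden = self.lm(torch.cat([vis, txt], dim=1))
        return self.lm_head(hidden)


# Example: 16 image patches plus an 8-token instruction in one batch.
logits = ToyLMM()(torch.randn(1, 16, 588), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000])
```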
Their training takes place in two stages: an alignment phase between vision and language, followed by fine-tuning on visual instructions. This process, although effective, often requires significant computing resources and depends on large, high-quality training data.
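A hypothetical sketch of that two-stage schedule, reusing the ToyLMM toy model above; the learning rates and freezing choices here are assumptions for illustration, not a published recipe:

```python
import torch

model = ToyLMM()

# Stage 1 — vision-language alignment: freeze the pre-trained vision
# encoder and language model, and train only the connector so that
# visual features land in the LLM's embedding space.
for p in model.parameters():
    p.requires_grad = False
for p in model.connector.parameters():
    p.requires_grad = True
stage1_opt = torch.optim.AdamW(model.connector.parameters(), lr=1e-3)

# Stage 2 — visual instruction tuning: also unfreeze the language model
# and fine-tune on instruction/response pairs grounded in images.
# The vision encoder stays frozen throughout in this sketch.
for module in (model.connector, model.token_embed, model.lm, model.lm_head):
    for p in module.parameters():
        p.requires_grad = True
stage2_opt = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```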
The advantages of LLaVA 1.5
LLaVA 1.5 uses the CLIP model as its visual encoder and Vicuna as its language model. The original LLaVA relied on the text-only versions of ChatGPT and GPT-4 to generate its training data and used a simple linear layer to connect the language model and the visual encoder; LLaVA 1.5 upgrades this connector to a multi-layer perceptron (MLP). Together with the addition of approximately 600,000 examples to its training data, this allowed LLaVA 1.5 to outperform other open source LMMs on 11 of 12 multimodal benchmarks.
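As an illustration, a two-layer MLP connector in this spirit might look like the following; the class name and default dimensions are assumptions chosen to match CLIP ViT-L/14 (1024-dimensional features) and Vicuna-7B (4096-dimensional token embeddings):

```python
import torch.nn as nn

class MLPConnector(nn.Module):
    """Two-layer MLP connector in the spirit of LLaVA 1.5's design.
    Dimensions are illustrative, not taken from a released checkpoint."""

    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_features):
        # One projected embedding per visual token, ready to be mixed
        # into the language model's input sequence.
        return self.net(vision_features)
```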
The future of open source LMMs
The online demo of LLaVA 1.5, accessible to everyone, shows promising results, even for a model trained on a limited budget. However, because part of its training data was generated with ChatGPT, use of the model is limited to non-commercial purposes.
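For readers who would rather run the model locally than use the demo, a minimal sketch with the Hugging Face transformers library might look like this, assuming the community-converted llava-hf/llava-1.5-7b-hf checkpoint; the prompt template and test image URL are just examples:

```python
# Assumes: pip install transformers accelerate pillow requests
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Any test image works; this COCO validation image is a common example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat do you see in this picture? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```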
Despite this restriction, LLaVA 1.5 paves the way for the future of open source LMMs. Its cost-effectiveness, its ability to generate scalable training data, and its efficiency in following visual instructions make it a forerunner of the innovations to come.
LLaVA 1.5 is just the first step in a series of developments that will keep pace with advances in the open source community. With more efficient and accessible models on the horizon, we can envision a future where generative AI technology is within everyone’s reach, revealing the full potential of artificial intelligence.