MiniGPT-4 is a vision-language model that improves vision-language understanding by aligning a frozen visual encoder with a frozen large language model (LLM). Its design combines a vision encoder built from a pre-trained ViT and Q-Former, a single linear projection layer, and the Vicuna LLM.
It exhibits many capabilities similar to those of GPT-4, such as generating detailed image descriptions and creating websites from handwritten drafts.
Only the linear projection layer is trained, so that the visual features align with the Vicuna model; the visual encoder and the LLM stay frozen. This makes training highly computationally efficient, using approximately 5 million aligned image-text pairs.
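
As a rough illustration of the design described above, the following pure-Python sketch passes dummy visual features through a single linear projection and counts that layer's parameters. It is a minimal sketch under stated assumptions, not the MiniGPT-4 implementation: `frozen_visual_encoder`, `LinearProjection`, and `projection_param_count` are hypothetical names, the encoder is a stand-in that returns random features, and the sizes 768 (Q-Former output width) and 5120 (LLM embedding width) are assumed values that the text above does not state.

```python
import random


def frozen_visual_encoder(num_query_tokens, feat_dim, seed=0):
    """Hypothetical stand-in for the frozen ViT + Q-Former.

    Returns dummy features: one vector of length feat_dim per query token.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(feat_dim)]
            for _ in range(num_query_tokens)]


class LinearProjection:
    """The single trainable layer in this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=1):
        rng = random.Random(seed)
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, tokens):
        # Project each visual token into the LLM embedding space.
        return [[sum(w * x for w, x in zip(row, tok)) + b
                 for row, b in zip(self.weight, self.bias)]
                for tok in tokens]


def projection_param_count(in_dim, out_dim):
    """Weights plus biases of one linear layer."""
    return in_dim * out_dim + out_dim


# Toy sizes so the sketch runs instantly.
visual_tokens = frozen_visual_encoder(num_query_tokens=4, feat_dim=8)
proj = LinearProjection(in_dim=8, out_dim=16)
llm_inputs = proj(visual_tokens)
print(len(llm_inputs), len(llm_inputs[0]))  # 4 16

# Assumed real-scale sizes (not stated in the text above).
print(projection_param_count(768, 5120))  # 3937280
```

Under these assumed sizes, the trainable layer holds about 3.9 million parameters, which is small next to the frozen encoder and LLM and is consistent with the efficiency claim above.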