Multimodal Vision Models

Multimodal vision models connect vision and language (or other modalities), enabling image-text understanding, visual question answering, image captioning, and open-vocabulary visual tasks.

Vision-Language Contrastive Learning: CLIP

Radford et al. (2021). Jointly train an image encoder and a text encoder to align matching image-text pairs, using 400M noisy pairs collected from the web.

Architecture:

  • Image encoder: ViT-L/14 or ResNet-based.
  • Text encoder: Transformer (GPT-2 style), max 77 tokens.

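Training objective: symmetric contrastive loss over a batch of $N$ image-text pairs. Each image should be similar to its paired text and dissimilar to the other $N-1$ texts, and likewise each text with respect to the images. With $\text{sim}(\cdot,\cdot)$ the cosine similarity of the embeddings and $\tau$ a learned temperature: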

\[\mathcal{L}_\text{CLIP} = -\frac{1}{2N}\left(\sum_{i=1}^N \log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_i, t_j)/\tau)} + \sum_{i=1}^N \log \frac{\exp(\text{sim}(t_i, v_i)/\tau)}{\sum_j \exp(\text{sim}(t_i, v_j)/\tau)}\right)\]
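The loss above can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and the temperature is held fixed at an assumed value of 0.07 rather than learned.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss for N paired embeddings of shape (N, D).

    tau is an assumed fixed temperature; in CLIP it is a learned parameter.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / tau                        # logits[i, j] = sim(v_i, t_j) / tau
    n = logits.shape[0]
    idx = np.arange(n)
    i2t = log_softmax(logits, axis=1)[idx, idx]   # image -> text: normalize over texts j
    t2i = log_softmax(logits, axis=0)[idx, idx]   # text -> image: normalize over images j
    return -(i2t.sum() + t2i.sum()) / (2 * n)
```

When all embeddings are identical the logits are uniform and the loss equals $\log N$, which is a convenient sanity check.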

Zero-shot classification: encode each class name as a text prompt (“a photo of a {class}”); predict the class whose prompt embedding is nearest (highest cosine similarity) to the image embedding in the joint space.
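A minimal sketch of the zero-shot procedure, assuming the image and prompt embeddings have already been computed. The toy 3-dimensional vectors below stand in for real encoder outputs, and `zero_shot_classify` is a hypothetical helper name.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Return the class whose prompt embedding is most cosine-similar to the image."""
    v = image_emb / np.linalg.norm(image_emb)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    scores = t @ v                       # one cosine similarity per class
    return class_names[int(np.argmax(scores))], scores

class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in class_names]   # text that would go to the text encoder

# Toy stand-ins for encoder outputs (assumption: a real system embeds `prompts` and the image).
class_text_embs = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])
image_emb = np.array([0.1, 0.9, 0.2])

label, scores = zero_shot_classify(image_emb, class_text_embs, class_names)
print(label)  # dog
```

No classifier weights are trained: adding a class only requires embedding one more prompt.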

Impact: CLIP embeddings are highly transferable. Used as the backbone for:

  • Zero-shot classification
  • Image retrieval
  • Open-vocabulary detection/segmentation
  • Diffusion model conditioning (Stable Diffusion)
  • Visual question answering via language model integration

CLIP Variants

OpenCLIP: open-source CLIP trained on LAION-2B. Comparable or superior to original CLIP.

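SigLIP: sigmoid loss instead of softmax contrastive. Each image-text pair is scored independently as positive or negative, so no softmax normalization over the batch is needed. This decouples the loss from the batch size: it is cheaper to scale to large batches and also holds up better than softmax at small batch sizes.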
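For comparison with the softmax objective, the pairwise sigmoid loss can be sketched as below. This is an illustrative NumPy version: the scale `t` and bias `b` are learnable in SigLIP and are fixed here at assumed values (10 and -10), and the function name is hypothetical.

```python
import numpy as np

def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (image, text) pair is an independent binary problem.

    t (scale) and b (bias) are learnable in SigLIP; fixed here as an assumption.
    """
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    u = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = t * (v @ u.T) + b
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0        # +1 for matching pairs, -1 for all others
    # -log sigmoid(x) = log(1 + exp(-x)), computed stably via logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / n
```

Each entry of `per_pair` depends only on its own logit, so there is no normalization term that couples the examples in a batch.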

DFN (Data Filtering Networks, Apple): learn to filter web data for quality before CLIP training. Strong data curation improves final model quality significantly.

Image Captioning

Generate a natural language description of an image.

BLIP (Bootstrapping Language-Image Pre-training): unified model for captioning, retrieval, and VQA. Introduces CapFilt: a captioner generates synthetic captions for web images, and a filter removes noisy captions. Uses an image encoder together with an image-grounded text encoder and text decoder that share most of their parameters.

BLIP-2: efficient architecture that bridges a frozen image encoder (ViT) and a frozen LLM via a lightweight Q-Former: a small Transformer in which a fixed set of learned query tokens interact through self-attention and read from the image features through cross-attention. The image encoder and LLM stay frozen, so strong performance is achieved with few trainable parameters.
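The core idea of the Q-Former, a small set of learned queries reading from frozen image features, can be illustrated with one single-head cross-attention step. This is a sketch under stated assumptions, not the BLIP-2 architecture: the sizes, random weights, and function names are illustrative, and self-attention, feed-forward layers, and multiple heads are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention: learned queries read from frozen image features."""
    q = queries @ w_q                     # (num_queries, d)
    k = image_feats @ w_k                 # (num_patches, d)
    v = image_feats @ w_v                 # (num_patches, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (num_queries, num_patches)
    return attn @ v, attn

rng = np.random.default_rng(0)
num_queries, num_patches, d_img, d = 32, 257, 64, 48    # illustrative sizes
queries = rng.normal(size=(num_queries, d))             # trainable in a real model
image_feats = rng.normal(size=(num_patches, d_img))     # output of the frozen image encoder
w_q = rng.normal(size=(d, d)) / np.sqrt(d)
w_k = rng.normal(size=(d_img, d)) / np.sqrt(d_img)
w_v = rng.normal(size=(d_img, d)) / np.sqrt(d_img)

out, attn = cross_attend(queries, image_feats, w_q, w_k, w_v)
print(out.shape)  # (32, 48)
```

The output length is set by the number of queries, not the number of image patches, which is what keeps the interface to the LLM small.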

Flamingo (DeepMind): few-shot image-text model. Interleave image features into a pretrained language model via cross-attention layers added at regular intervals. Supports interleaved image-text sequences; strong few-shot learner.

Visual Question Answering (VQA)

Given an image and a natural language question, produce an answer.

VQA v2 benchmark: 204k images, 1.1M questions, balanced to reduce language bias.

Early fusion approaches: concatenate visual features with question encoding; classify over answer vocabulary.

MCAN (Deep Modular Co-Attention Networks): stacked layers of self-attention over question words and over image regions, plus question-guided cross-attention from image features to question features.

Modern approach (LLM + visual encoder): project image tokens into the LLM token space; auto-regressively generate the answer.

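Metrics: VQA accuracy for open-ended answers, where a predicted answer scores min(#annotators who gave that answer / 3, 1) against the 10 human answers collected per question; BLEU/CIDEr for generated free-form answers.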
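The open-ended VQA accuracy gives full credit when at least three annotators gave the predicted answer and partial credit otherwise. A simplified sketch follows; the official evaluation additionally applies heavier answer normalization and averages over annotator subsets, which is omitted here, and `vqa_accuracy` is a hypothetical helper name.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#annotators who gave the predicted answer / 3, 1)."""
    def norm(s):
        # Minimal normalization (assumption); the official script does more.
        return s.strip().lower()
    matches = sum(norm(a) == norm(predicted) for a in human_answers)
    return min(matches / 3.0, 1.0)

# Ten human answers for one question: two say "2", eight say "two".
answers = ["2"] * 2 + ["two"] * 8
print(vqa_accuracy("two", answers))  # 1.0
```

The soft score tolerates disagreement between annotators: an answer given by only two of them still earns 2/3 credit.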

Large Vision-Language Models (LVLMs)

Combine a powerful visual encoder with an instruction-following LLM.

LLaVA

Liu et al. (2023). Connect a CLIP ViT encoder to a LLaMA-based LLM (Vicuna) via a simple linear projection. Fine-tune on visual instruction data generated with GPT-4.

LLaVA-1.5: replace the linear projection with a two-layer MLP; use CLIP ViT-L/14 at 336px input resolution. Strong VQA and OCR.
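The two connector designs can be contrasted in a few lines. This is a sketch with small illustrative dimensions and random weights (real models use the encoder's and LLM's hidden sizes, and the weights are trained); the function names and the GELU activation in the MLP are assumptions of this example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear_projector(patch_feats, w, b):
    """LLaVA-style connector: one linear map from vision space to LLM embedding space."""
    return patch_feats @ w + b

def mlp_projector(patch_feats, w1, b1, w2, b2):
    """LLaVA-1.5-style connector: two linear layers with a nonlinearity in between."""
    return gelu(patch_feats @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
num_patches, d_vis, d_llm, num_text = 6, 16, 32, 4       # illustrative sizes
patch_feats = rng.normal(size=(num_patches, d_vis))      # visual encoder output
text_embs = rng.normal(size=(num_text, d_llm))           # LLM embeddings of the prompt tokens

w1, b1 = rng.normal(size=(d_vis, d_llm)) * 0.1, np.zeros(d_llm)
w2, b2 = rng.normal(size=(d_llm, d_llm)) * 0.1, np.zeros(d_llm)

visual_tokens = mlp_projector(patch_feats, w1, b1, w2, b2)
# The LLM then consumes visual tokens and text tokens as a single sequence.
llm_input = np.concatenate([visual_tokens, text_embs], axis=0)
print(llm_input.shape)  # (10, 32)
```

Either way, each image patch becomes one vector in the LLM's embedding space, so no change to the LLM architecture is needed.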

LLaVA-NeXT: dynamic high-resolution input; splits high-res images into tiles to preserve fine detail.
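The tiling step can be sketched as computing crop boxes on a regular grid. This is only an illustration of the splitting idea: the helper name and the 336-pixel tile size are assumptions, and a real pipeline would also handle resizing or padding to a supported grid and may add a downscaled overview of the whole image.

```python
def tile_boxes(width, height, tile=336):
    """Return (left, top, right, bottom) crop boxes covering the image in a grid.

    Edge tiles are clipped to the image bounds rather than padded (assumption).
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes

boxes = tile_boxes(672, 672)
print(len(boxes))  # 4
```

Each box would then be cropped and encoded separately by the visual encoder, so the number of visual tokens grows with image resolution.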

InstructBLIP

Extend BLIP-2 with instruction tuning. Q-Former attends to image features; outputs are fed to a frozen LLM. Strong zero-shot performance.

GPT-4V / GPT-4o

OpenAI multimodal models. Accept interleaved image-text inputs; produce rich, detailed descriptions, reasoning, and code. Support multiple images per prompt; strong at OCR, chart reading, spatial reasoning.

Gemini

Google multimodal architecture. Gemini 1.5 Pro: 1M token context with native video, audio, image, and text understanding. Trained natively multimodal from the start (not a retrofit).

Open-Vocabulary Detection and Segmentation

Extend detection and segmentation to arbitrary categories described in text, not just a fixed training set.

Grounding DINO: DETR-based detector conditioned on text. Detect any category described in natural language. Trained on grounding datasets (COCO, Flickr30k) + detection datasets.

OWL-ViT: CLIP ViT backbone; fine-tune with detection heads conditioned on text embeddings. Zero-shot open-vocabulary detection.

GLIP: unify grounding and detection as phrase grounding. Align phrase tokens and region features contrastively.

SAM (Segment Anything): prompt-based segmentation (point, box, mask). Open-vocabulary segmentation when combined with a text-grounded detector.

SEEM / ODISE: unified open-vocabulary detection + segmentation with text and visual prompts.

Image-Text Retrieval

Retrieve the most relevant image for a text query (or vice versa).

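CLIP-based retrieval: encode the query and every database item; rank by cosine similarity. Works zero-shot, but quality is bounded by what contrastive pretraining captures: a single global embedding per image or text tends to miss fine-grained detail and compositional structure.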

FAISS index: build an approximate nearest-neighbor (ANN) index over the encoded image embeddings for fast retrieval from millions of images.
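Embedding-based retrieval reduces to a top-k search by cosine similarity. The sketch below does the search exactly with NumPy, which is fine for small collections; at the scale of millions of images, this brute-force step is what an ANN index such as FAISS would replace. The function names are hypothetical and the vectors are toy stand-ins for encoder outputs.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_emb, db_embs, k=5):
    """Exact top-k by cosine similarity; returns (indices, scores), best first."""
    scores = normalize(db_embs) @ normalize(query_emb)
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy database of four image embeddings and one text-query embedding.
db_embs = np.eye(4)
query_emb = np.array([0.0, 0.1, 1.0, 0.0])
top, top_scores = retrieve(query_emb, db_embs, k=2)
print(top)  # [2 1]
```

Normalizing once at indexing time lets cosine similarity be computed as a plain inner product at query time.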

MS-COCO retrieval: standard benchmark. R@1/5/10 measures recall at rank 1, 5, 10.
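Recall@K is the fraction of queries whose correct item appears among the top K results. A minimal sketch follows, assuming exactly one correct item per query; benchmarks with several valid matches per query, such as multiple captions per image, count a hit if any of them is in the top K, which is not handled here. The function name is hypothetical.

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """sim[i, j]: similarity of query i to item j; gt[i]: index of the correct item."""
    ranking = np.argsort(-sim, axis=1)                    # items sorted best-first per query
    # Position of the ground-truth item in each ranking (0 = top result).
    pos = np.argmax(ranking == np.asarray(gt)[:, None], axis=1)
    return {k: float(np.mean(pos < k)) for k in ks}

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.3, 0.8],
                [0.1, 0.7, 0.6]])
result = recall_at_k(sim, gt=[0, 1, 2], ks=(1, 2))
print(result[2])  # 1.0
```

In this toy case only the first query ranks its correct item first, so R@1 is 1/3 while R@2 is 1.0.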

Evaluation of Multimodal Models

Benchmark        What it tests
VQAv2            Visual question answering
GQA              Compositional spatial reasoning
TextVQA          OCR + reasoning
MMBench          Broad multimodal capabilities
MMMU             Multi-discipline expert knowledge
SEED-Bench       12 evaluation dimensions
HallusionBench   Hallucination robustness
LLaVA-Bench      Detailed visual descriptions

Hallucination in LVLMs: models describe objects that are not present in the image. Evaluated with POPE (Polling-based Object Probing Evaluation), which asks yes/no questions about object presence, and mitigated via RLHF-style preference tuning (LLaVA-RLHF, RLAIF-V).
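Because POPE-style probing reduces hallucination evaluation to yes/no questions about object presence, scoring is ordinary binary classification. A minimal sketch, assuming model answers have already been parsed into booleans; the function name and the toy data are hypothetical.

```python
def pope_metrics(predictions, labels):
    """predictions / labels: lists of booleans, True = 'yes, the object is present'."""
    pairs = list(zip(predictions, labels))
    tp = sum(p and l for p, l in pairs)
    fp = sum(p and not l for p, l in pairs)              # hallucinated objects
    fn = sum((not p) and l for p, l in pairs)
    tn = sum((not p) and (not l) for p, l in pairs)
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / n,                      # a high value suggests a "yes" bias
    }

# Four probes: the model says "yes" to an object that is absent in the second one.
metrics = pope_metrics([True, True, True, False], [True, False, True, False])
print(metrics["accuracy"])  # 0.75
```

A false positive here corresponds directly to a hallucinated object, so precision and the yes-ratio are the numbers to watch.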