Multimodal Models

Multimodal models process and generate content across multiple modalities: text, images, audio, video, and structured data. They enable richer human-computer interaction and understanding of the real world.

Modality Combinations

Input	Output	Example models
Text	Text	GPT-4, Claude, LLaMA
Image + Text	Text	GPT-4V, Claude 3, LLaVA
Text	Image	DALL-E 3, Midjourney, Stable Diffusion
Image + Text	Image	InstructPix2Pix, SDXL
Audio	Text	Whisper, Gemini
Text	Audio	TTS models, AudioPaLM
Video + Text	Text	Gemini 1.5, GPT-4o
Text	Video	Sora, Runway Gen-3
Any	Any	GPT-4o (text, image, and audio as both input and output)

Vision-Language Models

Vision-language models are the most mature multimodal category. See Multimodal Vision Models for technical details.

Key architectures:

CLIP: contrastive image-text pretraining. Produces aligned image/text embeddings.
BLIP-2 / InstructBLIP: frozen visual encoder + Q-Former + frozen LLM.
LLaVA: visual encoder + linear/MLP projection + instruction-tuned LLM.
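GPT-4o / Gemini: reported to be natively multimodal, trained jointly on several modalities rather than assembled from a separate encoder and LLM. (GPT-4V's architecture has not been published.)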
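
To make the contrastive objective behind CLIP concrete, here is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It assumes the embeddings are already computed and that row i of each list is a matched pair. The function names and the toy data are illustrative, and the temperature of 0.07 is just a fixed stand-in (CLIP learns this value during training); this is not CLIP's actual training code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: in the aligned case, matched pairs point in similar directions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(images, aligned_texts)
loss_shuffled = contrastive_loss(images, shuffled_texts)
```

The loss is small when each image is most similar to its own caption and large when the pairing is scrambled, which is the property that pushes the two encoders toward a shared embedding space.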

Capabilities:

Image description and captioning.
Visual question answering.
Document understanding (OCR, chart reading, table extraction).
Spatial reasoning (object locations, relative positions).
Diagram and code screenshot interpretation.

Audio Understanding and Generation

ASR (Automatic Speech Recognition): transcribe spoken audio to text.

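Whisper (OpenAI 2022): encoder-decoder Transformer trained on 680k hours of weakly supervised multilingual audio. A strong open-weights ASR baseline that works on many datasets without fine-tuning. Supports about 99 languages and can translate speech from them directly into English text.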

Speaker diarization: who spoke when in a multi-speaker recording.

TTS (Text-to-Speech): generate natural speech from text. Neural TTS (ElevenLabs, OpenAI TTS) produces human-like voices.

AudioPaLM: integrates speech into PaLM via audio tokens. Handles speech translation and spoken Q&A.

MusicGen (Meta): generate music from text descriptions.

Video Models

Video understanding: temporal reasoning over sequences of frames. See Video Understanding.

Gemini 1.5 Pro: processes video natively as a sequence of sampled frames + audio. Its 1M-token context window is large enough for roughly an hour of video in a single prompt.
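
As a rough illustration of how context length bounds video length, the sketch below estimates the token cost of a clip. The per-frame and per-second figures (258 tokens per frame, 1 frame per second, 32 audio tokens per second) are assumptions used for the arithmetic, not values taken from this document; check the current model documentation before relying on them.

```python
def video_token_cost(duration_s, fps=1.0, tokens_per_frame=258, audio_tokens_per_s=32):
    """Estimate tokens for a clip: sampled frames plus the audio track.

    All default rates are assumed values for illustration.
    """
    frames = int(duration_s * fps)
    return frames * tokens_per_frame + int(duration_s * audio_tokens_per_s)


def max_video_seconds(context_tokens, reserved_tokens=0, fps=1.0,
                      tokens_per_frame=258, audio_tokens_per_s=32):
    """Longest clip (in seconds) that fits after reserving tokens for text."""
    per_second = fps * tokens_per_frame + audio_tokens_per_s
    return int((context_tokens - reserved_tokens) // per_second)


one_minute_cost = video_token_cost(60)
budget_seconds = max_video_seconds(1_000_000)
budget_minutes = budget_seconds / 60
```

Under these assumed rates a 1M-token window holds a little under an hour of video, and any tokens reserved for the text prompt or the answer shorten that further.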

Video generation:

Sora (OpenAI 2024): text-to-video for clips up to about 1 minute, with spatially and temporally coherent output. Built on a diffusion transformer (DiT).
Runway Gen-3 Alpha: commercial video generation model aimed at cinematic-looking output.
Stable Video Diffusion: image-to-video.
CogVideoX: open-source video generation model.

Document Understanding

Multimodal models that process documents as images (preserving layout, tables, charts):

Nougat (Meta): parse academic PDFs to structured Markdown including LaTeX equations.

GPT-4V / Claude 3: can read tables, extract data from charts, and understand infographics directly from image input.

Document question answering: extract specific fields from invoices, contracts, or forms.

Interleaved Multimodal Inputs

Modern models like Gemini 1.5 and GPT-4o accept arbitrarily interleaved sequences of text, images, and other modalities in a single context:

User: [image1] What's the difference between these two charts? [image2]

This enables: comparing multiple images, analyzing video frames with commentary, referencing figures in a document.
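
One way to picture an interleaved prompt is as an ordered list of typed parts. The sketch below uses hypothetical `TextPart` and `ImagePart` classes invented for illustration; it is not the request schema of any particular vendor's API, where images would typically be passed as bytes or URLs.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    name: str  # hypothetical handle standing in for real image data


Part = Union[TextPart, ImagePart]


def render(parts: List[Part]) -> str:
    """Flatten an interleaved sequence into a readable string, keeping order."""
    out = []
    for p in parts:
        out.append(f"[{p.name}]" if isinstance(p, ImagePart) else p.text)
    return " ".join(out)


# The example prompt from the text, expressed as ordered parts.
prompt = [
    ImagePart("image1"),
    TextPart("What's the difference between these two charts?"),
    ImagePart("image2"),
]
rendered = render(prompt)
```

The point of the structure is that order is preserved: the model sees each image at the position where the user placed it, so text can refer to "the first chart" or "the figure above".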

Native Multimodality vs. Retrofit

Retrofit (most VLMs): start with a pretrained LLM; attach a pretrained vision encoder; train the connection layer (projection MLP, Q-Former). Fast to train, because most parameters start from pretrained weights; leverages existing LLM capability.
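
The retrofit recipe can be sketched at the level of tensor shapes. In the toy example below, random vectors stand in for the outputs of a frozen vision encoder and for the LLM's text token embeddings, and a small two-layer MLP with GELU plays the role of the projection. All dimensions, names, and the initialization are assumptions chosen to keep the example tiny; real models use far larger sizes and a tensor library.

```python
import math
import random


def linear(x, weight, bias):
    """y = x @ weight + bias for one vector x; weight has shape [d_in][d_out]."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weight), bias)]


def gelu(v):
    """Tanh approximation of the GELU activation, applied elementwise."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3))) for x in v]


def init_layer(d_in, d_out, rng):
    """Small random weights and zero bias (illustrative initialization)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
    return weight, [0.0] * d_out


def project_patches(patches, w1, b1, w2, b2):
    """Map each vision-encoder patch embedding into the LLM embedding space."""
    return [linear(gelu(linear(p, w1, b1)), w2, b2) for p in patches]


rng = random.Random(0)
D_VISION, D_LLM, N_PATCHES, N_TEXT = 8, 16, 4, 3  # toy sizes

# The projection is the only part a retrofit trains from scratch.
w1, b1 = init_layer(D_VISION, D_LLM, rng)
w2, b2 = init_layer(D_LLM, D_LLM, rng)

# Stand-ins for frozen components: vision encoder output and LLM token embeddings.
patches = [[rng.gauss(0.0, 1.0) for _ in range(D_VISION)] for _ in range(N_PATCHES)]
text_embs = [[rng.gauss(0.0, 1.0) for _ in range(D_LLM)] for _ in range(N_TEXT)]

visual_tokens = project_patches(patches, w1, b1, w2, b2)
llm_input = visual_tokens + text_embs  # projected image tokens precede the text
```

After projection the image patches have the same width as text token embeddings, so the LLM can consume them as if they were extra tokens in its input sequence.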

Native multimodal: train from scratch on mixed text+image (+ audio/video) data. Deeper integration; tends to do better on tasks requiring tight cross-modal reasoning. Examples: Gemini, GPT-4o.

Emergent cross-modal reasoning: natively multimodal models are reported to reason about sound from a video scene, infer 3D structure from 2D images, and relate text descriptions to specific image regions more reliably than retrofit models, though published head-to-head comparisons are limited.

Evaluation

VQAv2, GQA, TextVQA: visual question answering benchmarks.
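
For a sense of how such benchmarks are scored, the sketch below implements the commonly cited simplified form of the VQA accuracy metric, in which a prediction counts as fully correct once at least three human annotators gave the same answer. The function name is illustrative, and the normalization here (lowercasing and trimming) is a simplification; the official evaluation scripts apply more extensive answer normalization.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


# Ten annotator answers for one question (toy data).
answers = ["red"] * 2 + ["dark red"] * 3 + ["maroon"] * 5

score_partial = vqa_accuracy("red", answers)      # 2 of 10 annotators agree
score_full = vqa_accuracy("Maroon ", answers)     # 5 of 10 annotators agree
score_zero = vqa_accuracy("blue", answers)        # no annotator agrees
```

The soft scoring gives partial credit for answers that only some annotators chose, which matters because many visual questions have more than one reasonable answer.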

MMBench, MMMU: broad multimodal capabilities.

OCRBench: OCR and document understanding.

Video-MME: video understanding evaluation.

Human evaluation: for open-ended tasks (describe this image; compare these videos), automated metrics are weak. Human side-by-side evaluation and LLM-as-judge scoring are the common alternatives.
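
A side-by-side comparison is usually summarized as win rates. The sketch below assumes a hypothetical `judge` callable (a human rater or an LLM prompt wrapper) that returns "first", "second", or "tie". Each pair is judged twice with the order swapped, and a win is counted only when both orderings agree, which is one common way to reduce position bias. The `toy_judge`, which simply prefers the longer response, exists only so the example runs.

```python
from collections import Counter


def pairwise_verdict(judge, prompt, resp_a, resp_b):
    """Judge twice with positions swapped; keep a win only if both runs agree."""
    first_run = judge(prompt, resp_a, resp_b)
    second_run = judge(prompt, resp_b, resp_a)
    if first_run == "first" and second_run == "second":
        return "A"
    if first_run == "second" and second_run == "first":
        return "B"
    return "tie"


def win_rates(judge, examples):
    """Fraction of examples won by model A, won by model B, or tied."""
    counts = Counter(pairwise_verdict(judge, p, a, b) for p, a, b in examples)
    n = len(examples)
    return {k: counts[k] / n for k in ("A", "B", "tie")}


def toy_judge(prompt, first, second):
    """Hypothetical stand-in judge: prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


examples = [
    ("Describe the image.", "A dog on a beach at sunset.", "A dog."),
    ("Describe the image.", "A chart.", "A bar chart of monthly sales."),
    ("Describe the image.", "Two cats.", "Two dogs."),
]
rates = win_rates(toy_judge, examples)
```

Treating inconsistent verdicts as ties is a conservative choice; an alternative is to report them separately as a measure of judge reliability.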