Multimodal Models

Multimodal models process and generate content across multiple modalities: text, images, audio, video, and structured data. They enable richer human-computer interaction and understanding of the real world.

Modality Combinations

Input Output Example models
Text Text GPT-4, Claude, LLaMA
Image + Text Text GPT-4V, Claude 3, LLaVA
Text Image DALL-E 3, Midjourney, Stable Diffusion
Image + Text Image InstructPix2Pix, SDXL
Audio Text Whisper, Gemini
Text Audio TTS models, AudioPaLM
Video + Text Text Gemini 1.5, GPT-4o
Text Video Sora, Runway Gen-3
Any Any GPT-4o, Gemini 1.5, Claude 3.5

Vision-Language Models

The most developed multimodal category. See Multimodal Vision Models for technical details.

Key architectures:

  • CLIP: contrastive image-text pretraining. Produces aligned image/text embeddings.
  • BLIP-2 / InstructBLIP: frozen visual encoder + Q-Former + frozen LLM.
  • LLaVA: visual encoder + linear/MLP projection + instruction-tuned LLM.
  • GPT-4V / Gemini: natively multimodal from pretraining.

Capabilities:

  • Image description and captioning.
  • Visual question answering.
  • Document understanding (OCR, chart reading, table extraction).
  • Spatial reasoning (object locations, relative positions).
  • Diagram and code screenshot interpretation.

Audio Understanding and Generation

ASR (Automatic Speech Recognition): transcribe spoken audio to text.

Whisper (OpenAI 2022): encoder-decoder Transformer trained on 680k hours of labeled speech. State-of-the-art open-source ASR. Supports 99 languages. Zero-shot translation to English.

Speaker diarization: who spoke when in a multi-speaker recording.

TTS (Text-to-Speech): generate natural speech from text. Neural TTS (ElevenLabs, OpenAI TTS) produces human-like voices.

AudioPaLM: integrates speech into PaLM via audio tokens. Handles speech translation and spoken Q&A.

MusicGen (Meta): generate music from text descriptions.

Video Models

Video understanding: temporal reasoning over sequences of frames. See Video Understanding.

Gemini 1.5 Pro: processes video natively as a sequence of frames + audio. 1M token context allows understanding full-length videos.

Video generation:

  • Sora (OpenAI 2024): text-to-video up to 1 minute; spatially and temporally coherent. DiT-based diffusion.
  • Runway Gen-3 Alpha: high-quality video generation; cinematic quality.
  • Stable Video Diffusion: image-to-video.
  • CogVideoX: open-source video generation model.

Document Understanding

Multimodal models that process documents as images (preserving layout, tables, charts):

Nougat (Meta): parse academic PDFs to structured Markdown including LaTeX equations.

GPT-4V / Claude 3: can read tables, extract data from charts, and understand infographics directly from image input.

Document question answering: extract specific fields from invoices, contracts, or forms.

Interleaved Multimodal Inputs

Modern models like Gemini 1.5 and GPT-4o accept arbitrarily interleaved sequences of text, images, and other modalities in a single context:

User: [image1] What's the difference between these two charts? [image2]

This enables: comparing multiple images, analyzing video frames with commentary, referencing figures in a document.

Native Multimodality vs. Retrofit

Retrofit (most VLMs): start with a pretrained LLM; attach a pretrained vision encoder; train the connection layer (projection MLP, Q-Former). Fast; leverages existing LLM capability.

Native multimodal: train from scratch on mixed text+image (+ audio/video) data. Deeper integration; better at tasks requiring tight cross-modal reasoning. Gemini, GPT-4o.

Emergent cross-modal reasoning: natively multimodal models can reason about sound from a video scene, infer 3D structure from 2D images, and relate text descriptions to specific image regions in ways that retrofit models struggle with.

Evaluation

VQAv2, GQA, TextVQA: visual question answering benchmarks.

MMBench, MMMU: broad multimodal capabilities.

OCRBench: OCR and document understanding.

Video-MME: video understanding evaluation.

Human evaluation: for open-ended tasks (describe this image; compare these videos), automated metrics are weak. Human side-by-side evaluation or LLM-as-judge are common.