Agents and Agent Frameworks

An AI agent is a system that uses a language model to perceive inputs, reason about them, and take actions in the world to accomplish goals. Agents extend LLMs beyond single-turn Q&A to multi-step, goal-directed behavior.

What Makes an Agent

Perception: the agent receives inputs such as user messages, tool outputs, environment observations, and documents.

Memory: the agent maintains context across steps, including conversation history, retrieved knowledge, and working memory.

Reasoning: the LLM plans which actions to take and in what order.

Action: the agent executes actions such as calling tools, writing to files, browsing the web, running code, and calling APIs.

Feedback loop: action results are returned to the agent; it continues reasoning until the goal is achieved.
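
The components above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not any framework's API: `scripted_policy` stands in for the LLM, and `TOOLS`, `run_agent`, and `max_steps` are hypothetical names chosen for this example.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> Python callable.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}

def scripted_policy(memory: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM (reasoning). Returns (action, argument)."""
    observations = [m for m in memory if m.startswith("observation: ")]
    if not observations:
        return ("add", "2+3")
    return ("finish", observations[-1].removeprefix("observation: "))

def run_agent(goal: str, policy=scripted_policy, max_steps: int = 5) -> str:
    memory = [f"goal: {goal}"]                    # perception + memory
    for _ in range(max_steps):
        action, arg = policy(memory)              # reasoning
        if action == "finish":
            return arg
        result = TOOLS[action](arg)               # action
        memory.append(f"observation: {result}")   # feedback loop
    raise RuntimeError("step budget exhausted before the goal was reached")

print(run_agent("What is 2 + 3?"))
```

The step budget is a design choice worth keeping in a real system: without it, a policy that never emits `finish` would loop indefinitely.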

ReAct (Reason + Act)

Introduced by Yao et al. (2022). The model interleaves reasoning traces with actions in its output, and the result of each action is fed back as an observation:

Thought: I need to find the population of Paris.
Action: search("Paris population 2024")
Observation: The population of Paris is approximately 2.1 million.
Thought: Now I have the answer.
Answer: The population of Paris is approximately 2.1 million.

The thought-action-observation loop is the foundation of most agent implementations.
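
A driver for this text format might look like the sketch below. It assumes the exact `Action: tool("argument")` and `Answer:` line shapes used in the trace above; `fake_llm` is a stand-in for a real model call and `search` is a hypothetical tool with a canned result.

```python
import re

ACTION_RE = re.compile(r'^Action:\s*(\w+)\("(.*)"\)\s*$', re.MULTILINE)
ANSWER_RE = re.compile(r"^Answer:\s*(.+)$", re.MULTILINE)

def search(query: str) -> str:
    """Hypothetical tool with a canned result."""
    return "The population of Paris is approximately 2.1 million."

TOOLS = {"search": search}

def fake_llm(transcript: str) -> str:
    """Stand-in for a model call: emits a Thought plus an Action or an Answer."""
    if "Observation:" not in transcript:
        return ("Thought: I need to find the population of Paris.\n"
                'Action: search("Paris population 2024")')
    return ("Thought: Now I have the answer.\n"
            "Answer: The population of Paris is approximately 2.1 million.")

def react(question: str, llm=fake_llm, max_turns: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        step = llm(transcript)
        transcript += "\n" + step
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip()
        action = ACTION_RE.search(step)
        if action is None:
            raise ValueError(f"could not parse an action from: {step!r}")
        tool_name, argument = action.groups()
        # Append the tool result so the next model call can reason over it.
        transcript += f"\nObservation: {TOOLS[tool_name](argument)}"
    raise RuntimeError("no answer within the turn budget")

print(react("What is the population of Paris?"))
```

Parsing free text with regular expressions is brittle, which is one reason structured tool calls are often preferred in practice.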

Tool-Calling Agents

Modern LLMs support structured tool calls natively (OpenAI function calling, Claude tool use, Gemini function calling).

The model receives a list of available tools, each described by a JSON schema. It outputs a JSON object naming the tool and its arguments; the framework executes the tool and returns the result to the model.
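
The framework side of that exchange can be sketched as follows. The field names (`name`, `description`, `parameters`, `arguments`) are an assumption for illustration; each provider uses its own envelope around the JSON Schema. `get_weather` is a hypothetical tool that returns canned data.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool; returns canned data instead of calling an API."""
    return {"city": city, "temperature": 18, "unit": unit}

# Tool description the model would see. Parameters are expressed as JSON Schema.
TOOL_SPECS = [{
    "name": "get_weather",
    "description": "Look up the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}]

REGISTRY = {"get_weather": get_weather}

def execute_tool_call(raw_call: str) -> str:
    """Check and run one tool call that the model emitted as JSON."""
    call = json.loads(raw_call)
    spec = next((s for s in TOOL_SPECS if s["name"] == call.get("name")), None)
    if spec is None:
        return json.dumps({"error": f"unknown tool {call.get('name')!r}"})
    args = call.get("arguments", {})
    missing = [k for k in spec["parameters"]["required"] if k not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    return json.dumps(REGISTRY[spec["name"]](**args))

model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(execute_tool_call(model_output))
```

Errors are returned as JSON rather than raised so that they can be passed back to the model, which can then correct its call.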

Parallel tool calling: the model can request several tool calls in a single turn. If the calls are independent, the framework can run them concurrently, reducing total latency.
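
One way a framework could run independent calls concurrently is with `asyncio.gather`, sketched below. The two tools are hypothetical and simulate I/O latency with `asyncio.sleep`; the shape of each call dictionary is an assumption.

```python
import asyncio
import time

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.2)          # simulated network latency
    return f"weather for {city}"

async def fetch_news(topic: str) -> str:
    await asyncio.sleep(0.2)
    return f"news about {topic}"

TOOLS = {"fetch_weather": fetch_weather, "fetch_news": fetch_news}

async def run_tool_calls(calls: list[dict]) -> list[str]:
    """Run independent tool calls concurrently; results keep the call order."""
    coroutines = [TOOLS[c["name"]](**c["arguments"]) for c in calls]
    return await asyncio.gather(*coroutines)

calls = [
    {"name": "fetch_weather", "arguments": {"city": "Paris"}},
    {"name": "fetch_news", "arguments": {"topic": "transit"}},
]
start = time.perf_counter()
results = asyncio.run(run_tool_calls(calls))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Because both simulated calls wait at the same time, the total should be close to the latency of the slowest call rather than the sum of both.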

Memory Types

In-context (working) memory: everything in the current prompt window. Fast; limited by context length.

External memory (retrieval): a store outside the prompt, typically a vector store, queried by the agent. Effectively unbounded in size; contents reach the model only through explicit retrieval.

Episodic memory: logs of past agent runs. Retrieved to inform current behavior (“last time you asked this, I found X”).

Semantic memory: a knowledge base of facts. Updated via tool use; queried via RAG.
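
The relationship between external and in-context memory can be illustrated with a toy retrieval store. In this sketch a word-count vector stands in for a real embedding model, and `ExternalMemory` is a hypothetical class, not a library API; only the retrieved items would be copied into the prompt.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ExternalMemory:
    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = ExternalMemory()
memory.add("The deploy script lives in tools/deploy.sh")
memory.add("The staging database is refreshed every night")
memory.add("Alice owns the billing service")

# Only the top-k results enter the prompt (working memory).
working_memory = memory.retrieve("who owns billing", k=1)
print(working_memory)
```

An episodic or semantic memory could reuse the same interface, differing mainly in what gets written to the store and when.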

Planning Strategies

Single-shot planning: generate the full plan in one step, then execute each step. Fails if the plan is infeasible or if the result of one step should change the steps that follow.

ReAct (dynamic planning): re-plan after each tool result. More robust to unexpected results, but slower because every step requires another model call.

Self-refinement: the agent critiques its own output and refines it. Improves output quality at the cost of additional LLM calls.

Tree of Thoughts + agent: explore multiple plan branches; evaluate; pursue the most promising.
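
Branch exploration of this kind can be sketched as a beam search over partial plans. This is a toy under stated assumptions, not the published Tree of Thoughts procedure: `propose` stands in for the LLM generating candidate next steps, `make_scorer` stands in for the evaluator, and the task (reach a target number with arithmetic steps) is invented for illustration.

```python
from typing import Callable, Optional

State = tuple[int, tuple[str, ...]]   # (current value, steps taken so far)

def propose(state: State) -> list[State]:
    """Stand-in for the LLM proposing next steps from a partial plan."""
    value, steps = state
    return [
        (value + 3, steps + ("+3",)),
        (value * 2, steps + ("*2",)),
        (value - 1, steps + ("-1",)),
    ]

def make_scorer(target: int) -> Callable[[State], float]:
    """Stand-in for the evaluator: states closer to the target score higher."""
    return lambda state: -abs(target - state[0])

def tree_search(start: int, target: int, beam_width: int = 2,
                max_depth: int = 6) -> Optional[tuple[str, ...]]:
    score = make_scorer(target)
    frontier: list[State] = [(start, ())]
    for _ in range(max_depth):
        candidates = [child for s in frontier for child in propose(s)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]      # keep only the most promising branches
        if frontier[0][0] == target:
            return frontier[0][1]
    return None

plan = tree_search(start=1, target=11)
print(plan)
```

The beam width trades cost for coverage: a wider beam evaluates more branches per level but is less likely to discard the branch that eventually succeeds.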

MCTS (Monte Carlo tree search) for planning: build a search tree of possible action sequences; evaluate with a reward function; select the best path.

Multi-Agent Systems

Orchestrator-subagent: an orchestrator LLM decomposes the task and delegates to specialized subagents (a coding agent, a research agent, a data analysis agent).
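
The delegation pattern can be sketched as a routing table plus a decomposition step. Everything here is hypothetical: `decompose` stands in for the orchestrator LLM, and each subagent function stands in for an agent with its own model and tools.

```python
from typing import Callable

# Hypothetical subagents; each would wrap its own LLM and tools.
def research_agent(task: str) -> str:
    return f"[research] notes on: {task}"

def coding_agent(task: str) -> str:
    return f"[code] patch for: {task}"

SUBAGENTS: dict[str, Callable[[str], str]] = {
    "research": research_agent,
    "coding": coding_agent,
}

def decompose(goal: str) -> list[tuple[str, str]]:
    """Stand-in for the orchestrator LLM: returns (subagent, subtask) pairs."""
    return [
        ("research", f"find prior fixes related to '{goal}'"),
        ("coding", f"implement a fix for '{goal}'"),
    ]

def orchestrate(goal: str) -> str:
    results = []
    for agent_name, subtask in decompose(goal):
        results.append(SUBAGENTS[agent_name](subtask))
    # A real orchestrator would synthesize these results with another model call.
    return "\n".join(results)

report = orchestrate("login timeout bug")
print(report)
```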

Debate: two agents argue different positions; a judge synthesizes. Can improve factual accuracy.

Critic-actor: one agent proposes; another critiques; iterate. Common for code generation and complex writing.
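
The propose-critique-iterate cycle can be sketched with two stand-in functions. `actor` and `critic` are hypothetical placeholders for two model calls; here the critic applies fixed rules so the example runs without a model.

```python
def actor(task: str, feedback: list[str]) -> str:
    """Stand-in for the proposing LLM: revises its draft using the feedback."""
    draft = "Agents call tools"
    if any("mention memory" in f for f in feedback):
        draft += " and keep memory across steps"
    if any("end with a period" in f for f in feedback):
        draft += "."
    return draft

def critic(draft: str) -> list[str]:
    """Stand-in for the critiquing LLM: returns issues (empty list = accept)."""
    issues = []
    if "memory" not in draft:
        issues.append("mention memory")
    if not draft.endswith("."):
        issues.append("end with a period")
    return issues

def critic_actor(task: str, max_rounds: int = 3) -> tuple[str, int]:
    feedback: list[str] = []
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = actor(task, feedback)
        issues = critic(draft)
        if not issues:
            return draft, round_number
        feedback.extend(issues)
    return draft, max_rounds      # best effort once the round budget is spent

final, rounds = critic_actor("Describe agents in one sentence")
print(rounds, final)
```

For code generation, the critic role is often filled partly by running tests, so the feedback consists of concrete failures rather than opinions.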

Agent Frameworks

Framework            Key features
LangChain            Tool chains, RAG, memory; large ecosystem
LlamaIndex           Document-centric RAG + agents
CrewAI               Multi-agent role-based collaboration
AutoGen (Microsoft)  Multi-agent conversation; code execution
DSPy                 Optimizable LLM programs (not just prompting)
Swarm (OpenAI)       Lightweight multi-agent handoffs
Pydantic AI          Type-safe agent outputs and tool schemas

Code Execution Agents

Agents that write and run code to solve tasks. A code interpreter tool (for example ChatGPT’s Code Interpreter or Claude’s code execution tool) allows agents to:

  • Analyze data by writing pandas/matplotlib code.
  • Debug code by running it and reading errors.
  • Perform calculations that LLMs cannot do reliably in-weights.

Sandboxing: code execution must be sandboxed to prevent malicious code from affecting the host system. Docker containers, E2B, Modal, or Pyodide (WebAssembly) provide isolation.
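
The run-and-read-errors step can be sketched with the standard library. Note the assumption loudly: a child process with a timeout is not a security boundary, so in practice this call would itself run inside one of the isolation layers named above. `run_untrusted` is a hypothetical helper name.

```python
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 5.0) -> dict:
    """Run model-written Python in a separate interpreter with a time limit.

    NOTE: a child process plus a timeout is NOT a sandbox by itself; it only
    bounds run time and captures output for the agent to read.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],   # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}

good = run_untrusted("print(sum(range(10)))")
bad = run_untrusted("1 / 0")
print(good["stdout"].strip(), bad["ok"])
```

The captured `stderr` is what lets the agent debug: the traceback from the failed run is returned as an observation for the next reasoning step.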

Computer Use Agents

Agents that control a computer: clicking buttons, typing text, reading the screen.

Claude Computer Use (Anthropic, 2024): the model receives screenshots and outputs actions (click, type, scroll). In principle this enables autonomous task completion in any GUI-based application.

Browser automation: Playwright or Selenium driven by an agent. Used for web browsing, form filling, and data extraction.

Evaluation of Agents

Task completion rate: fraction of tasks completed successfully end-to-end.

Number of steps: for the same outcome, fewer steps means a more efficient agent.

Cost: total token cost and tool API cost per task.

Human-in-the-loop: how often the agent needs to ask a person for clarification.
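
These metrics can be computed from per-task run logs, as in the sketch below. The `RunRecord` fields and the token prices are assumptions made for illustration; tool API costs are omitted for brevity and would be added to the cost term.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    task_id: str
    success: bool
    steps: int
    input_tokens: int
    output_tokens: int
    clarifications: int

# Hypothetical prices in USD per 1,000 tokens; substitute real ones.
PRICE_IN, PRICE_OUT = 0.003, 0.015

def summarize(runs: list[RunRecord]) -> dict:
    n = len(runs)
    cost = sum(
        r.input_tokens / 1000 * PRICE_IN + r.output_tokens / 1000 * PRICE_OUT
        for r in runs
    )
    return {
        "completion_rate": sum(r.success for r in runs) / n,
        "mean_steps": sum(r.steps for r in runs) / n,
        "mean_cost_usd": cost / n,
        # Fraction of tasks where the agent asked for clarification at least once.
        "clarification_rate": sum(r.clarifications > 0 for r in runs) / n,
    }

runs = [
    RunRecord("t1", True, 4, 2000, 500, 0),
    RunRecord("t2", False, 10, 6000, 1500, 1),
    RunRecord("t3", True, 6, 4000, 1000, 0),
    RunRecord("t4", True, 4, 4000, 1000, 1),
]
summary = summarize(runs)
print(summary)
```

Reporting these together matters because they trade off: an agent can raise its completion rate by taking more steps or asking more questions, which shows up in the other columns.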

Benchmarks: SWE-bench (software engineering), GAIA (general AI assistant), WebArena (web tasks), OSWorld (computer use), TAU-bench (tool use).

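SWE-bench: 2,294 GitHub issues from open-source Python repositories; the agent must produce a code patch that resolves the issue, as checked by the repository's tests. Reported resolution rates for top agents in mid-2024 ranged from roughly 12% to 50%, with the spread depending on factors such as the benchmark subset and the agent scaffold.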