Large Language Models and Agents: From Next Tokens to Verifiable Action Loops

An LLM models conditional token distributions; an agent combines the model with retrieval, tool calls, state updates, and verifiers into an auditable task system.

Mechanism Lab

Animation: how an LLM becomes an agent with retrieval, tools, and verification

The animation moves from prompt context into token logits, then adds RAG memory, tool schemas, execution traces, and verifiers so a single generation becomes an auditable loop.

Step 1 of 5: Context. Task, system constraints, history, and retrieved materials enter the context window.

C = [task, system, history, docs]

01 / Intuition

Core Intuition

An LLM is not, by itself, a program that understands the world. It models the distribution of the next token x_t given the preceding tokens x_{<t} and the context C: p(x_t | x_{<t}, C). Its capabilities come from pretraining, instruction tuning, preference optimization, and in-context learning.

Prompts, system messages, chat history, retrieved documents, and tool observations all enter the context window and change the conditional distribution; they do not automatically guarantee external truth.
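How those pieces might be assembled into C can be sketched as follows. The helper name build_context, the bracketed labels, and the whitespace word budget are illustrative assumptions rather than any real API; a production system would count tokens with the model's own tokenizer.

```python
def build_context(task, system, history, docs, max_words=200):
    """Assemble C = [task, system, history, docs] under a crude word budget.

    Hypothetical helper: the labels and the whitespace word count stand in
    for a real prompt format and a real tokenizer.
    """
    sections = [("task", task), ("system", system)]
    sections += [("history", turn) for turn in history]
    sections += [(f"doc:{doc['id']}", doc["text"]) for doc in docs]

    kept, used = [], 0
    for label, text in sections:
        cost = len(text.split())
        if used + cost > max_words:
            break  # everything after the first section that does not fit is dropped
        kept.append(f"[{label}] {text}")
        used += cost
    return "\n".join(kept)


context = build_context(
    task="Explain DID and PSM.",
    system="Cite retrieved evidence.",
    history=["User asked about causal inference."],
    docs=[{"id": "did", "text": "Difference-in-differences compares changes over time."}],
)
print(context)
```

Everything that survives the budget conditions the next-token distribution; nothing in this step checks whether the retrieved text is true.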

RAG moves open-world knowledge from parameter memory into inspectable documents; tool calling constrains text into structured actions; verifiers turn generation into execute-observe-revise loops.

A reliable agent is not just a more talkative LLM. It is a state machine with permissions, tool schemas, logs, rollback behavior, and evidence checks.
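The state-machine view can be sketched with an explicit transition table. The phase names and the TRANSITIONS table below are illustrative assumptions; the text does not prescribe a particular set of states.

```python
from enum import Enum


class Phase(Enum):
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    DONE = "done"
    NEEDS_HUMAN = "needs_human"


# Hypothetical transition table: any move not listed here is rejected.
TRANSITIONS = {
    Phase.PLAN: {Phase.ACT, Phase.NEEDS_HUMAN},
    Phase.ACT: {Phase.VERIFY},
    Phase.VERIFY: {Phase.DONE, Phase.PLAN, Phase.NEEDS_HUMAN},
    Phase.NEEDS_HUMAN: {Phase.PLAN},
    Phase.DONE: set(),
}


def advance(current, proposed, log):
    """Move to the proposed phase only if the table allows it, and log the move."""
    if proposed not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {proposed.value}")
    log.append((current.value, proposed.value))
    return proposed


log = []
phase = advance(Phase.PLAN, Phase.ACT, log)
phase = advance(phase, Phase.VERIFY, log)
print(phase, log)
```

Because ACT can only lead to VERIFY, the model cannot declare a task done without passing through a check.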

02 / Math

From language-model objectives to agent state transitions

01 / Autoregressive factorization

Given context C, an LLM decomposes sequence probability into next-token conditionals. Generation repeatedly samples or selects the next token.

p(x_{1:T} \mid C) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, C)
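A toy numerical check of the factorization, assuming a hypothetical bigram table in place of a real model. A bigram model conditions only on the previous token, which is a deliberate simplification of x_{<t}, and the context C is omitted.

```python
import math

# Hypothetical toy model: p(next token | previous token).
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"model": 0.7, "agent": 0.3},
    "a": {"model": 0.5, "agent": 0.5},
    "model": {"</s>": 1.0},
    "agent": {"</s>": 1.0},
}


def sequence_log_prob(tokens):
    """Sum the log conditionals along the sequence, starting from <s>."""
    total, prev = 0.0, "<s>"
    for token in tokens:
        total += math.log(BIGRAM[prev][token])
        prev = token
    return total


log_p = sequence_log_prob(["the", "model", "</s>"])
print(math.exp(log_p))  # the product 0.6 * 0.7 * 1.0
```

Summing logs instead of multiplying probabilities avoids underflow on long sequences.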

02 / Pretraining loss

A causal LM minimizes next-token negative log likelihood; gradients increase the softmax probability of the observed token.

L(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t^* \mid x_{<t}, C)
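For a single position, the gradient of this loss with respect to the logits is the softmax output minus the one-hot target. The sketch below checks that on hypothetical logits over a three-token vocabulary; the learning rate and the direct update of logits (rather than model weights) are simplifications for illustration.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_and_grad(logits, target):
    """Return -log p(target) and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad


logits = [2.0, 0.5, -1.0]  # hypothetical scores
target = 1                 # index of the observed token
loss, grad = nll_and_grad(logits, target)

# One gradient-descent step on the logits themselves.
learning_rate = 0.5
stepped = [z - learning_rate * g for z, g in zip(logits, grad)]
new_loss, _ = nll_and_grad(stepped, target)
print(loss, new_loss)
```

The target's gradient entry is negative, so the descent step raises its logit while lowering the others.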

03 / Instruction and preference alignment

Instruction tuning teaches task formats and constraints; preference optimization raises the log-probability of a preferred response y_good relative to a rejected response y_bad for the same prompt x.

\max_\theta \; \log \pi_\theta(y_{\mathrm{good}} \mid x) - \log \pi_\theta(y_{\mathrm{bad}} \mid x)
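A small sketch of the margin above, using hypothetical per-token probabilities. Maximizing the raw margin is unbounded, so practical objectives wrap it in a bounded surrogate; the logistic loss below is one common choice and is shown here without the reference-policy term that methods such as DPO add.

```python
import math


def sequence_log_prob(token_probs):
    """log pi(y | x) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def preference_margin(good_token_probs, bad_token_probs):
    """log pi(y_good | x) - log pi(y_bad | x)."""
    return sequence_log_prob(good_token_probs) - sequence_log_prob(bad_token_probs)


def logistic_preference_loss(margin, beta=1.0):
    """-log sigmoid(beta * margin): small when the preferred output wins clearly."""
    return math.log1p(math.exp(-beta * margin))


# Hypothetical per-token probabilities under the current policy.
margin = preference_margin([0.5, 0.8], [0.4, 0.5])
print(margin, logistic_preference_loss(margin))
```

The parameter beta is an assumed temperature on the margin; larger values penalize small margins more sharply.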

04 / RAG marginalization

A retriever proposes documents z for query q, then the generator answers conditioned on those documents. In the idealized form, the answer distribution marginalizes over all documents; in practice the sum is truncated to the top-k retrieved documents.

p(y \mid q) = \sum_z p_\eta(z \mid q) \, p_\theta(y \mid q, z)
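The marginalization can be worked through by hand on a two-document example. All probabilities below are made up for illustration, and the retriever distribution is assumed to be already normalized over the retrieved set.

```python
def marginal_answer_probs(retriever_probs, generator_probs):
    """p(y | q) = sum over z of p(z | q) * p(y | q, z)."""
    answers = {}
    for doc_id, p_z in retriever_probs.items():
        for answer, p_y in generator_probs[doc_id].items():
            answers[answer] = answers.get(answer, 0.0) + p_z * p_y
    return answers


# Hypothetical retriever weights p(z | q) over two documents.
retriever_probs = {"did": 0.7, "psm": 0.3}
# Hypothetical generator distributions p(y | q, z) for each document.
generator_probs = {
    "did": {"use DID": 0.9, "use PSM": 0.1},
    "psm": {"use DID": 0.2, "use PSM": 0.8},
}

p_y = marginal_answer_probs(retriever_probs, generator_probs)
print(p_y)
```

Here "use DID" receives 0.7 * 0.9 + 0.3 * 0.2 = 0.69, so a confident retriever dominates the final answer even when another document disagrees.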

05 / Tool-call actions

An agent constrains some token sequences into structured actions such as a tool name and arguments; the environment executes the action and returns an observation.

a_t={name,args}, o_t=Tool(a_t)
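One way to turn model text into a_t and o_t, sketched under the assumption that the model emits a JSON object with exactly the keys name and args. The TOOLS registry and the word_count tool are hypothetical stand-ins.

```python
import json


def word_count(text):
    return len(text.split())


TOOLS = {"word_count": word_count}  # hypothetical registry of callable tools


def parse_action(raw):
    """Parse model text into a_t, rejecting anything outside the schema."""
    action = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(action, dict) or set(action) != {"name", "args"}:
        raise ValueError("action must be an object with exactly 'name' and 'args'")
    if action["name"] not in TOOLS:
        raise ValueError(f"unknown tool: {action['name']}")
    if not isinstance(action["args"], dict):
        raise ValueError("'args' must be an object")
    return action


def execute(action):
    """Run the tool and return o_t; failures become observations, not crashes."""
    try:
        return {"ok": True, "value": TOOLS[action["name"]](**action["args"])}
    except Exception as exc:
        return {"ok": False, "error": repr(exc)}


action = parse_action('{"name": "word_count", "args": {"text": "auditable action loop"}}')
observation = execute(action)
print(observation)
```

Returning failures as structured observations lets the next step see the error instead of assuming success.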

06 / State and verification loop

A real agent writes observations, logs, and verifier results back into state. If verification fails, it replans or asks for human confirmation.

s_{t+1}=update(s_t,a_t,o_t,V(o_t))
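A minimal sketch of the update rule, assuming a dictionary state and a hypothetical verifier that only checks for a successful, non-empty result. The field names failures, max_failures, and status are illustrative choices, not part of the formula.

```python
def verify(observation):
    """V(o_t): hypothetical check that the tool succeeded and returned a value."""
    return bool(observation.get("ok")) and observation.get("value") is not None


def update(state, action, observation, verdict):
    """Return s_{t+1} without mutating s_t, so earlier states stay auditable."""
    failures = 0 if verdict else state["failures"] + 1
    if verdict:
        status = "continue"
    elif failures >= state["max_failures"]:
        status = "needs_human"  # repeated failures escalate instead of looping
    else:
        status = "replan"
    entry = {"action": action, "observation": observation, "verified": verdict}
    return {**state, "trace": state["trace"] + [entry], "failures": failures, "status": status}


s0 = {"trace": [], "failures": 0, "max_failures": 2, "status": "continue"}
action = {"name": "run_code", "args": {"cmd": "example"}}
failed = {"ok": False, "error": "nonzero exit"}

s1 = update(s0, action, failed, verify(failed))
s2 = update(s1, action, failed, verify(failed))
print(s1["status"], s2["status"])  # replan needs_human
```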

03 / Code

Python demo: minimal RAG, tool calling, and verification loop

This example uses a replaceable fake_llm to expose the agent structure. In a real system, the LLM API can change, but schema validation, tool execution, and verification boundaries should remain outside the model.

import math
from collections import Counter

DOCUMENTS = [
    {"id": "did", "text": "Difference-in-differences compares treated and control changes over time."},
    {"id": "psm", "text": "Propensity score matching balances observed covariates before comparing outcomes."},
    {"id": "uat", "text": "Universal approximation says a wide neural network can approximate continuous functions."},
]

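# Function words carry no topical signal. Without this filter, "a" and "and"
# decide the ranking for short queries and the UAT document outranks PSM.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "to"}

def tokenize(text):
    words = (word.strip(".,:;!?").lower() for word in text.split())
    return [word for word in words if word and word not in STOPWORDS]

def vectorize(text):
    # Bag-of-words term counts.
    return Counter(tokenize(text))

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[key] * b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)

def retrieve(query, k=2):
    qv = vectorize(query)
    # Index the id together with the text so acronym queries ("DID", "PSM") can match.
    ranked = sorted(
        DOCUMENTS,
        key=lambda doc: cosine(qv, vectorize(f"{doc['id']} {doc['text']}")),
        reverse=True,
    )
    return ranked[:k]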

def validate_action(action):
    allowed = {"search": {"query"}, "draft": {"claim", "evidence_ids"}}
    if action.get("name") not in allowed:
        raise ValueError("unknown tool")
    missing = allowed[action["name"]] - set(action.get("arguments", {}))
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return action

def run_tool(action):
    args = action["arguments"]
    if action["name"] == "search":
        return retrieve(args["query"])
    if action["name"] == "draft":
        ids = ", ".join(args["evidence_ids"])
        return f"{args['claim']} Evidence: {ids}."
    raise ValueError("unreachable")

def fake_llm(state):
    # Replace this with an LLM call. Keep the action schema and verifier outside the model.
    if not state["docs"]:
        return {"name": "search", "arguments": {"query": state["task"]}}
    evidence_ids = [doc["id"] for doc in state["docs"]]
    return {
        "name": "draft",
        "arguments": {
            "claim": "Use RAG before answering empirical-method questions.",
            "evidence_ids": evidence_ids,
        },
    }

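def verify(answer, docs):
    # Every retrieved id must appear in the "Evidence:" suffix. Matching ids
    # anywhere in the answer would let an ordinary word such as "did" pass as a citation.
    retrieved = {doc["id"] for doc in docs}
    _, marker, tail = answer.rpartition("Evidence:")
    cited = {part.strip(" .") for part in tail.split(",")} if marker else set()
    # An empty evidence set never verifies.
    return bool(retrieved) and retrieved <= cited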

state = {"task": "Explain DID and PSM for a research assistant.", "docs": [], "trace": []}

for step in range(3):
    action = validate_action(fake_llm(state))
    observation = run_tool(action)
    state["trace"].append({"action": action, "observation": observation})
    if action["name"] == "search":
        state["docs"] = observation
    else:
        if verify(observation, state["docs"]):
            state["answer"] = observation
            break
        state["task"] += " Cite retrieved evidence explicitly."

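# The loop can end without a verified answer, so avoid a KeyError here.
print(state.get("answer", "no verified answer after 3 steps"))
print("trace length:", len(state["trace"]))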

04 / Case

Case: turning a StatsPAI research assistant from chat into auditable execution

  • A user asks: "Explain DID and PSM and give Stata/R implementation cautions." A one-shot LLM answer may be fluent but does not prove its sources, version, or execution path.
  • The agent version first writes the task into state, retrieves course pages and method notes, then emits structured tool calls such as search(query), open_file(path), run_code(cmd), or render_table(model).
  • Every tool call produces an observation: retrieved documents, code output, table paths, or error logs. The agent writes observations back into context before deciding whether to retrieve more, draft, run checks, or ask for human confirmation.
  • The final answer must carry checkable evidence: which knowledge pages were cited, which commands ran, whether route/build/test checks passed, and which assumptions still need human judgment. That boundary separates agents from ordinary chatbots.
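The evidence that the final answer must carry can be collected in one serializable record. This is a sketch only: the EvidenceReport class, its field names, and the sample values are hypothetical and do not describe the actual StatsPAI implementation.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceReport:
    """Hypothetical record attached to every final answer."""
    answer: str
    cited_pages: list = field(default_factory=list)
    commands_run: list = field(default_factory=list)
    checks: dict = field(default_factory=dict)          # check name -> passed?
    open_assumptions: list = field(default_factory=list)

    def is_auditable(self):
        # Require at least one citation and at least one check, all passing.
        return bool(self.cited_pages) and bool(self.checks) and all(self.checks.values())


report = EvidenceReport(
    answer="DID compares treated and control changes over time.",
    cited_pages=["did", "psm"],
    commands_run=["<command placeholder>"],
    checks={"build": True, "test": True},
    open_assumptions=["Parallel trends must be judged by a human."],
)
print(json.dumps(asdict(report), indent=2))
print("auditable:", report.is_auditable())
```

Serializing the record alongside the answer gives a reviewer something concrete to check.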

05 / Risks

Common Pitfalls

  • Treating fluent LLM output as fact verification. The model produces high-probability text, not automatic evidence.
  • Stuffing everything into a prompt without retrieval, citations, and version control, making the answer impossible to audit.
  • Letting the model freely assemble tool arguments without schema validation, causing path errors, missing parameters, or unsafe actions.
  • Ignoring the truth value and failure state of tool observations, such as writing success conclusions after a command errored.
  • Omitting permission boundaries and human checkpoints for actions such as sending email, deleting files, or running expensive jobs.
  • Using a one-off demo as proof of reliability without traces, tests, rollback behavior, and failure recovery.
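Two of these pitfalls, missing permission boundaries and ignored failure states, can be guarded against with very little code. The POLICY table, the tool names, and the observation fields (returncode, stdout, stderr) below are assumptions made for illustration.

```python
# Hypothetical policy: each tool is allowed, needs human confirmation, or is denied.
POLICY = {"search": "allow", "send_email": "confirm", "delete_file": "confirm"}


def gate(action, approved_by_human=False):
    """Return True only if the action may run; unknown tools are denied by default."""
    decision = POLICY.get(action["name"], "deny")
    if decision == "allow":
        return True
    if decision == "confirm":
        return approved_by_human
    return False


def summarize(observation):
    """Report failure explicitly; a missing return code counts as a failure."""
    if observation.get("returncode", 1) != 0:
        return f"FAILED: {observation.get('stderr', 'no error output')}"
    return f"OK: {observation.get('stdout', '')}"


print(gate({"name": "search"}))
print(gate({"name": "send_email"}))
print(summarize({"returncode": 2, "stderr": "file not found"}))
```

Denying by default means a newly added tool stays unusable until someone makes an explicit policy decision about it.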
