AI agent glossary 2026 is a reference page for the terms builders keep seeing across agent loops, model architecture, training, retrieval, evaluation, and serving.
- Core agent terms
- Model architecture
- Training methods
- DPO (Direct Preference Optimization)
- LoRA (Low-Rank Adaptation)
- Pre-training
- QLoRA
- RLHF (Reinforcement Learning from Human Feedback)
- RLVR (Reinforcement Learning with Verifiable Rewards)
- SFT (Supervised Fine-Tuning)
- Deployment and serving
- Agent patterns
- AP2 (Agent Payments Protocol)
- CoT (Chain of Thought)
- MCP (Model Context Protocol)
- Multi-agent
- ReAct
- Reflection
- Evaluation
- RAG and retrieval
- Infrastructure
- Frequently asked questions
- What is the difference between tool use and function calling?
- Why does MCP matter in agent infrastructure?
- Where can I verify common LLM and transformer terms?
- Primary sources
Core agent terms
Agent loop
An agent loop is the repeated cycle in which a model produces an action, uses a tool or emits a response, receives the result, and continues until it reaches a stopping condition. Anthropic and OpenAI documentation both describe tool-enabled workflows where model output can trigger external actions before the next model turn. Example: a coding agent reads an issue, calls a search tool, inspects the returned files, and then proposes a patch. See also: Tool use, Function calling, Reflection, Multi-agent.
Context window
A context window is the amount of input and generated text, measured in tokens, that a model can attend to in a single interaction. In practice, it determines how much conversation history, code, or retrieved material an agent can use at once. Example: a support agent with a larger context window can ingest a longer ticket history before answering. See also: Tokenizer, RAG, Chunking, Inference.Function calling
Function calling is OpenAI‘s earlier term for structured tool invocation, where the model returns a machine-readable request instead of plain text. The concept maps closely to what many platforms now call tool use. Example: a model returns JSON arguments forget_weather(city="Paris") rather than describing the weather API in prose. See also: Tool use, System prompt, MCP, Agent loop.
Streaming
Streaming is response delivery token by token, or chunk by chunk, instead of waiting for the full completion to finish. It improves perceived latency and is common in chat UIs and agent consoles. Example: a terminal coding assistant streams a draft explanation while still generating the final patch. See also: Inference, Continuous batching, Cold start, Temperature.System prompt
A system prompt is the highest-priority instruction that sets the model’s role, constraints, and behavior for a conversation. OpenAI and Anthropic both document system-level instructions as the place to define policy, tone, and tool rules. Example: an enterprise agent’s system prompt can require citations and forbid actions without user confirmation. See also: Tool use, ReAct, Reflection, Context window.Temperature
Temperature controls sampling randomness during generation. Lower values generally make outputs more consistent, while higher values increase variation. Example: a code-generation agent often runs at low temperature to reduce unstable edits across repeated runs. See also: Inference, Pass@k, CoT, Streaming.Tool use
Tool use is the pattern where a model emits a structured request to call an external function, API, or local capability instead of only returning free text. In the AI agent glossary 2026, this is one of the most central terms because it is what turns a chat model into an action-taking system. Example: an assistant calls a calendar API with structured arguments to create a meeting. See also: Function calling, MCP, Agent loop, ReAct.
“MCP is an open protocol that standardizes how applications provide context to LLMs.”
Model Context Protocol
Why do tool use and function calling sound interchangeable?
They usually describe the same operational pattern: the model emits structured arguments for an external capability. OpenAI popularized function calling; newer platform docs often use the broader term tool use because the target may be a function, API, browser action, or local runtime capability.
Model architecture
Attention mechanism
The attention mechanism is the core transformer operation that lets a model weigh which tokens matter most when producing the next token. Hugging Face’s glossary and transformer documentation treat attention as the component that enables long-range token relationships. Example: in a long code file, attention helps the model connect a function call to its earlier definition. See also: Tokenizer, Context window, Embeddings, Inference.
Embeddings
Embeddings are dense numerical vectors that represent the semantic content of text, images, or other inputs. They are widely used for retrieval, clustering, and similarity search rather than direct generation. Example: a documentation search system stores embeddings for each section and retrieves the nearest matches to a user query. See also: RAG, Chunking, HNSW, Reranking.MoE (Mixture of Experts)
Mixture of Experts is an architecture in which only a subset of model components is activated for each token, instead of using the full parameter set every time. The goal is to increase capacity without paying the full compute cost per token. Example: an MoE model may route coding-heavy tokens to different experts than conversational tokens. See also: Inference, Quantization, Tensor parallelism, Attention mechanism.Tokenizer
A tokenizer converts raw text into tokens that a model can process, and converts generated tokens back into text. Different tokenizers split the same string differently, which affects context usage and cost. Example: a long source file may consume more tokens than its character count suggests because code punctuation is tokenized separately. See also: Context window, Attention mechanism, Pre-training, Inference.Training methods
DPO (Direct Preference Optimization)
DPO is a preference-learning method that trains a model directly from preferred versus rejected outputs, without first fitting a separate reward model in the classic RLHF pipeline. It is commonly discussed as a simpler alignment alternative in post-training stacks. Example: a lab can train a chat model to prefer concise, policy-compliant answers by optimizing on ranked response pairs. See also: RLHF, SFT, RLVR, Pre-training.
LoRA (Low-Rank Adaptation)
LoRA is a fine-tuning method that updates small low-rank adapter matrices instead of changing every model weight. In the AI agent glossary 2026, LoRA matters because it made domain adaptation much cheaper on limited hardware. Example: a team fine-tunes an open model on internal support tickets by training adapters rather than the full base model. See also: QLoRA, SFT, Quantization, Pre-training.Pre-training
Pre-training is the initial large-scale training phase on broad corpora, where a model learns general language or multimodal patterns. It usually happens before instruction tuning or preference optimization. Example: a base model learns syntax, facts, and coding patterns during pre-training before it is adapted for chat. See also: SFT, RLHF, DPO, Tokenizer.QLoRA
QLoRA combines LoRA adapters with low-bit quantization so that fine-tuning can fit on less memory while preserving much of the base model’s capability. It is widely used for adapting open-weight models on smaller GPU setups. Example: a startup fine-tunes a 7B or 13B model on one or a few GPUs by using 4-bit quantization plus LoRA adapters. See also: LoRA, Quantization, SFT, Inference.RLHF (Reinforcement Learning from Human Feedback)
RLHF is a post-training approach that uses human preference data to steer model behavior toward outputs people rate more highly. It became a standard term for alignment pipelines in chat models. Example: reviewers compare two assistant answers, and those preferences are used to improve future responses. See also: DPO, RLVR, SFT, System prompt.RLVR (Reinforcement Learning with Verifiable Rewards)
RLVR is reinforcement learning where the reward comes from checks that can be verified, such as tests, exact answers, or formal rubrics, rather than only human preference judgments. In the AI agent glossary 2026, RLVR is increasingly relevant for coding and reasoning systems because correctness can often be measured automatically. Example: a coding model gets reward when its generated patch passes the unit tests. See also: RLHF, DPO, HumanEval, SWE-Bench.SFT (Supervised Fine-Tuning)
SFT is training on labeled prompt-response pairs so a base model learns the desired instruction-following behavior. It is often the first post-training step before preference optimization or reinforcement learning. Example: a model is trained on curated examples of user requests and ideal assistant answers to become a chat assistant. See also: Pre-training, RLHF, DPO, LoRA.When should builders care about RLVR instead of RLHF?
RLVR matters when success can be checked automatically. Coding tasks, math problems, and structured extraction pipelines often have tests, exact targets, or deterministic validators, which makes verifiable reward a better fit than preference-only ranking.
Why did LoRA become the default adaptation term?
LoRA reduced the cost of customization by avoiding full-model weight updates. That made it practical to adapt open models for narrow domains, internal jargon, or workflow-specific assistants without the compute budget of full fine-tuning.
Deployment and serving
PagedAttention
PagedAttention is a memory-management technique introduced by the vLLM project to improve LLM serving efficiency, especially around key-value cache handling. It is associated with higher throughput and better memory utilization in multi-request inference servers. Example: a self-hosted chat service uses vLLM and PagedAttention to serve more concurrent sessions on the same GPUs. See also: Continuous batching, Inference, Quantization, Tensor parallelism.
Quantization
Quantization reduces the numerical precision of model weights or activations, such as moving from FP16 to INT8 or INT4, to lower memory use and often improve inference efficiency. The tradeoff is that aggressive quantization can reduce quality or stability. Example: an edge deployment runs a smaller quantized model because the full-precision version does not fit in available VRAM. See also: QLoRA, Inference, Tensor parallelism, PagedAttention.Speculative decoding
Speculative decoding is a serving technique where a smaller draft model proposes tokens and a larger target model verifies them, reducing latency when the draft is often correct. It is a systems optimization rather than a training method. Example: a provider speeds up chat responses by letting a small model guess several next tokens before the larger model confirms them. See also: Streaming, Inference, Continuous batching, Quantization.Tensor parallelism
Tensor parallelism splits model computation across multiple GPUs, typically along tensor dimensions, so larger models can run when they do not fit on one device. It is a standard serving strategy in distributed inference stacks. Example: a 70B model is partitioned across several GPUs so one request can be processed cooperatively. See also: Inference, Quantization, PagedAttention, Continuous batching.Agent patterns
AP2 (Agent Payments Protocol)
AP2 refers to Agent Payments Protocol, a term used for delegated payment flows by software agents. The idea is that an agent can be authorized to initiate or coordinate a payment action under defined rules rather than holding unrestricted spending power. Example: a travel agent can be allowed to book a hotel up to a preset budget after user approval. See also: Tool use, MCP, Multi-agent, System prompt.
CoT (Chain of Thought)
Chain of Thought is a prompting or training pattern where a model is encouraged to reason through intermediate steps before producing an answer. The term is widely used in research and product discussions, though many production systems avoid exposing full reasoning traces to end users. Example: a math assistant performs intermediate calculations before returning the final result. See also: ReAct, Reflection, Temperature, Pass@k.MCP (Model Context Protocol)
MCP is an open protocol that standardizes how applications provide context and tools to language models. In the AI agent glossary 2026, MCP is one of the most important interoperability terms because it gives clients and servers a shared way to expose resources, prompts, and tools. Example: a desktop assistant connects to an MCP server to access a local codebase or company knowledge source through a standard interface. See also: Tool use, Function calling, RAG, Multi-agent.Multi-agent
Multi-agent describes systems where more than one agent participates in a workflow, often with roles such as planner, specialist, reviewer, or supervisor. Coordination can be hierarchical, peer-to-peer, or swarm-like. Example: one agent writes code, another runs tests, and a third reviews the patch before submission. See also: Agent loop, Reflection, ReAct, MCP.ReAct
ReAct is a prompting pattern that interleaves reasoning and acting, so the model alternates between thinking about the problem and using tools to gather information or make progress. It is one of the clearest templates for agentic behavior. Example: an assistant reasons about what it needs, calls a search tool, reads the result, and then decides the next action. See also: Agent loop, Tool use, Reflection, CoT.Reflection
Reflection is the pattern where an agent critiques its own output, checks for errors, and tries again before finalizing. It can improve reliability when paired with tests, validators, or explicit review prompts. Example: a code agent writes a patch, notices a failing test, and revises the implementation before returning it. See also: ReAct, Multi-agent, RLVR, Agent loop.What does MCP standardize for model-connected apps?
The MCP site describes a shared protocol for exposing context and capabilities to models. In practice, that means clients and servers can agree on how to surface tools, resources, and prompts without every integration inventing its own custom wiring.
Evaluation
HumanEval
HumanEval is an OpenAI coding benchmark that measures whether a model can generate code that passes unit tests for programming tasks. It is often reported with pass@k metrics rather than a single deterministic score. Example: a model may solve a problem on its second sampled attempt even if the first attempt fails. See also: Pass@1, Pass@k, SWE-Bench, RLVR.
MMLU
MMLU stands for Massive Multitask Language Understanding, a benchmark covering many academic and professional subjects. It is used to compare broad knowledge and reasoning performance across models. Example: a model’s MMLU score can indicate stronger general question-answering ability, even if it says little about tool use. See also: HumanEval, SWE-Bench, Pass@1, Pre-training.Pass@1, Pass@k
Pass@1 measures whether a model succeeds on the first attempt, while pass@k measures whether at least one of the first k attempts succeeds. These metrics are common in coding and reasoning benchmarks where multiple samples can be generated. Example: a coding model with modest pass@1 but stronger pass@5 may benefit from retry loops or candidate selection. See also: HumanEval, SWE-Bench, Temperature, Reflection.SWE-Bench
SWE-Bench is a benchmark for software engineering tasks based on real GitHub issues and repositories. It measures whether a model or agent can produce code changes that resolve the issue and pass tests. Example: an agent that can search a repository, edit files, and run tests is better aligned with SWE-Bench than a plain chat model. See also: HumanEval, Pass@k, RLVR, Agent loop.RAG and retrieval
Chunking
Chunking is the process of splitting documents into smaller pieces so they can be embedded, indexed, and retrieved efficiently. The chunk size and overlap affect recall, precision, and context quality. Example: a long policy PDF may be split into section-sized chunks before being stored in a vector database. See also: RAG, Embeddings, Reranking, HNSW.
HNSW (Hierarchical Navigable Small World)
HNSW is an approximate nearest-neighbor indexing algorithm commonly used in vector search systems. It is designed to make similarity search over embeddings fast enough for production retrieval workloads. Example: a vector database uses HNSW to quickly find the most similar document chunks for a user query. See also: Embeddings, RAG, Chunking, Reranking.RAG (Retrieval-Augmented Generation)
RAG is the pattern where a system retrieves relevant external documents and places them into the model’s context before generation. In the AI agent glossary 2026, RAG remains a foundational term because it is the standard way to ground answers in current or private data. Example: a support bot retrieves the latest product documentation and answers using those passages rather than relying only on model memory. See also: Chunking, Embeddings, Reranking, MCP.Reranking
Reranking is a second-stage retrieval step that rescoring candidate documents or chunks after the initial search. It is used to improve relevance before the final context is sent to the model. Example: a system retrieves 20 chunks by vector similarity, then reranks them so the top 5 most relevant passages go into the prompt. See also: RAG, Chunking, Embeddings, HNSW.Why do chunking and reranking matter so much in RAG?
Retrieval quality depends on what gets indexed and what gets selected for the final prompt. Poor chunk boundaries can hide relevant facts, while reranking helps surface the passages that best match the query after the first-pass vector search.
Infrastructure
Cold start
A cold start is the extra latency that occurs when a serverless or on-demand runtime has to initialize before serving a request. In agent systems, cold starts can affect tool latency as much as model latency. Example: the first invocation of a document parser may take much longer than later requests because the container is still booting. See also: Streaming, Inference, Continuous batching, Tool use.
Continuous batching
Continuous batching is a serving strategy that admits new requests into an active inference batch instead of waiting for a fixed batch to fill and finish. It improves GPU utilization and throughput in production LLM servers. Example: a chat API can keep adding incoming requests to the scheduler while earlier requests are still decoding. See also: PagedAttention, Inference, Streaming, Tensor parallelism.Inference
Inference is the act of running a trained model to produce outputs, as distinct from training or fine-tuning it. In the AI agent glossary 2026, inference is the umbrella term behind latency, throughput, memory use, and serving cost. Example: every time a user asks a coding assistant for a patch, the model is performing inference. See also: Quantization, Continuous batching, Tensor parallelism, Streaming.Frequently asked questions
What is the difference between tool use and function calling?
They usually refer to the same basic pattern: the model emits a structured request for an external capability instead of only returning text. OpenAI documents this in its platform docs at platform.openai.com/docs, while Anthropic documents tool-enabled workflows at docs.anthropic.com.
Why does MCP matter in agent infrastructure?
MCP matters because it is an open protocol for connecting models to context and tools in a standardized way. The official overview is at modelcontextprotocol.io, and Anthropic links MCP-related concepts from its developer documentation at docs.anthropic.com.
Where can I verify common LLM and transformer terms?
A good starting point is the Hugging Face glossary at huggingface.co/docs/transformers/glossary. For platform-specific terminology around prompts, tools, and responses, use the official docs from OpenAI and Anthropic.
Primary sources
- Anthropic documentation — Anthropic
- OpenAI platform documentation — OpenAI
- Hugging Face glossary — Hugging Face
- Model Context Protocol — Model Context Protocol
- LangGraph documentation — LangChain
- vLLM project — vLLM
- HumanEval paper page — arXiv
- MMLU paper page — arXiv
- SWE-Bench — SWE-Bench
Last updated: May 26, 2026. Related: Agent Infrastructure.