LLM Training Glossary 2026: 20 Terms Every Builder Should Know

Surya Koritala
25 Min Read

This LLM training glossary covers 20 essential terms every builder shipping language-model systems should know in 2026.

Twenty terms cover most conversations about modern LLM training and post-training. This glossary defines the core vocabulary behind model development, optimization, alignment, and adaptation, from pre-training and attention to RLHF, DPO, LoRA, and QLoRA. Each entry gives a concise definition, an example, and cross-links to related reading such as AI Agent Glossary 2026 and Open-Weight Models for Agents 2026.

Activation

An activation is the output value produced by a neuron or layer after applying a transformation to its inputs. In transformer models, activations are intermediate values passed between layers during the forward pass, and they are also used during backpropagation to compute gradients. Activation memory often becomes a practical bottleneck in training large models because those intermediate tensors must be stored or recomputed.

Many architectures also use activation functions such as GELU or SiLU to introduce nonlinearity, which helps the model represent more complex patterns than a purely linear stack could learn.
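As a rough illustration of the two activation functions named above, the sketch below evaluates SiLU and the exact (erf-based) form of GELU on single floats using only the standard library. The function names are chosen for this example; frameworks apply the same formulas elementwise to tensors, and some use a tanh approximation of GELU instead.

```python
import math


def silu(x: float) -> float:
    """SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    # Both functions pass positive inputs through almost unchanged and
    # squash negative inputs toward zero without a hard cutoff.
    for x in (-2.0, 0.0, 2.0):
        print(f"x={x:+.1f}  silu={silu(x):+.4f}  gelu={gelu(x):+.4f}")
```

Unlike ReLU, both curves are smooth and slightly negative for small negative inputs, which is the nonlinearity the paragraph above refers to.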

Example: If a team enables activation checkpointing in a transformer training run, it is trading extra compute for lower memory use by recomputing some intermediate activations instead of storing all of them.

See also: Gradient, Loss function, Attention, Self-attention. Related reading: AI Agent Glossary 2026.


📌 Why it matters. Activation memory is one reason training throughput, sequence length, and hardware choice are tightly linked in large-model training.

Attention

Attention is the mechanism that lets a model weigh different parts of its input when producing an output. In transformer architectures introduced in Attention Is All You Need, attention replaces recurrence with a learned weighting process that helps the model decide which tokens matter most for the current computation.

The mechanism is central to how LLMs model long-range dependencies in text, code, and multimodal inputs. Different implementation choices, such as multi-head attention and grouped-query attention, affect efficiency, memory use, and inference speed.
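To make the memory effect of grouped-query attention concrete, here is a back-of-the-envelope sketch of key/value cache size at inference time. The layer count, head counts, head dimension, and sequence length are hypothetical values chosen for illustration, not the configuration of any specific model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading factor of 2 counts one tensor for keys and one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical 32-layer model, head_dim 128, 4096-token sequence, 16-bit values.
# Multi-head attention keeps one key/value head per query head (32 here);
# grouped-query attention shares each key/value head across several query heads.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)

if __name__ == "__main__":
    print(f"multi-head: {mha / 2**30:.2f} GiB, grouped-query: {gqa / 2**30:.2f} GiB")
```

Under these assumptions, cutting the key/value heads from 32 to 8 shrinks the cache by a factor of four, which is why the choice shows up in both memory use and inference speed.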

Example: When a coding model predicts the closing bracket of a function, attention helps it focus on earlier tokens that define the function structure and indentation context.

See also: Self-attention, Context window, Tokenization, Embedding. Related reading: Open-Weight Models for Agents 2026.

“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.”

Vaswani et al., Attention Is All You Need

Batch size

Batch size is the number of training examples processed together before the model updates its parameters. In practice, teams distinguish between micro-batch size, per-device batch size, and effective global batch size, especially when using gradient accumulation or distributed training.

Batch size affects memory use, optimization dynamics, and throughput. Larger batches can improve hardware utilization, but they may also require learning-rate adjustments and can change convergence behavior.

Example: A training job might use a micro-batch of 4 sequences per GPU with 16 gradient-accumulation steps on 8 GPUs, giving an effective batch size of 4 × 16 × 8 = 512 sequences.
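The relationship between micro-batches and the effective batch can be sketched in plain Python. The example assumes 16 accumulation steps on 8 devices (illustrative numbers) and uses a toy one-parameter model `y = w * x` with made-up data to show that averaging micro-batch gradients reproduces the full-batch gradient when all micro-batches have the same size.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]


def effective_batch_size(micro_batch: int, accumulation_steps: int, num_devices: int) -> int:
    """Sequences contributing to one optimizer update."""
    return micro_batch * accumulation_steps * num_devices


def grad_mse(w: float, batch: Batch) -> float:
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: Batch, micro_batch: int) -> float:
    """Accumulate gradients over equal-sized micro-batches before one update."""
    chunks = [data[i:i + micro_batch] for i in range(0, len(data), micro_batch)]
    # Scaling each micro-batch gradient by 1/len(chunks) turns the sum into a mean.
    return sum(grad_mse(w, chunk) / len(chunks) for chunk in chunks)


# Toy data, invented for this example.
data: Batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
               (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]

if __name__ == "__main__":
    print(effective_batch_size(micro_batch=4, accumulation_steps=16, num_devices=8))
    print(accumulated_grad(0.5, data, micro_batch=2), grad_mse(0.5, data))
```

The equivalence holds for a mean-reduced loss with equal-sized micro-batches; with uneven micro-batches or token-level averaging over variable-length sequences, the scaling needs more care.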

See also: Learning rate, Gradient, Loss function, Parameter count. Related reading: AI Observability Stack 2026.

| Term | What it measures | Typical training implication |
| --- | --- | --- |
| Batch size | Examples processed before an update | Affects throughput, memory, and optimization behavior |
| Learning rate | Step size for parameter updates | Affects stability and convergence speed |
| Loss function | Training objective being minimized | Defines what counts as error |
| Gradient | Direction and rate of change of the loss per parameter | Computed by backpropagation; drives optimizer updates |

Four optimization terms that often appear together in LLM training runs.

Constitutional AI

Constitutional AI is an alignment method associated with Anthropic in which a model is trained to critique and revise its own outputs according to a set of written principles, or a constitution, rather than relying only on human preference labels. Anthropic described the approach in its 2022 paper Constitutional AI: Harmlessness from AI Feedback.

The method is designed to reduce harmful, biased, or policy-violating outputs while making the alignment process more explicit and scalable. It is often discussed alongside RLHF because both are post-training approaches aimed at shaping model behavior after pre-training.

Example: A model may generate an answer, critique it against a rule such as avoiding harmful instructions, and then produce a revised response that better follows that principle.
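The critique-and-revise loop in this example can be sketched as a small function. Everything here is illustrative: `generate` stands in for a call to a language model, the prompt wording is invented, and `toy_generate` is a canned stub so the sketch runs without a model. It is not Anthropic's implementation.

```python
from typing import Callable, List


def critique_and_revise(prompt: str, principles: List[str],
                        generate: Callable[[str], str]) -> str:
    """Draft an answer, then critique and revise it once per principle."""
    draft = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {draft}"
        )
        draft = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft


def toy_generate(text: str) -> str:
    """Hypothetical stand-in for a model call; returns canned strings."""
    if text.startswith("Critique"):
        return "The draft could be read as giving harmful instructions."
    if text.startswith("Rewrite"):
        return "revised answer"
    return "first draft"


if __name__ == "__main__":
    print(critique_and_revise("How do I pick a lock?",
                              ["Avoid harmful instructions."], toy_generate))
```

In the published method, revised responses like these become supervised fine-tuning data, so the loop is a data-generation step rather than something run at inference time.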

See also: RLHF, DPO, Supervised fine-tuning (SFT), Loss function. Related reading: AI Governance Checklist 2026.

⚠️ Scope note. Constitutional AI is an alignment method, not a substitute for evaluation, policy enforcement, or deployment controls.

Context window

The context window is the maximum amount of input and generated text a model can consider in a single inference or training sequence, usually measured in tokens. It sets a hard limit on how much conversation history, code, retrieved documentation, or other serialized input can fit into one pass.

A larger context window can improve tasks that depend on long documents or multi-step reasoning over many prior tokens, but it also raises compute and memory costs. The useful value of a long context depends on training, architecture, and retrieval strategy, not only on the headline token limit.

Example: If a model has a 128K-token context window, a developer can place a long codebase excerpt and instructions into one prompt, provided the total token count stays within that limit.
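A minimal sketch of the budgeting check implied by this example follows. It assumes "128K" means 128,000 tokens (some vendors mean 131,072) and that generated tokens count against the same window; the function names are hypothetical.

```python
def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    context_window: int = 128_000) -> bool:
    """True if the prompt plus the requested output fits in one pass."""
    return prompt_tokens + max_new_tokens <= context_window


def max_output_budget(prompt_tokens: int, context_window: int = 128_000) -> int:
    """Tokens left for generation after the prompt is placed in the window."""
    return max(0, context_window - prompt_tokens)


if __name__ == "__main__":
    print(fits_in_context(prompt_tokens=120_000, max_new_tokens=4_000))
    print(max_output_budget(prompt_tokens=120_000))
```

Token counts should come from the tokenizer of the target model, since the same text can produce different counts under different tokenizers.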

See also: Tokenization, Attention, Self-attention, Embedding. Related reading: Retrieval-Augmented Generation 2026.

Distillation

Distillation is the process of training a smaller model to imitate the behavior of a larger model, often called the teacher. The student model learns from teacher outputs, logits, or synthetic examples so that it can preserve useful capabilities while reducing inference cost, latency, or deployment footprint.

Distillation is widely used when teams want a model that is cheaper to serve or easier to run on constrained hardware. It can also be used to transfer behavior from a proprietary or larger open model into a smaller open-weight model, subject to licensing and data constraints.

Example: A team may use a larger instruction-tuned teacher to generate high-quality reasoning traces and then fine-tune a smaller student model on those outputs.
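For the logit-matching variant mentioned above, a common formulation (following Hinton et al.'s knowledge-distillation setup) minimizes the KL divergence between temperature-softened teacher and student distributions. The sketch below works on one token position with plain lists; the logits and the temperature of 2.0 are illustrative assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


if __name__ == "__main__":
    print(distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Training on teacher-generated text, as in the example above, needs no access to teacher logits and reduces to ordinary supervised fine-tuning on the generated outputs.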

See also: Pre-training, Supervised fine-tuning (SFT), LoRA, Parameter count. Related reading: Open-Weight Models for Agents 2026.

DPO

DPO, short for Direct Preference Optimization, is a post-training method that learns from preference data without the separate reward-model stage used in standard RLHF pipelines. The method was introduced in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

In practice, DPO trains a model to prefer chosen responses over rejected ones using a supervised-style objective derived from preference comparisons. Teams often consider it because it can simplify alignment workflows while still using pairwise human or synthetic preference data.

Example: If annotators choose answer A over answer B for the same prompt, DPO updates the model so answer patterns like A become more likely than patterns like B.
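The objective from the DPO paper can be written for a single preference pair as follows. The sketch assumes you already have summed log-probabilities of each response under the policy being trained and under a frozen reference model; the numeric inputs and the `beta` value of 0.1 are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference model anchors the update: the loss falls only when the policy raises the chosen response relative to the rejected one by more than the reference already does.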

See also: RLHF, Supervised fine-tuning (SFT), Constitutional AI, Loss function. Related reading: AI Agent Evals 2026.

Embedding

An embedding is a dense numerical representation of text, code, images, or other data in a vector space. In language models, token embeddings map discrete tokens into continuous vectors that the network can process, and separate embedding models are also used for retrieval, clustering, and semantic search.

Embeddings make similarity computations practical because related items tend to appear closer together in vector space. They are foundational to retrieval-augmented generation systems and many ranking pipelines used around LLMs.

Example: A documentation search system can embed both user queries and knowledge-base passages, then retrieve the passages with the highest vector similarity to the query.
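The retrieval step in this example can be sketched with cosine similarity over toy vectors. The three-dimensional embeddings and passage names below are invented for illustration; a real system would obtain vectors from an embedding model and usually search them with a vector index.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], passages: Dict[str, List[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k passages most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in passages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical passage embeddings and query embedding.
passages = {
    "reset-password": [0.9, 0.1, 0.0],
    "billing-faq": [0.1, 0.9, 0.1],
    "api-auth": [0.7, 0.2, 0.3],
}
query = [1.0, 0.0, 0.1]

if __name__ == "__main__":
    for name, score in top_k(query, passages):
        print(f"{name}: {score:.3f}")
```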

See also: Tokenization, Attention, Context window, Pre-training. Related reading: Vector Databases for AI Agents 2026.

Gradient

A gradient is the derivative of the loss with respect to model parameters. It points in the direction that increases the loss fastest, so moving parameters in the opposite direction reduces error. Backpropagation computes these gradients, and the optimizer then uses them to update weights in the direction expected to improve the model on the training objective.

Gradients can become unstable in deep networks if they explode or vanish, which is one reason training recipes include normalization, clipping, optimizer tuning, and careful learning-rate schedules. In distributed training, gradients are also synchronized across devices before updates are applied.

Example: If the model predicts the wrong next token with high confidence, the loss is large, and the resulting update pushes the relevant parameters away from that mistaken prediction pattern.
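Gradient clipping, one of the stabilizers mentioned above, can be sketched as rescaling by the global norm. This is a minimal version on a flat list of floats; framework implementations operate on all parameter tensors at once and typically add a small epsilon to the denominator.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within bounds, leave unchanged
    scale = max_norm / total_norm
    # Scaling every component equally preserves the gradient direction.
    return [g * scale for g in grads]


if __name__ == "__main__":
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

Clipping limits the size of a single update without changing its direction, which helps when an occasional batch produces an unusually large gradient.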

See also: Loss function, Learning rate, Batch size, Activation. Related reading: AI Observability Stack 2026.

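# One training step. Assumes `model`, `optimizer`, `input_ids`, and `labels` already exist.
loss = model(input_ids=input_ids, labels=labels).loss  # forward pass returns the training loss
loss.backward()        # computes gradients for trainable parameters
optimizer.step()       # applies the parameter update using those gradients
optimizer.zero_grad()  # clears gradients so they do not accumulate into the next step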

Learning rate

Learning rate is the step size used when updating model parameters during optimization. It is one of the most sensitive hyperparameters in training because a value that is too high can destabilize learning, while a value that is too low can make convergence slow or ineffective.

Most large-model training runs use schedules rather than a constant value, such as warmup followed by decay. The right setting depends on optimizer choice, batch size, model scale, and the stage of training or fine-tuning.
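A warmup-then-decay schedule of the kind described above can be sketched as linear warmup followed by cosine decay. The peak rate, warmup length, and total steps in the usage lines are illustrative assumptions, and cosine is only one of several decay shapes in use.

```python
import math


def lr_at_step(step: int, peak_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for step in (0, 50, 100, 550, 1000):
        print(step, lr_at_step(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000))
```

Warmup avoids large updates while optimizer statistics are still noisy; the decay phase lets training settle as the loss flattens.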

Example: A fine-tuning run on an instruction dataset may use a smaller learning rate than pre-training to avoid overwriting useful capabilities learned from the base model.

See also: Batch size, Gradient, Loss function, Supervised fine-tuning (SFT).

LoRA

LoRA, short for Low-Rank Adaptation, is a parameter-efficient fine-tuning method introduced in the paper LoRA: Low-Rank Adaptation of Large Language Models. Instead of updating every model weight, LoRA freezes the base model and trains small low-rank matrices inserted into selected layers.

This reduces the number of trainable parameters and lowers memory requirements relative to full fine-tuning. It has become a standard method for adapting open-weight models to domain-specific tasks, styles, or instruction formats.
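The parameter savings can be checked with simple arithmetic, and the adapted forward pass can be sketched on plain lists. Following the LoRA paper, the update is `W x + (alpha / r) * B (A x)` with `B` initialized to zero; the 4096-wide layer, rank 8, and the tiny matrices below are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """A is (rank x d_in) and B is (d_out x rank); only these are trained."""
    return rank * (d_in + d_out)


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: List[float], w: Matrix, a: Matrix, b: Matrix,
                 alpha: float, rank: int) -> List[float]:
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


if __name__ == "__main__":
    full = 4096 * 4096
    lora = lora_trainable_params(4096, 4096, rank=8)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

With `B` at zero the adapter contributes nothing, so training starts from exactly the base model's behavior.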

Example: A team can adapt a base coding model to an internal API style guide by training LoRA adapters on company-specific examples without modifying the full base checkpoint.

See also: QLoRA, Supervised fine-tuning (SFT), Distillation, Parameter count. Related reading: Open-Weight Models for Agents 2026.

Loss function

A loss function is the mathematical objective the model is trained to minimize. In language modeling, a common objective is next-token prediction loss, often implemented as cross-entropy between the model’s predicted token distribution and the true token.

Different post-training methods use different losses even when they start from the same base model. Preference optimization, ranking objectives, contrastive learning, and policy optimization all define error in different ways.

Example: During pre-training, if the true next token is ‘database’ and the model assigns it low probability, the cross-entropy loss increases and the model is updated to make that token more likely in similar contexts.
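The cross-entropy in this example can be computed directly from logits. The three-token vocabulary and logit values below are invented for illustration; a real model produces one logit per vocabulary entry.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)
    # log-sum-exp with max subtraction avoids overflow for large logits.
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


# Hypothetical vocabulary; index 0 is the true next token.
vocab = ["database", "table", "cat"]

if __name__ == "__main__":
    print("confident and right:", cross_entropy([5.0, 0.0, 0.0], target_index=0))
    print("confident and wrong:", cross_entropy([0.0, 0.0, 5.0], target_index=0))
```

The loss is small when the true token already has high probability and grows as probability mass moves to other tokens.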

See also: Gradient, Learning rate, Batch size, DPO.

Mixture of Experts (MoE)

Mixture of Experts, or MoE, is an architecture in which only a subset of specialized feed-forward sub-networks, called experts, are activated for each token or input. A routing mechanism decides which experts to use, allowing the model to scale total parameter count without activating all parameters on every forward pass.

This can improve compute efficiency relative to a dense model of similar total size, though MoE systems introduce routing complexity, load-balancing challenges, and deployment tradeoffs. MoE designs are now common in frontier model architecture discussions and in some open-weight releases.

Example: In an MoE transformer, one token may be routed to two experts specialized for code-like patterns while another token in the same sequence is routed to different experts.
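Routing of this kind is often implemented as top-k selection over router scores. The sketch below shows one common variant, a softmax over only the selected experts' logits; the logit values are made up, and real routers differ in normalization details and add load-balancing terms that are omitted here.

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and normalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over just the selected logits so the gate weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    # Hypothetical router scores for one token over four experts.
    for expert, weight in top_k_route([0.1, 2.0, -1.0, 1.0], k=2):
        print(f"expert {expert}: weight {weight:.3f}")
```

The token's output is the weighted sum of the selected experts' outputs, so the unselected experts cost no compute for that token.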

See also: Parameter count, Activation, Attention, Pre-training. Related reading: Open-Weight Models for Agents 2026.

📌 Common confusion. A model’s total parameter count in an MoE architecture is not the same thing as the number of parameters activated on each token.

Parameter count

Parameter count is the total number of learned weights in a model. It is often used as a rough proxy for model scale, but it does not by itself determine quality, latency, memory footprint, data efficiency, or downstream performance.

Architecture, training data, tokenizer design, optimization recipe, and post-training all influence capability. In MoE systems, published parameter counts can be especially easy to misread because only part of the model may be active for a given token.
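The gap between total and active parameters in an MoE model follows from simple arithmetic. In this sketch, `shared_params` covers everything every token uses (attention, embeddings, router) and `params_per_expert` is summed across layers; all numbers are hypothetical and not taken from any published model card.

```python
from typing import Tuple


def moe_param_counts(shared_params: int, params_per_expert: int,
                     num_experts: int, experts_per_token: int) -> Tuple[int, int]:
    """Return (total parameters, parameters active for a single token)."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


if __name__ == "__main__":
    total, active = moe_param_counts(
        shared_params=2_000_000_000,
        params_per_expert=5_500_000_000,
        num_experts=8,
        experts_per_token=2,
    )
    print(f"total: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
```

Memory requirements track the total, because every expert must be loaded, while per-token compute tracks the active count.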

Example: Two models with similar parameter counts can perform very differently if one was trained on more diverse data or received stronger post-training and evaluation.

See also: Mixture of Experts (MoE), Distillation, LoRA, Pre-training. Related reading: Open-Weight Models for Agents 2026.

Pre-training

Pre-training is the initial large-scale training phase in which a model learns general patterns from broad datasets, usually by predicting the next token in a sequence. This stage creates the base model that later fine-tuning and alignment methods build on.

For modern LLMs, pre-training typically consumes the largest share of data, compute, and time in the model-development lifecycle. Choices made here, including corpus composition, tokenizer design, architecture, and optimization schedule, strongly shape downstream capability.

Example: A base language model trained on web text, books, code, and reference material before any instruction tuning has undergone pre-training but not yet task-specific post-training.
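Next-token prediction turns any token sequence into training pairs by shifting it one position. A minimal sketch, with made-up token IDs:

```python
from typing import List, Tuple


def next_token_pairs(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Inputs are all tokens but the last; targets are the same tokens shifted left."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets


if __name__ == "__main__":
    inputs, targets = next_token_pairs([5, 7, 9, 11])  # hypothetical token IDs
    for position, target in enumerate(targets):
        # With causal attention, position i sees inputs[0..i] and predicts targets[i].
        print(f"context={inputs[:position + 1]} -> target={target}")
```

Because every position supplies a target, one sequence of length n yields n - 1 training signals without any manual labeling.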

See also: Tokenization, Embedding, Attention, Supervised fine-tuning (SFT). Related reading: Open-Weight Models for Agents 2026.

QLoRA

QLoRA is a fine-tuning method introduced in the paper QLoRA: Efficient Finetuning of Quantized LLMs that combines low-rank adapters with quantized base-model weights. The approach lets teams fine-tune large models with much lower memory use than full-precision full fine-tuning.

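In the original method, the base model weights are quantized to 4 bits and kept frozen, while LoRA adapters are trained on top in higher precision, with gradients backpropagated through the quantized weights. QLoRA became widely adopted because it made instruction tuning and domain adaptation practical on more modest hardware.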

Example: A builder can fine-tune a 7B or larger open-weight model on a single machine by loading quantized weights and training only the adapter layers.
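A rough estimate of weight memory shows why quantization matters for this example. The sketch counts base-model weights only and ignores quantization constants, adapter weights, optimizer state, and activations, so actual usage will be higher; the 7B figure is the illustrative size from the example above.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 7e9
    print(f"16-bit weights: {weight_memory_gb(params, 16):.1f} GB")
    print(f" 4-bit weights: {weight_memory_gb(params, 4):.1f} GB")
```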

See also: LoRA, Supervised fine-tuning (SFT), Distillation, Parameter count. Related reading: Open-Weight Models for Agents 2026.

RLHF

RLHF stands for Reinforcement Learning from Human Feedback, a post-training approach in which human preference data is used to shape model behavior. OpenAI’s InstructGPT paper popularized the modern pipeline: supervised fine-tuning, reward-model training from ranked outputs, and reinforcement learning to optimize responses against that reward signal.

RLHF is used to improve helpfulness, harmlessness, and instruction following beyond what next-token pre-training alone provides. It is often compared with DPO because both methods learn from preference data but use it differently.

Example: Annotators rank several candidate answers to the same prompt, and those rankings are used to train a reward model that guides later policy optimization.
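Reward models in this pipeline are commonly trained with a pairwise loss of the form used in InstructGPT: the negative log-sigmoid of the reward difference between the preferred and the rejected answer. The sketch below takes two scalar rewards, which are illustrative numbers here rather than outputs of a real reward model.

```python
import math


def reward_pair_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    print("ranks pair correctly:  ", reward_pair_loss(2.0, 0.0))
    print("ranks pair incorrectly:", reward_pair_loss(0.0, 2.0))
```

A ranking of several answers is typically broken into all of its pairwise comparisons, each contributing one such term.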

See also: DPO, Constitutional AI, Supervised fine-tuning (SFT), Loss function. Related reading: AI Governance Checklist 2026.

Self-attention

Self-attention is a form of attention in which tokens in a sequence attend to other tokens in that same sequence. It is the core operation that lets transformer models build contextual representations, because each token’s representation is updated based on its relationship to surrounding tokens.

The mechanism uses queries, keys, and values to compute attention weights across the sequence. Causal self-attention, used in autoregressive LLMs, prevents tokens from attending to future positions during next-token prediction.

Example: In the sentence ‘The server restarted because it overheated,’ self-attention helps the model connect ‘it’ to ‘server’ rather than to an unrelated earlier noun.
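The query/key/value computation with a causal mask can be sketched in plain Python for a single attention head. The 2×2 matrices are toy values, and learned projection matrices, multiple heads, and batching are left out.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention where position i sees only positions 0..i."""
    d_k = len(k[0])
    out: Matrix = []
    for i, query in enumerate(q):
        # Causal mask: score only the current and earlier positions.
        scores = [sum(a * b for a, b in zip(query, k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out


if __name__ == "__main__":
    q = k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(causal_self_attention(q, k, v))
```

The first position can attend only to itself, so its output equals its own value vector; later positions mix in earlier values according to the attention weights.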

See also: Attention, Context window, Tokenization, Embedding.

Supervised fine-tuning (SFT)

Supervised fine-tuning, usually abbreviated SFT, is the process of training a pre-trained model on labeled prompt-response pairs or other task-specific examples. It is often the first post-training step used to make a base model follow instructions, adopt a format, or specialize in a domain.

SFT differs from pre-training because the data is narrower and the objective is tied to desired outputs rather than broad next-token learning over a massive corpus. It also commonly serves as the starting policy for later alignment methods such as RLHF or DPO.

Example: A company can SFT an open-weight base model on curated customer-support transcripts so the model learns the expected answer style and escalation format.
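Many SFT pipelines compute the loss only on response tokens, so the model learns to produce answers rather than to reproduce prompts. The sketch below builds one such example; it assumes the convention of marking ignored positions with -100, which is the default `ignore_index` of PyTorch's cross-entropy loss, and the token IDs are made up.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumed convention: positions with this label add no loss


def build_sft_example(prompt_ids: List[int], response_ids: List[int]) -> Dict[str, List[int]]:
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    example = build_sft_example(prompt_ids=[101, 7, 8, 9], response_ids=[21, 22, 102])
    print(example)
```

Whether to mask the prompt is a design choice; some recipes train on the full sequence instead.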

See also: Pre-training, RLHF, DPO, LoRA. Related reading: Open-Weight Models for Agents 2026.

Tokenization

Tokenization is the process of splitting raw text into smaller units, called tokens, that a model can process. Tokens are not always words; depending on the tokenizer, they may be whole words, subwords, punctuation marks, or byte-level fragments.

Tokenizer design affects vocabulary size, sequence length, efficiency, and multilingual behavior. Because context windows are measured in tokens rather than characters or words, tokenization directly affects prompt budgeting and training cost.

Example: The same paragraph may consume a different number of tokens under different tokenizers, which changes both inference cost and how much text fits into the model’s context window.
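The effect is easy to see with two deliberately crude tokenizers: whitespace splitting and raw UTF-8 bytes. Neither is what production models use (most rely on learned subword vocabularies), but the sketch shows how the same text yields very different token counts.

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Word-level tokens: split on runs of whitespace."""
    return text.split()


def byte_tokens(text: str) -> List[int]:
    """Byte-level tokens: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    text = "na\u00efve caf\u00e9"  # two accented words; each accent takes two bytes
    print(len(whitespace_tokens(text)), "whitespace tokens")
    print(len(byte_tokens(text)), "byte tokens")
```

Subword tokenizers sit between these extremes, and how they treat accented or non-Latin text is one reason token counts vary across languages.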

See also: Embedding, Context window, Pre-training, Self-attention. Related reading: AI Agent Glossary 2026.

Frequently asked questions

What is the difference between pre-training and fine-tuning?

Pre-training is the broad initial phase where a model learns general language patterns from large corpora, usually through next-token prediction, while fine-tuning adapts that base model to narrower tasks or behaviors. For formal definitions and implementation context, see Hugging Face’s Transformers documentation and the original InstructGPT paper.

Is DPO the same thing as RLHF?

No. Both use preference data, but standard RLHF typically includes a separate reward model and reinforcement-learning stage, while DPO optimizes preferences directly without that separate reward-model training step. The distinction is described in Direct Preference Optimization and contrasted with the RLHF pipeline in InstructGPT.

Why do LoRA and QLoRA matter for builders?

They reduce the memory and compute needed to adapt large models, which makes domain-specific tuning more practical on limited hardware. The original methods are documented in LoRA and QLoRA, and they are especially relevant when working with open-weight models.

Does a larger parameter count always mean a better model?

No. Parameter count is only one measure of scale, and performance also depends on architecture, data quality, tokenizer design, optimization, and post-training. The transformer foundation is described in Attention Is All You Need, while practical model selection depends on broader evaluation criteria than size alone.

Last updated: May 21, 2026. Related: Agent Infrastructure.
