This reference distills the essential terminology of large language model (LLM) research and engineering into a single, accessible guide. From architectures and core components to training strategies and evaluation benchmarks, it provides the precise definitions needed to navigate technical papers, model documentation and benchmark results.
If you work with large language models, you know that the jargon can quickly get dense. This guide consolidates the most important terms, from architectures and core components to training strategies and evaluation benchmarks, in one place, with precise definitions that you can rely on when reading research papers, model documentation or benchmark results.
Model Architectures & Types
- Transformer: Neural architecture based entirely on attention mechanisms, discarding recurrent (RNN) and convolutional (CNN) networks. Delivers strong performance in tasks such as machine translation, with higher parallelism and shorter training times.
- Encoder-Decoder Architecture: Standard sequence transduction structure. The encoder processes the input sequence, and the decoder generates the output. In the highest-performing models, the two are connected via an attention mechanism.
- Decoder-Only Architecture: Transformer design using only the decoder stack, with causal self-attention restricting tokens to attend only to previous positions. Optimized for autoregressive generation (e.g., text completion, code and dialogue). Simpler and more efficient than encoder-decoder architecture for generative tasks; used in most modern LLMs such as GPT, LLaMA and Qwen.
- Bidirectional Encoder Representations from Transformers (BERT): Bidirectional Transformer encoder pre-trained with masked language modelling (MLM) and next sentence prediction (NSP). Jointly attends to left and right contexts. Common sizes are BERT-BASE (110M parameters) and BERT-LARGE (340M).
- Masked Language Model (MLM): Pre-training objective masking ~15% of WordPiece tokens — 80% replaced with [MASK], 10% with random tokens and 10% unchanged. The model predicts originals from both the left and right contexts.
- Next Sentence Prediction (NSP): Pre-training objective predicting whether the second sentence follows the first (50% true, 50% random).
- OpenAI GPT: Left-to-right Transformer using causal self-attention, where tokens attend only to preceding context.
- Embeddings from Language Models (ELMo): Concatenates features from independently trained left-to-right and right-to-left long short-term memory (LSTM) networks.
- DistilBERT: Compressed BERT retaining ~97% of its performance, with reduced size, cost and latency.
- Enhanced Language Representation with Informative Entities (ERNIE): Incorporates entity information from knowledge graphs into language representations.
- Generalist Language Model (GLaM): MoE-based architecture outperforming GPT-3 in NLU/NLG, with lower FLOPs per token and reduced training energy.
- Mixture of Experts (MoE): Uses a gating module to dynamically select a small subset of experts (e.g., 2 of 64) per token; outputs are combined before the next layer.
- Pathways Language Model (PaLM): Model series with sizes from 8B to 540B parameters, trained on 780B high-quality tokens.
- phi-1/phi-1-base/phi-1-small: Code generation models differing in API usage accuracy and logical consistency.
- Toolformer: Learns to use external tools such as WikiSearch or machine translation APIs.
- WebGPT: Browser-augmented question answering model that uses human feedback.
- InstructGPT: Fine-tuned with human feedback to better follow instructions.
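The Mixture of Experts entry above describes a gating module that selects a small subset of experts per token and combines their outputs. The following is a minimal sketch of that routing step, not the implementation of any particular model. The function names (`top_k_route`, `moe_forward`), the toy experts and the choice to renormalize gate weights over only the selected experts are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, gate_logits, k=2):
    """Combine the outputs of the selected experts as a weighted sum."""
    out = [0.0] * len(x)
    for idx, w in top_k_route(gate_logits, k):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy usage: four hypothetical experts that each scale the input differently.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # in a real model these come from a learned gate
print(top_k_route(gate_logits))
print(moe_forward([1.0, -1.0], experts, gate_logits))
```

Because only k experts run per token, compute per token stays roughly constant as the total number of experts grows, which is the property the GLaM entry refers to with lower FLOPs per token.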
Model Components & Mechanisms
- Attention Mechanism: Maps a query (Q) and key-value pairs (K, V) to an output as the weighted sum of values, with weights determined by a compatibility function.
- Query (Q), Key (K), Value (V): Vector components in the attention function.
- Scaled Dot-Product Attention: Computed as softmax(QKᵀ / √dₖ)V; dividing by √dₖ keeps large dot products from pushing the softmax into regions with very small gradients.
- Additive Attention: Uses a feed-forward network to compute compatibility scores; is less efficient than dot-product attention in practice.
- Self-Attention: Each token attends to all others in the sequence, capturing long-range dependencies.
- Attention Heads: Independent attention layers in multi-head attention, each focusing on different dependencies.
- Transformer Blocks: The repeated layer units in Transformer-based models such as BERT.
- Layers (L), Hidden Size (H), Self-Attention Heads (A): Core parameters defining a Transformer architecture.
- WordPiece Embeddings: Subword tokenization used in BERT, with a 30K-token vocabulary.
- [CLS] Token: Special classification token placed at the sequence start; its final hidden state represents the sequence for classification.
- [SEP] Token: Separator token distinguishing sentences or segments.
- Token, Segment and Position Embeddings: Elements combined to form token input representations.
- Rotary Position Embedding (RoPE): Encodes positional information via rotation, preserving vector norms.
- Rotary Matrix: Predefined matrix in RoPE for applying rotations.
- Linear Attention: Self-attention variant with linear complexity; compatible with RoPE.
- Content-Based Key Vectors: Weight matrices for computing content-based keys.
- Location-Based Key Vectors: Weight matrices for computing location-based keys.
- Bitwise Determinism: Ensures exact reproducibility at the bit level from any checkpoint.
- Outliers: Extreme values that reduce quantization precision; mitigated with block constants.
- Vector-Wise Quantization: Improves quantization by applying scaling per vector.
- NormalFloat 4-bit (NF4): 4-bit datatype built from the quantiles of a normal distribution, making it well suited to normally distributed weights.
- FP8: 8-bit floating-point datatype for scaling constants.
- Blocksize: Block size in quantization, affecting precision and memory.
- Trainable Matrices: Low-rank matrices updated in low-rank adaptation (LoRA) fine-tuning.
- Frozen Weights: Pre-trained weights kept fixed during tuning.
- d_model: Transformer layer input/output dimension.
- Low-Rank Adaptation (LoRA): A fine-tuning method that injects trainable low-rank matrices into pre-trained models while keeping original weights frozen.
- Rank (r): Rank of LoRA matrices.
- d_ffn: Dimension of Transformer multilayer perceptron (MLP), usually 4 × d_model.
- Subspace Similarity: Measures the overlap between ∆W and W subspaces.
- Segment-Level Recurrence Mechanism: Transformer-XL technique reusing hidden states to extend context.
- Relative Positional Encodings: Encode relative distances into attention scores.
- Bi-Encoder Architecture: Uses separate encoders for queries and documents.
- Maximum Inner Product Search (MIPS): Retrieves top-k documents with the highest similarity scores.
- Non-Parametric Memory: Document index in retrieval-augmented models.
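The scaled dot-product attention formula listed above, softmax(QKᵀ / √dₖ)V, can be written out in a few lines. This is a minimal pure-Python sketch for small lists of row vectors. The function name and the optional `causal` flag (which applies the mask described under Decoder-Only Architecture) are illustrative choices, and production code would use batched tensor operations instead.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0 for masked entries."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Compatibility score between query i and every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Causal mask: position i may not attend to positions j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

# Toy self-attention usage: the same matrix serves as Q, K and V.
X = [[1.0, 0.0], [0.0, 1.0]]
print(scaled_dot_product_attention(X, X, X))
print(scaled_dot_product_attention(X, X, X, causal=True))
```

With the causal flag set, the first token can only attend to itself, so its output row equals its own value vector.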
Training Methods & Strategies
- Pre-Training: Initial training on unlabeled data using objectives such as MLM and NSP.
- Fine-Tuning: Adapting a pre-trained model to a specific task with labeled data.
- Semi-Supervised Approach: Combines unsupervised pre-training and supervised fine-tuning for transferable representations.
- Two-Stage Training Procedure: Unsupervised pre-training followed by supervised adaptation.
- Unsupervised Objective: Pre-training target that does not require labels.
- Denoising Objective: Reconstructs corrupted inputs.
- Span-Corruption Objective: Masks contiguous token spans, predicting the originals.
- MASS-Style Objective: Masks 15% of tokens, replacing them with mask tokens, then reconstructs.
- BERT-Style Denoising Objective: Similar to MLM but used in encoder–decoder models to reconstruct full sequences.
- Deshuffling Objective: Predicts the original order from shuffled tokens.
- Multi-Task Training: Trains on multiple tasks concurrently.
- Multi-Task Pre-Training: Pre-trains across multiple tasks.
- Instruction Tuning: Fine-tunes on datasets reformatted as natural language instructions.
- Prompt Engineering: Optimizing prompts for desired outputs.
- Low-Rank Adaptation (LoRA): Parameter-efficient fine-tuning that trains low-rank matrices while the pre-trained weights stay frozen; the learned update can be merged into the weights after training.
- Parameter-Efficient Approach: Fine-tuning with fewer trainable parameters.
- Random Gaussian Initialization: LoRA matrix A initialization method.
- Zero Initialization: LoRA matrix B initialization producing zero ∆W.
- Bias-Only/BitFit: Tunes only bias terms.
- Prefix-Embedding Tuning: Inserts special tokens with trainable embeddings.
- Prefix-Layer Tuning: Extends prefix tuning to layer activations.
- Adapter Tuning: Inserts adapter layers between attention/MLP modules and residuals.
- Prefix-Tuning: Optimizes continuous task-specific prefix vectors without changing model weights.
- Continuous Task-Specific Vectors: Learnable prefix parameters not tied to real tokens.
- Virtual Tokens: Prefix vectors treated as pseudo-tokens.
- Random Initialization: Randomly initializing prefixes; less effective than real-token initialization.
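- Quantized LoRA (QLoRA): LoRA applied on top of a frozen base model quantized to NF4, with double quantization of the quantization constants (stored as FP8).
- Double Dequantization: Two-step dequantization in QLoRA that first recovers the quantization constants and then uses them to convert 4-bit weights to a higher-precision format for computation.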
- Reinforcement Learning from Human Feedback (RLHF): Aligns models using human preference data.
- Reinforcement Learning from AI Feedback (RLAIF): Uses AI-generated preference labels for alignment.
- AI-Generated Preference Labels: AI-produced quality judgements for candidate outputs.
- Reward Model (RM): Predicts reward signals for RLHF/RLAIF.
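- Self-Consistency: Samples multiple chain-of-thought (CoT) reasoning chains and aggregates them, by majority vote over final answers or, in AI preference labeling, by averaging the resulting preferences.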
- Proximal Policy Optimization (PPO): RL algorithm used in RLHF.
- Supervised Fine-Tuning (SFT): Fine-tunes on curated supervised data.
- Process Supervision: Labels intermediate reasoning steps.
- Outcome Supervision: Labels only the final results.
- MathMix: Math-focused token dataset for pre-training.
- Decontamination Checks: Ensure that there is no benchmark leakage in training data.
- Weak-to-Strong Generalization: Trains a strong model under weak supervision.
- Weak Supervisor: Model producing weak labels.
- Weak Labels: Soft labels from weak supervision.
- Strong Student: Model trained under weak supervision that surpasses the supervisor.
- Imitation: Failure mode where the student copies supervisor errors.
- Human Simulator Failure Mode: Risk of models imitating human phrasing instead of optimal answers.
- Linear Probing: Adds a linear classifier to frozen models.
- Covariate Shift Problem: Training/test distribution mismatch.
- Concept Shift: Change in label meaning or distribution.
- Noisy Labels: Incorrect labels in data.
- FLOPs Per Token: Inference compute cost measure.
- Greedy Decoder: Decoding strategy selecting the highest-probability token in each step.
- AdamW Optimizer: Adam variant with decoupled weight decay; used to train models such as LLaMA and in LoRA fine-tuning.
- Cosine Learning Rate Schedule: Cosine-shaped learning rate decay.
- Weight Decay: Regularization reducing overfitting.
- Gradient Clipping: Caps gradient magnitude.
- Warmup Steps: Initial training steps during which the learning rate is gradually increased to its target value.
- Batch Size: Number of samples per training step.
- Epoch: One full pass over the training dataset.
- Learning Rate: Step size for weight updates.
- Reward Function (πrf): Source of training rewards in alignment.
- Gradient Coefficient (GC): Scales penalty/reward magnitude.
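- Rejection Sampling Fine-Tuning (RFT): Fine-tunes on model-sampled outputs that have been filtered for correct answers.
- Direct Preference Optimization (DPO): Preference-based alignment without reinforcement learning.
- Online RFT: RFT variant that samples outputs from the current policy during training instead of from the fixed SFT model.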
- Group Relative Policy Optimization (GRPO): PPO variant that drops the critic (value) model and estimates the baseline from the rewards of a group of outputs sampled for the same prompt.
- Continual Training: Continues training for domain adaptation.
Evaluation Metrics & Datasets
- BLEU: Translation quality metric, also used in code generation.
- Pass@k: Probability that at least one of k generated code samples for a problem passes the unit tests, averaged over problems.
- GLUE Benchmark: NLU benchmark with tasks such as CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, STS-B and WNLI.
- SQuAD: Reading comprehension dataset for QA.
- SICK Dataset: NLI dataset with entailment, contradiction and neutral labels.
- Open Entity: Entity classification benchmark.
- FewRel, TACRED: Relation classification datasets.
- HumanEval: Code generation benchmark.
- SuperGLUE: More challenging NLU benchmark.
- WMT Language Pairs: Translation datasets for BLEU scoring.
- MMLU: Multi-domain knowledge understanding benchmark.
- GSM8K: Grade-school maths QA benchmark.
- MATH Dataset: Advanced mathematics problem set.
- PRM800K: Large dataset with step-level labels for math problem solving.
- RealToxicityPrompts: Toxicity evaluation dataset.
- CrowS Pairs: Measures bias in models.
- TruthfulQA: Tests factual accuracy and informativeness.
- CAIL2019-SCM: Chinese long-text semantic matching dataset.
- HotpotQA, FEVER: Multi-hop question answering (HotpotQA) and fact verification (FEVER) datasets.
- MS-MARCO, Jeopardy Question Generation: Used for retrieval-augmented generation.
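- CMATH, AGIEval: Benchmarks used for Chinese evaluation; CMATH covers Chinese elementary-school maths, and AGIEval draws on human standardized exams.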
- ROUGE-1/ROUGE-2/ROUGE-L: Summarization evaluation metrics.
- Exact Match (EM): QA accuracy measure.
- F1 Score: Common classification/NER metric.
- Accuracy: Overall correctness measure.
- Precision, Recall, Micro-F1: Entity and relation extraction metrics.
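Pass@k from the list above is usually estimated by drawing n ≥ k samples per problem, counting the c samples that pass, and computing 1 − C(n−c, k) / C(n, k). The sketch below implements that per-problem estimator with the standard library. The function name and the example numbers are illustrative and not tied to any specific benchmark harness.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a passing sample.
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 samples drawn for one problem, 3 of them pass.
for k in (1, 5, 10):
    print(k, pass_at_k(n=10, c=3, k=k))
```

The benchmark-level score is the mean of this value over all problems. For k = 1 the estimator reduces to the plain fraction of passing samples, c / n.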
Other Key Concepts
- Universal Representation: Features transferable to multiple tasks.
- Low-Data Regime: Training with very limited data.
- Length Generalization: Performance on sequences longer than training examples.
- Co-Occurrence Prompts: Analyses token co-occurrence patterns in generated text.
- Temperature: Sampling parameter controlling randomness.
- Top-k Sampling: Chooses from top-k tokens at each step.
- POS Tagger: Identifies parts of speech for tokens.
- Toxicity Probability of the Prompt (TPP): Measures input prompt toxicity.
- Toxicity Probability of the Continuation (TPC): Measures toxicity in model outputs.
- Perspective API: Tool for assigning toxicity probabilities to text.
- Toxicity Degeneration: Unwanted toxic text generation.
- In-Context Learning: Learning from examples in the prompt without weight updates.
- Data Contamination: Evaluation data appearing in training sets.
- Calibration Curve: Plots predicted confidence vs. actual accuracy.
- Substring Match: Detects overlap between evaluation and training data.
- LLM Prompt: Instruction or example text for LLMs.
- Self-Reflection Iterations: Iteration count for reflection-based entity detection.
- Chunk Size: Affects detected entity count.
- Entity Extraction: Identifies named entities and attributes.
- Relationship Extraction: Identifies relationships between entities.
- Leiden Algorithm: Detects communities in graph data.
- Entity Nodes: Graph nodes representing entities.
- Graph Communities: Groups of related entities in a graph.
- Hierarchical Clustering: Reveals internal community structure.
- RLAIF vs. RLHF: Compares AI-feedback and human-feedback reinforcement learning.
- Position Bias: Preference for specific positions in pairwise comparisons.
- Pairwise Accuracy: Reward model accuracy on held-out human preferences.
- ULMFiT: Universal fine-tuning method for text classification.
- Model Architectures: Fundamental model design types (e.g., Transformer and MoE).
- Reasoning Capability: Ability to perform logical reasoning and problem solving.
- External Knowledge: Use of information outside training data.
- High-Quality Data: Critical for improving performance and alignment.
- Human/AI Feedback: Mechanism for improving performance and alignment.
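The Temperature and Top-k Sampling entries above interact at decoding time: logits are divided by the temperature, optionally truncated to the k most likely tokens, and then sampled. The following is a minimal sketch under those assumptions. The function name `sample_next_token` and the toy logits are hypothetical, and real decoders typically operate on tensors and add further options.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from logits using temperature scaling and optional top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [l / temperature for l in logits]
    indices = list(range(len(scaled)))
    if top_k is not None:
        # Keep only the k highest-scoring candidates.
        indices = sorted(indices, key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in indices)
    weights = [math.exp(scaled[i] - m) for i in indices]  # unnormalized softmax
    return rng.choices(indices, weights=weights, k=1)[0]

# Toy usage with hypothetical logits over a five-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
print([sample_next_token(logits, temperature=0.7, top_k=3, rng=rng) for _ in range(10)])
print(sample_next_token(logits, top_k=1))  # top_k=1 behaves like the greedy decoder
```

Setting `top_k=1` always returns the highest-probability token, which corresponds to the Greedy Decoder entry in the training section.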

