From Sequence Modeling to Scalable Language Systems
This series is structured intentionally. Part 1 establishes the minimum shared technical vocabulary required to reason accurately about modern large language models. Parts 2 through 4 will progressively shift from architectural mechanics to scaling theory, system design, and finally alignment, limits, and frontier research.
This first section is deliberately accessible, but not simplistic. It assumes familiarity with basic machine learning concepts and aims to remove common misconceptions that persist even among technically experienced audiences.
Language Modeling Before Transformers
At their core, large language models are probabilistic systems trained to predict the next token in a sequence. This framing predates modern deep learning and remains unchanged, even as architectures have evolved.
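The next-token framing can be made concrete with a toy example. The corpus, sizes, and helper below are invented for illustration (real models use neural networks over subword tokens, not word-level counts), but the objective is the same: estimate a probability distribution over the next token given what came before.

```python
from collections import Counter, defaultdict

# Toy corpus; here a "token" is a whitespace-split word for simplicity.
corpus = "the cat sat on the mat the cat ran".split()

# Count bigram transitions: counts[current][next]
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def next_token_probs(token):
    """P(next | token), estimated from normalized bigram counts."""
    total = sum(counts[token].values())
    return {w: c / total for w, c in counts[token].items()}

# "the" is followed by "cat" twice and "mat" once in the corpus,
# so the model assigns P(cat|the) = 2/3 and P(mat|the) = 1/3.
print(next_token_probs("the"))
```

A neural language model replaces the count table with a parameterized function, but the output is still a distribution over the next token.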
Early neural language models relied on recurrent neural networks (RNNs), which process sequences one element at a time while maintaining a hidden state intended to summarize prior context. Variants such as LSTMs and GRUs introduced gating mechanisms that significantly mitigated vanishing gradient issues and allowed models to retain information across longer spans.
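The recurrence itself is mechanically simple. The following sketch of a vanilla RNN forward pass (toy dimensions and random weights, chosen only for illustration; LSTMs and GRUs add gating on top of this skeleton) makes the hidden-state dependency explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                               # toy sizes, arbitrary
W = rng.normal(scale=0.1, size=(d_h, d_in))    # input-to-hidden weights
U = rng.normal(scale=0.1, size=(d_h, d_h))     # hidden-to-hidden weights

def rnn_forward(xs):
    """Vanilla RNN: h_t = tanh(W x_t + U h_{t-1}).
    Each step consumes the previous hidden state, so the loop
    cannot be parallelized across time steps."""
    h = np.zeros(d_h)
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(10, d_in))   # a sequence of 10 input vectors
hs = rnn_forward(xs)
print(hs.shape)                    # one hidden state per time step: (10, 8)
```

Note the `for` loop: every hidden state waits on the one before it. That sequential chain is exactly the bottleneck discussed below.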
By the mid-2010s, gradient instability was no longer the dominant limitation for sequence models. The real constraint was architectural.
Recurrent models are inherently sequential. Each token depends on the previous hidden state, which prevents parallel computation across time steps. This made training slow, inefficient on GPUs, and difficult to scale. Even when gradients flowed, hardware utilization did not.
This bottleneck, not broken learning dynamics, set the stage for the next architectural shift.
The Transformer's Actual Breakthrough
The Transformer architecture, introduced in Attention Is All You Need (Vaswani et al., 2017), was not designed primarily to fix gradient problems. Its defining innovation was the removal of recurrence entirely.
Instead of processing tokens sequentially, Transformers process entire sequences in parallel using self-attention. Every token attends to every other token in the sequence simultaneously, allowing the model to construct contextual representations without a temporal dependency chain.
This design aligned exceptionally well with GPU and TPU hardware. Training throughput increased dramatically, and scaling became practical: with no sequential dependency between time steps, sequence length no longer imposed a proportional slowdown in wall-clock training time.
Key Insight: Attention was not just a modeling improvement. It was a systems-level optimization.
What Self-Attention Actually Does
Self-attention computes relationships between tokens by projecting inputs into queries, keys, and values. Each token's representation becomes a weighted combination of all other tokens, with weights determined by the learned similarity between query and key projections.
This mechanism allows:
- Long-range dependencies to be modeled directly
- Context to be recomputed dynamically at each layer
- Representations to evolve hierarchically through depth
The cost is quadratic complexity. Both memory and compute scale as O(n²) with respect to sequence length, where n is the number of tokens. This trade-off is manageable for moderate contexts and becomes a central concern at scale, which later parts of the series will address.
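The mechanism above fits in a few lines. This is a minimal single-head sketch with made-up dimensions and random projection matrices, not a production implementation; the `(n, n)` score matrix is the source of the quadratic cost:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (n, d) token embeddings; Wq/Wk/Wv: (d, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n): the O(n^2) attention matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                  # each row: weighted mix of all values

rng = np.random.default_rng(0)
n, d, d_k = 6, 16, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)   # (6, 8): one contextualized vector per token
```

Unlike the RNN loop, nothing here depends on the previous time step: every token's output is computed from the same batched matrix multiplications.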
Positional Information Is Still Required
Because Transformers remove recurrence, they have no inherent notion of token order. Positional information must be injected explicitly.
The original Transformer used fixed sinusoidal positional encodings, chosen for their ability to generalize to unseen sequence lengths. Many modern models still rely on sinusoidal or learned absolute embeddings.
Other approaches, such as Rotary Position Embeddings (RoPE) and ALiBi, encode relative position information directly into the attention mechanism. These methods offer advantages for long-context generalization but are not universally adopted.
There is no single standard. Positional encoding remains an active design choice with measurable downstream effects.
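The original sinusoidal scheme is easy to reproduce. The sketch below follows the formulas in Vaswani et al. (2017), with toy sizes chosen for illustration and an even `d_model` assumed:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Fixed sinusoidal encodings (Vaswani et al., 2017):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even."""
    pos = np.arange(n_positions)[:, None]            # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]            # (1, d/2)
    angles = pos / np.power(10000.0, i / d_model)    # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(128, 64)
print(pe.shape)   # (128, 64)
# Position 0 encodes as sin(0) = 0 on even dims, cos(0) = 1 on odd dims.
```

Because the functions are fixed rather than learned, encodings for positions beyond those seen in training can still be computed, which motivated the original design choice.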
Tokenization: The Layer Everyone Forgets
Before any model processes text, that text is converted into tokens. Tokenization defines the model's atomic units of meaning and has far-reaching consequences.
Most modern LLMs use subword tokenization schemes such as:
- Byte Pair Encoding (BPE)
- WordPiece
- SentencePiece
These methods balance vocabulary size against sequence length. They enable open-vocabulary modeling but introduce artifacts. Token boundaries do not always align with semantic boundaries. Languages with rich morphology or non-Latin scripts behave differently under the same tokenizer.
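The core of BPE is a simple greedy loop: repeatedly find the most frequent adjacent symbol pair in the corpus and merge it into a new symbol. The toy corpus and frequencies below are invented, and real tokenizers add byte-level fallbacks, normalization, and much larger vocabularies:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a {symbols: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word frequencies, each word as a tuple of characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):                  # learn three merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
print(words)   # frequent fragments like "we", "wer", "lo" have been merged
```

Each learned merge trades a larger vocabulary for shorter sequences, which is exactly the balance described above.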
Tokenization affects:
- Multilingual performance
- Prompt sensitivity
- Inference cost
- Error modes in reasoning tasks
Important: Despite this, tokenization is often treated as a preprocessing detail rather than a core design decision. That is a mistake.
Why Depth Matters More Than You Think
Transformers stack multiple attention and feedforward layers. Interpretability studies suggest that early layers tend to capture syntactic patterns, middle layers encode semantic relationships, and deeper layers integrate task-specific abstractions.
This layered structure enables compositional behavior without explicit symbolic rules. It also explains why shallow models plateau quickly, while deeper models exhibit qualitatively different capabilities.
Depth is not just capacity. It is representational hierarchy.
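The stacking itself is mechanically simple. This is a hedged sketch with toy dimensions and random weights, and it keeps only the residual feedforward part of each block (attention is omitted for brevity); the point is that each layer refines the previous representation rather than recomputing it from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # toy model width

def make_block():
    """One residual feedforward block (attention omitted for brevity)."""
    W1 = rng.normal(scale=0.1, size=(d, 4 * d))   # expand
    W2 = rng.normal(scale=0.1, size=(4 * d, d))   # project back
    def block(x):
        h = np.maximum(x @ W1, 0.0)   # ReLU feedforward expansion
        return x + h @ W2             # residual: refine, don't replace
    return block

# Depth = number of stacked blocks applied in sequence.
layers = [make_block() for _ in range(6)]
x = rng.normal(size=(10, d))          # 10 token representations
for layer in layers:
    x = layer(x)
print(x.shape)   # (10, 16): same shape, progressively transformed
```

The residual connections are what let representations evolve gradually through depth instead of being overwritten at each layer.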
Setting Expectations for the Rest of the Series
At this stage, it is important to be clear about what large language models are and are not.
They are not databases. They do not "store facts" in a retrievable way. They are statistical systems that compress patterns from data into parameter space. Their strengths and weaknesses follow directly from that reality.
- Part 1 establishes architectural fundamentals.
- Part 2 will examine scaling laws, optimization, and why size changes behavior.
- Part 3 will move beyond models into systems: retrieval, deployment, and distributed inference.
- Part 4 will confront alignment, safety, emergent capabilities, and hard limits.
Each part builds on the last. Skipping foundations leads to incorrect conclusions later.
Part 1 Glossary
Foundations and Core Architecture
Language Model
A probabilistic model that predicts the next token in a sequence based on prior tokens.
Token
The basic unit processed by an LLM. Tokens are usually subwords or character sequences produced by a tokenizer, not full words.
Tokenization
The process of converting raw text into tokens. Common approaches include BPE, WordPiece, and SentencePiece.
Recurrent Neural Network (RNN)
A neural network architecture that processes sequences one element at a time while maintaining a hidden state.
LSTM / GRU
RNN variants that use gating mechanisms to preserve information over longer sequences and reduce vanishing gradients.
Parallelization
The ability to process multiple inputs simultaneously. Transformers enable parallelization across tokens, unlike RNNs.
Transformer
A neural architecture based on self-attention that processes entire sequences in parallel.
Self-Attention
A mechanism where each token attends to all other tokens in a sequence to build contextual representations.
Query / Key / Value (QKV)
Linear projections of token embeddings used to compute attention weights and contextualized outputs.
Attention Matrix
An intermediate structure representing how strongly each token attends to every other token.
Quadratic Complexity (O(n²))
Describes how attention compute and memory scale with sequence length.
Positional Encoding
A method for injecting token order information into Transformers.
Sinusoidal Positional Encoding
Fixed mathematical functions used to represent position information.
Rotary Position Embedding (RoPE)
A relative positional encoding method that rotates embeddings based on position.