Transformer
Topic: Machine Learning · Last Updated: 2026-02-15
A transformer uses self-attention to model global dependencies in a sequence without recurrent state transitions.
Intuition
Each token dynamically weighs all other relevant tokens to build a contextual representation in one forward pass.
Formal
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; dividing by sqrt(d_k) keeps the dot products in a range where the softmax is not saturated.
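The formula above can be sketched directly in NumPy; this is a minimal single-head version (no batching, no masking), with function names chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # one score per (query, key) pair
    weights = softmax(scores, axis=-1)   # rows sum to 1
    return weights @ V                   # weighted average of values
```

One sanity check: if all keys are identical, every value receives the same weight and the output is the plain mean of the value rows.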
Example
In translation, generating each target token can attend directly to multiple source positions, rather than relying on a single recurrent hidden state that must compress the whole source sentence.
Pitfalls
- Attention weights are not direct causal explanations.
- Compute and memory scale quadratically with sequence length.
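The quadratic cost in the second pitfall follows from the shape of the score matrix: softmax(QK^T) holds one weight per (query, key) pair. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def score_matrix_size(n, d_k=8):
    # Toy token representations of length-n sequence.
    X = np.random.randn(n, d_k)
    # The attention score matrix is (n, n): n * n entries.
    scores = X @ X.T / np.sqrt(d_k)
    return scores.size
```

Doubling the sequence length quadruples the number of score entries, which is why long-context variants replace or sparsify this matrix.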