Transformer

Topic: Machine Learning · Last Updated: 2026-02-15

TL;DR

A transformer uses self-attention to model global dependencies in a sequence without recurrent state transitions.

Intuition

Each token dynamically weighs all other relevant tokens to build a contextual representation in a single forward pass.

Formal

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
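The formula above can be sketched directly in NumPy. This is a minimal single-head illustration (no batching, masking, or learned projections); the shapes and the max-subtraction trick for a stable softmax are implementation choices, not part of the formula itself.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # (n_q, d_v) weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key tokens
V = rng.normal(size=(6, 8))   # one value vector per key
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one contextual vector per query token
```

Note that every query attends to every key in one matrix product, which is what "global dependencies in one forward pass" means concretely.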

Example

In translation, target token generation can reference multiple source positions rather than only local hidden state memory.

Pitfalls

  • Attention weights are not direct causal explanations.
  • Compute and memory scale quadratically with sequence length.
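The quadratic pitfall is easy to see from the shapes: the attention-weight matrix for a sequence of length n has n × n entries, so doubling the sequence length quadruples its memory footprint. A small back-of-the-envelope sketch (assuming float32 weights, ignoring activations and multiple heads):

```python
for n in (256, 512, 1024):
    # the (n, n) attention-weight matrix alone, stored as 4-byte floats
    weight_bytes = n * n * 4
    print(f"seq len {n}: {weight_bytes / 1024:.0f} KiB")
# each doubling of n multiplies the weight matrix by 4
```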