Transformer

Topic: Machine Learning · Last Updated: 2026-02-15

TL;DR

A transformer uses self-attention to model global dependencies in a sequence without recurrent state transitions.

Intuition

Each token dynamically weighs all other relevant tokens to build a contextual representation in a single forward pass.

Formal

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
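The formula above can be sketched directly in NumPy. This is a minimal single-head illustration (no batching, masking, or learned projections); the shapes and the max-subtraction trick for a stable softmax are implementation choices, not part of the formula itself.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # (n_q, d_v) weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key tokens
V = rng.normal(size=(6, 8))   # one value vector per key
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one contextual vector per query token
```

Note that every query attends to every key in one matrix product, which is what "global dependencies in one forward pass" means concretely.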

Example

In translation, target token generation can reference multiple source positions rather than only local hidden state memory.

Pitfalls

  • Attention weights are not direct causal explanations.
  • Compute and memory scale quadratically with sequence length.
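The quadratic pitfall is easy to see from the shapes: the attention-weight matrix for a sequence of length n has n × n entries, so doubling the sequence length quadruples its memory footprint. A small back-of-the-envelope sketch (assuming float32 weights, ignoring activations and multiple heads):

```python
for n in (256, 512, 1024):
    # the (n, n) attention-weight matrix alone, stored as 4-byte floats
    weight_bytes = n * n * 4
    print(f"seq len {n}: {weight_bytes / 1024:.0f} KiB")
# each doubling of n multiplies the weight matrix by 4
```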