Original words are converted into tokens, each with an input ID (the token’s index in the vocabulary), then mapped to an embedding vector (e.g. of size 512)
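A minimal sketch of this lookup, assuming a toy vocabulary and the d_model = 512 used in the original Transformer (the vocabulary and weights here are illustrative, not a real tokenizer):

```python
import numpy as np

# Toy vocabulary: token string -> input ID (its index in the vocabulary).
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 512  # embedding size, as in the original Transformer

# Embedding matrix: one row of size d_model per vocabulary entry.
# Random here; in a real model these weights are learned.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))

sentence = ["the", "cat", "sat"]
input_ids = [vocab[tok] for tok in sentence]   # [0, 1, 2]
embeddings = embedding_matrix[input_ids]       # shape (3, 512)
print(embeddings.shape)
```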
Positional encoding — goal: each word should carry some info about its position in the sentence, so that words can be treated as “close” or “distant” to each other.
Why trig functions (sin and cos)?
Trig functions naturally represent a continuous, periodic pattern that the model can recognize → relative positions are easier for the model to pick up (see the sketch below)
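A sketch of the sinusoidal encoding from the original Transformer paper, where even dimensions use sin and odd dimensions use cos, with wavelengths that grow geometrically with the dimension index:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Added element-wise to the token embeddings before the first attention layer.
pe = sinusoidal_positional_encoding(seq_len=3, d_model=512)
print(pe.shape)  # (3, 512)
```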
Self-Attention: allows the model to relate words to each other. The output of the attention layer captures each word’s meaning (embedding), position (positional encoding), and its interaction with the other words.
$$ \text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$
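A direct sketch of the formula above for a single head, with a numerically stable softmax; the toy shapes and random inputs are only for illustration (in a real model Q, K, V come from learned linear projections of the embedding + positional encoding inputs):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len): query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 3 tokens, d_k = d_v = 4, using the same matrix as Q, K, and V.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = attention(x, x, x)
print(out.shape)  # (3, 4)
```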