0 - Introduction

0.1 - Problems w/ RNN

  1. Slow computation for long sequences
  2. Vanishing / exploding gradients: gradients are multiplied across many time steps, so repeatedly multiplying very small or very large numbers drives them toward zero or infinity (plus numerical-precision issues) — see the sketch after this list
  3. Difficulty accessing information from far back in the sequence
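A quick numeric illustration (my own addition, not from the original notes) of why repeated multiplication through many time steps makes gradients vanish or explode:

```python
# Repeatedly multiplying by a factor < 1 or > 1, as happens when a gradient
# is back-propagated through many RNN time steps.
factor_small, factor_large = 0.9, 1.1
grad_small, grad_large = 1.0, 1.0
for _ in range(100):
    grad_small *= factor_small   # shrinks toward 0 -> vanishing gradient
    grad_large *= factor_large   # grows without bound -> exploding gradient

print(grad_small)  # ~2.7e-05
print(grad_large)  # ~1.4e+04
```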


0.2 - Input Embedding

Original words are converted into tokens; each token is mapped to an input ID (its position in the vocabulary) and then turned into an embedding vector (e.g. of size 512, as in the original Transformer).
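A minimal PyTorch sketch of this step (the vocabulary size and token IDs below are made-up placeholder values; d_model = 512 as in the original Transformer):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512            # placeholder vocabulary size, embedding size
embedding = nn.Embedding(vocab_size, d_model)

# A 4-token sentence as input IDs (positions in the vocabulary); the IDs are illustrative
token_ids = torch.tensor([[105, 6587, 11, 2390]])   # shape: (batch, seq_len)
embedded = embedding(token_ids)                      # shape: (batch, seq_len, 512)
print(embedded.shape)                                # torch.Size([1, 4, 512])
```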


0.3 - Positional Encoding

Goal: have each word carry some information about its position in the sentence, so that words can be treated as “close” or “distant” to each other.

Why trigonometric functions (sin and cos)?

Trigonometric functions naturally represent a continuous, periodic pattern that the model can recognize → relative positions are easier for the model to pick up.
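A sketch of the sinusoidal positional encoding from the original Transformer paper, assuming that is the scheme these notes follow (even embedding dimensions use sin, odd dimensions use cos):

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)        # (seq_len, 1)
    # Frequencies decrease geometrically across the embedding dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cos
    return pe                                      # added to the input embeddings

pe = positional_encoding(seq_len=4, d_model=512)
print(pe.shape)  # torch.Size([4, 512])
```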


1 - Encoder

1.1 - Multi-head Attention

Self-Attention: allows the model to relate words to each other. The output attention matrix captures each token’s meaning (from its embedding), its position (from the positional encoding), and its interaction with the other words.

$$ \text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$
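A minimal sketch translating the formula directly into PyTorch (single head, no masking; the tensor shapes are illustrative):

```python
import math
import torch

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = scores.softmax(dim=-1)                     # each row sums to 1
    return weights @ v                                   # (batch, seq_len, d_k)

# Self-attention: Q, K, V all come from the same input sequence
q = k = v = torch.randn(1, 4, 64)
out = attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 64])
```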