BERT-base Transformer forward pass

You can download a PDF version of the following text by clicking here.

Initialize

$W_T \in \mathbb{R}^{\text{vocab size} \times d} = \mathbb{R}^{\text{vocab size} \times 768}$ … token embeddings

$W_P \in \mathbb{R}^{\text{max input length} \times d} = \mathbb{R}^{512 \times 768}$ … positional embeddings

$h \in \{1,\dots,n_{heads}\}, \quad n_{heads}=12$

$l \in \{1,\dots,n_{layers}\}, \quad n_{layers}=12$

$W_{h,l}^Q \in \mathbb{R}^{d \times d_q} = \mathbb{R}^{768 \times 64}$ … query weight matrices

$W_{h,l}^K \in \mathbb{R}^{d \times d_k} = \mathbb{R}^{768 \times 64}$ … key weight matrices
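To make the shapes concrete, here is a minimal NumPy sketch that allocates the matrices listed above. It is an illustration and not the actual BERT initialization code. The function name `init_params`, the dictionary keys, the small `vocab_size=1000` used in the demo, and the normal initialization with standard deviation 0.02 are all assumptions made for this example. The code also uses 0-based indices for layers and heads, whereas the text counts $l$ and $h$ from 1.

```python
import numpy as np


def init_params(vocab_size, max_len=512, d=768, n_heads=12, n_layers=12, seed=0):
    """Allocate randomly initialized matrices with the shapes listed in the text.

    Hypothetical helper: the values are random, not trained BERT weights.
    """
    rng = np.random.default_rng(seed)
    d_q = d_k = d // n_heads  # 768 / 12 = 64 per attention head

    def normal(*shape):
        # Assumed init scheme: zero-mean normal with std 0.02, stored as float32.
        return 0.02 * rng.standard_normal(shape, dtype=np.float32)

    return {
        "W_T": normal(vocab_size, d),              # token embeddings
        "W_P": normal(max_len, d),                 # positional embeddings
        "W_Q": normal(n_layers, n_heads, d, d_q),  # query weights, indexed [l, h]
        "W_K": normal(n_layers, n_heads, d, d_k),  # key weights, indexed [l, h]
    }


# Small placeholder vocabulary so the demo stays light; the real size
# depends on the tokenizer.
params = init_params(vocab_size=1000)
for name, w in params.items():
    print(name, w.shape)
```

Stacking the per-head, per-layer matrices into one array of shape `(n_layers, n_heads, d, d_q)` is a convenience of this sketch; `params["W_Q"][l, h]` then plays the role of $W_{h,l}^Q$ with both indices shifted down by one.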

NLP’s generalization problem, and how researchers are tackling it

Check out the post on The Gradient.