Axial positional encodings factorize the positional embedding matrix \(X \in \mathbb{R}^{d \times n_s}\), with entries \(X_{i,j}\), into two smaller matrices:
$$X^{1}_{i,j}, \text{ with } i \in \left[1,\ldots, d^1\right] \text{ and } j \in \left[1,\ldots, n_s^1\right]$$
and
$$X^{2}_{i,j}, \text{ with } i \in \left[1,\ldots, d^2\right] \text{ and } j \in \left[1,\ldots, n_s^2\right]$$
with:
$$d = d^1 + d^2 \text{ and } n_s = n_s^1 \times n_s^2 .$$
Therefore the following holds:
$$X_{i,j} = \begin{cases}
X^{1}_{i, k}, & \text{if }\ i < d^1 \text{ with } k = j \mod n_s^1 \\
X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor
\end{cases}$$
Intuitively, this means that a position embedding vector \(x_j \in \mathbb{R}^{d}\) is now the composition of two
factorized embedding vectors: \(x^1_{k}\), the \(k\)-th column of \(X^1\), stacked on \(x^2_{l}\), the \(l\)-th column of \(X^2\),
where the `config.max_embedding_size` dimension \(j\) is factorized into \(k \text{ and } l\).
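
A minimal NumPy sketch of this lookup (not the actual Transformers implementation; the sizes and the helper `axial_position_embedding` are illustrative, and positions are 0-indexed):

```python
import numpy as np

# Illustrative sizes only (hypothetical, not taken from any Reformer config):
# d = d1 + d2 is the embedding size, n_s = ns1 * ns2 the number of positions.
d1, d2 = 64, 192       # d = 256
ns1, ns2 = 128, 512    # n_s = 65536

rng = np.random.default_rng(0)
X1 = rng.normal(size=(d1, ns1))  # factor 1: d^1 x n_s^1
X2 = rng.normal(size=(d2, ns2))  # factor 2: d^2 x n_s^2


def axial_position_embedding(j: int) -> np.ndarray:
    """Return the d-dimensional embedding of 0-based position j < ns1 * ns2."""
    k = j % ns1   # column of X1
    l = j // ns1  # column of X2
    # Rows 0..d1-1 of X come from X1, rows d1..d-1 from X2 (the two cases above).
    return np.concatenate([X1[:, k], X2[:, l]])


print(axial_position_embedding(9).shape)  # (256,)
```

With these sizes only \(d^1 n_s^1 + d^2 n_s^2 \approx 1.1 \times 10^5\) values are stored instead of the \(d \times n_s \approx 1.7 \times 10^7\) entries of the full matrix \(X\), which is the point of the factorization.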