ALBERT uses repeating layers, which results in a small memory footprint. However, the computational cost remains
similar to that of a BERT-like architecture with the same number of hidden layers, because it still has to iterate
through the same number of (repeating) layers.
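
The trade-off can be illustrated with a rough back-of-the-envelope sketch. It assumes a toy model in which each layer is a single `hidden_size x hidden_size` weight matrix, which is a deliberate simplification and not ALBERT's actual layer structure. The function names and the sizes (768 hidden units, 12 layers) are hypothetical values chosen for illustration.

```python
# Toy model (assumption): one layer = one hidden_size x hidden_size weight matrix.
# Real transformer layers contain attention and feed-forward sublayers; this
# simplification only illustrates how sharing affects memory versus compute.


def parameter_count(hidden_size: int, num_layers: int, shared: bool) -> int:
    """Number of stored weights for a stack of layers."""
    per_layer = hidden_size * hidden_size
    # With sharing, one set of weights is stored and reused by every layer.
    return per_layer if shared else per_layer * num_layers


def multiply_adds_per_token(hidden_size: int, num_layers: int) -> int:
    """Forward-pass cost per token; independent of whether weights are shared."""
    # Each layer is still applied once, so the cost scales with num_layers.
    return hidden_size * hidden_size * num_layers


hidden_size, num_layers = 768, 12  # illustrative sizes, not taken from the source

bert_like_params = parameter_count(hidden_size, num_layers, shared=False)
albert_like_params = parameter_count(hidden_size, num_layers, shared=True)
cost = multiply_adds_per_token(hidden_size, num_layers)

print(f"unshared parameters: {bert_like_params}")
print(f"shared parameters:   {albert_like_params}")
print(f"multiply-adds per token (both cases): {cost}")
```

In this toy setting the shared stack stores `num_layers` times fewer weights, while the per-token compute is identical, since every layer is still executed once during the forward pass.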