Transformers
PyTorch
Safetensors
French
flash_t5
text2text-generation
custom_code
bourdoiscatie committed (verified)
Commit b216125 · 1 Parent(s): f97356d

Update README.md

Files changed (1): README.md (+8 -5)
README.md CHANGED
@@ -39,7 +39,7 @@ license: apache-2.0
  ## Introduction
 
  FAT5 (for Flash Attention T5) is an implementation of T5 in PyTorch with an [UL2](https://arxiv.org/abs/2205.05131) objective, optimized for GPGPU for both training and inference. It uses an experimental feature that combines [Flash Attention](https://arxiv.org/abs/2307.08691) (v2) with relative position encoding biases, which allows the model to be trained or finetuned on longer sequence lengths than the original T5. It also supports other positional embeddings such as [RoPE](https://arxiv.org/abs/2104.09864), [ALiBi](https://arxiv.org/abs/2108.12409) or [FIRE](https://arxiv.org/abs/2310.04418).
- This methodology enabled us to efficiently pretrain, as a proof of concept, a T5 with 147M parameters in French in a reasonable time (1,461h to see 419B tokens) and with limited resources (a single A100, i.e. a computational budget of around €2,200); you'll find its weights in this repo.
+ This methodology enabled us to efficiently pretrain, as a proof of concept, a T5 with 147M parameters in French in a reasonable time (1,461h to see 419B tokens) and with limited resources (a single A100, i.e. a computational budget of around €1,900); you'll find its weights in this repo.
  To achieve this, we designed CUDA/Triton kernels to make Flash Attention compatible with T5 and to provide linear inference, thus extending the context size that the model can take into account.
  Other optimizations have also been implemented, as detailed in a subsequent [blog post](https://huggingface.co/spaces/CATIE-AQ/FAT5-report).
 
@@ -143,11 +143,14 @@ Note that all benchmark graphs can be generated automatically using the followin
  ## Applications
 
  ### To French
- We've pretrained a small (147M parameters) FAT5-UL2 in French. This is the model you'll find in this Hugging Face repo.
+ We've pretrained a small (147M parameters) FAT5-UL2 in French as a proof of concept. This is the model you'll find in this Hugging Face repo.
  The dataset we used is a mixture of [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX), [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia), [justice_fr](https://huggingface.co/datasets/eckendoerffer/justice_fr) and [The Stack](https://huggingface.co/datasets/bigcode/the-stack-dedup).
  Our tokenizer of size 32,768 (8**5) is trained on CulturaX and The Stack.
- Our model is pre-trained on a sequence length of 1,024 tokens on a single A100 for 1,461h (= 419.4B tokens seen) at an estimated cost of €2,200.
-
+ Our model is pre-trained on a sequence length of 1,024 tokens on a single A100 for 1,461h (= 419.4B tokens seen) at an estimated cost of €1,900.
+
+ Not very useful in practice: this is a PoC, not an instruction-tuned model.
+ If you still want to use it, make sure you have the necessary [requirements](https://github.com/catie-aq/flashT5/blob/main/requirements.txt) installed; since the model is not integrated into `transformers`, it is a custom model and must be loaded with the `trust_remote_code=True` argument (see the loading sketch after this diff hunk).
+
  Carbon emissions for the pretraining of this model were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact:
  • Hardware Type: A100 PCIe 40/80GB
  • Hours used: 1,461h
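As a rough, hedged illustration of what the calculator does with these inputs: the estimate boils down to GPU power draw × runtime × grid carbon intensity. The power figures below are the A100 PCIe specs, and the carbon-intensity value is a placeholder, since the cloud provider and compute region entries fall outside this diff hunk.

```python
# Back-of-the-envelope sketch of the calculator's arithmetic, NOT the card's official figure.
# energy (kWh) = GPU power draw (kW) x hours; emissions (kg CO2eq) = energy x carbon intensity.
hours_used = 1_461                  # "Hours used" bullet above
gpu_power_kw = 0.300                # A100 PCIe 80GB TDP; the 40GB variant draws 0.250 kW
carbon_intensity_kg_per_kwh = 0.05  # PLACEHOLDER -- depends on the provider/region, not shown in this hunk

energy_kwh = hours_used * gpu_power_kw
emissions_kg = energy_kwh * carbon_intensity_kg_per_kwh
print(f"~{energy_kwh:.0f} kWh drawn, ~{emissions_kg:.0f} kg CO2eq with the placeholder intensity")
```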
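Since the README's own usage snippet falls outside this diff, here is a minimal loading sketch for the French PoC checkpoint. The repo id is a placeholder, and the use of `AutoModelForSeq2SeqLM`/`AutoTokenizer` is an assumption about how the custom `flash_t5` code registers itself; only the `trust_remote_code=True` requirement comes from the text above.

```python
# Minimal loading sketch -- a guess at typical usage, not the repo's official snippet.
# ASSUMPTIONS: the repo id below is a placeholder; the custom flash_t5 code is assumed to
# register with the standard Auto* classes; the custom kernels may require a CUDA GPU
# (see the requirements file linked above).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo_id = "<this-repo-id>"  # placeholder: the Hugging Face id of this model repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    repo_id,
    trust_remote_code=True,      # required: the modeling code ships with the repo, not with transformers
    torch_dtype=torch.bfloat16,  # the FAT5 weights are in BF16
)

# The checkpoint is a span-corruption (UL2) PoC, not an instruction-tuned model, so expect
# denoising-style completions; the <extra_id_0> sentinel follows the usual T5 convention
# and is itself an assumption about this tokenizer.
inputs = tokenizer("Le FAT5 est un modèle de langue <extra_id_0> en français.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```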
@@ -163,7 +166,7 @@ Our contribution focuses on French, with the pre-training and finetuning of mode
  Nevertheless, to ensure that it can be used in other languages, we have developed [code](https://github.com/catie-aq/flashT5/blob/main/convert_huggingface_t5.py) for adapting already pre-trained (m)T5/FLAN-T5 weights to our method. In this way, we hope users of a specific language will be able to efficiently continue pre-training one of these models to adapt it to more recent data, for example.
  Note, however, that this adaptation is limited, since the additional pre-training will have to be carried out within the precision of the original model. For example, if the model's weights are in FP32 (which is the case with the FLAN-T5), training will not be as fast as with the FAT5, which is in BF16.
 
- For English speakers, we have already adapted the weights of the various versions of the [FLANT-T5](https://arxiv.org/abs/2210.11416) to our method. All weights can be found in this Hugging Face [collection](https://huggingface.co/collections/CATIE-AQ/catie-english-fat5-flan-662b679a8e855c7c0137d69e).
+ For English speakers, we have already adapted the weights of the various versions of the [FLAN-T5](https://arxiv.org/abs/2210.11416) to our method. All weights can be found in this Hugging Face [collection](https://huggingface.co/collections/CATIE-AQ/catie-english-fat5-flan-662b679a8e855c7c0137d69e).
  To use one of the models, simply run:
 
  ```
 