Model Card: Spanish Text Generation with ByT5-Small

This model is a text generation model fine-tuned from ByT5-Small, designed to generate coherent and contextually relevant Spanish text based on input prompts. It is optimized for generating content chunks, making it suitable for applications such as content creation, automated writing assistance, and more.


Model Details

  • Model Name: trained-byt5-small
  • Architecture: ByT5-Small (a byte-level variant of T5)
  • Language: Spanish
  • Task: Text Generation
    • Given a prompt, the model generates a textual response that continues or complements the input.

Intended Use and Applications

  1. Content Creation: Assist writers by generating content based on given prompts, helping to overcome writer's block or to expand on ideas.
  2. Automated Writing Assistance: Provide suggestions or continuations in writing applications, such as blogs, articles, or reports.
  3. Chatbots and Conversational Agents: Enhance conversational AI systems by generating more natural and contextually appropriate responses in Spanish.
  4. Educational Tools: Aid in language learning by generating example sentences, explanations, or extended content based on user inputs.
  5. Creative Writing: Support creative processes by offering story continuations, character developments, or plot ideas.

How It Was Trained

1. Data Source

  • Database: Data was sourced from an internal SQL Server database containing:
    • Prompts (input_text): User queries or initial text snippets.
    • Content (output_text): Corresponding generated or relevant text passages with a high relevance rank (rank > 4).
  • Data Selection: The top 5,000 (prompt, content) pairs were selected where both the prompt and the content are non-empty and the relevance rank is greater than 4, ensuring high-quality training data (an illustrative selection query is sketched below).
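
The exact database schema and connection details are internal and not published; the snippet below is only an illustrative sketch of the selection step, with a hypothetical table name (prompt_content_pairs) and a hypothetical connection string.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical SQL Server connection string; replace with real credentials.
engine = create_engine(
    "mssql+pyodbc://user:password@server/database?driver=ODBC+Driver+17+for+SQL+Server"
)

# Select the top 5,000 (prompt, content) pairs where both fields are non-empty
# and the relevance rank is greater than 4, as described above.
query = """
SELECT TOP 5000 input_text, output_text
FROM prompt_content_pairs           -- hypothetical table name
WHERE LEN(input_text) > 0
  AND LEN(output_text) > 0
  AND [rank] > 4
ORDER BY [rank] DESC
"""

pairs = pd.read_sql(query, engine)
print(f"Loaded {len(pairs)} (prompt, content) pairs")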

2. Preprocessing

  • Text Splitting:
    • Long output_text entries were split into chunks of up to 512 characters to manage model input size and to enhance training efficiency.
  • Tokenization:
    • Utilized the ByT5Tokenizer for byte-level tokenization, which handles diverse Spanish text (including accents and punctuation) without relying on a fixed subword vocabulary (see the preprocessing sketch after this list).
    • Configured with:
      • max_length = 512
      • doc_stride = 256 (for handling long texts with overlapping contexts)
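
A minimal preprocessing sketch under the settings above (512-character chunks with a 256-character stride). The exact chunking and tokenization code used for training is not published, and the example pair at the end is hypothetical.

from transformers import ByT5Tokenizer

MAX_LENGTH = 512
DOC_STRIDE = 256

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")

def split_into_chunks(text, chunk_size=MAX_LENGTH, stride=DOC_STRIDE):
    """Split long text into overlapping chunks of up to `chunk_size` characters."""
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

def preprocess(prompt, content):
    """Tokenize one (prompt, content) pair into input/label examples."""
    examples = []
    for chunk in split_into_chunks(content):
        model_inputs = tokenizer(prompt, max_length=MAX_LENGTH, truncation=True)
        labels = tokenizer(chunk, max_length=MAX_LENGTH, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        examples.append(model_inputs)
    return examples

# Hypothetical example pair, for illustration only.
examples = preprocess(
    "¿Qué es la fotosíntesis?",
    "La fotosíntesis es el proceso mediante el cual las plantas convierten la luz solar en energía química. " * 10,
)
print(f"Produced {len(examples)} training examples")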

3. Training Setup

  • Base Model: google/byt5-small
  • Framework: PyTorch with Hugging Face Transformers
  • Loss Function: Cross Entropy Loss (torch.nn.CrossEntropyLoss) to train the model to predict the next tokens in the sequence.
  • Optimizer: AdamW with a learning rate of 5e-5 and weight decay of 0.01
  • Batch Size:
    • Training: 2 per device
    • Evaluation: 4 per device
  • Epochs: 3
  • Gradient Accumulation: 1 (simplified for stable training)
  • Mixed Precision: Disabled (fp16 = False) to prevent issues with NaNs during training.
  • Gradient Checkpointing: Enabled to optimize memory usage.
  • Early Stopping: Implemented with a patience of 2 epochs to prevent overfitting.
  • Hardware: Trained on a GPU when available; otherwise, on CPU. (A configuration sketch consistent with these settings follows below.)
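
The following sketch shows a training configuration consistent with the hyperparameters listed above, using the Hugging Face Seq2SeqTrainer. Here train_dataset and val_dataset are placeholders for the tokenized splits described in the next section, and the original training script may differ in detail.

from transformers import (
    T5ForConditionalGeneration,
    ByT5Tokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    EarlyStoppingCallback,
)

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

training_args = Seq2SeqTrainingArguments(
    output_dir="./trained-byt5-small",
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    gradient_accumulation_steps=1,
    fp16=False,                       # mixed precision disabled to avoid NaN losses
    gradient_checkpointing=True,      # trade extra compute for lower memory use
    eval_strategy="epoch",            # "evaluation_strategy" on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # placeholder: tokenized training split
    eval_dataset=val_dataset,         # placeholder: tokenized validation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

trainer.train()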

4. Data Splits

  • Training Set: 80% of the data
  • Validation Set: half of the remaining 20%
  • Test Set: the other half of the remaining 20%, resulting in an overall split (sketched below) of:
    • Training: 80%
    • Validation: 10%
    • Test: 10%
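
A minimal sketch of this split, assuming the (prompt, content) pairs are held in a list or pandas DataFrame named pairs; the fixed random_state is only for reproducibility of the sketch.

from sklearn.model_selection import train_test_split

# 80% training, 20% held out
train_pairs, holdout_pairs = train_test_split(pairs, test_size=0.2, random_state=42)

# Split the held-out 20% evenly into validation and test (10% each overall)
val_pairs, test_pairs = train_test_split(holdout_pairs, test_size=0.5, random_state=42)

print(len(train_pairs), len(val_pairs), len(test_pairs))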

Model Performance

  • Training Metrics:
    • Loss: Monitored using Cross Entropy Loss on both training and validation sets.
    • Early Stopping: Training halted if the validation loss did not improve for 2 consecutive evaluation steps.
  • Final Evaluation:
    • Test Set Loss: Logged as test_loss in the training logs.
    • Performance Notes: Specific numerical results depend on the data distribution and the training process. Users are encouraged to evaluate the model on their own datasets to gauge performance in their specific applications (a minimal evaluation sketch follows).
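
A minimal evaluation sketch, reusing the trainer from the training setup above with a tokenized test_dataset placeholder. The test_loss key mirrors the value logged during training; the perplexity shown is simply exp(loss) and is not a metric reported by the original training run.

import math

# Evaluate on the held-out test split; keys are prefixed with "test".
metrics = trainer.evaluate(eval_dataset=test_dataset, metric_key_prefix="test")
test_loss = metrics["test_loss"]
print(f"test_loss: {test_loss:.4f} (perplexity ≈ {math.exp(test_loss):.2f})")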

Usage Example

Below is a Python example demonstrating how to use the fine-tuned ByT5-Small model for text generation in Spanish. Ensure you have installed the necessary libraries (transformers, torch) and have the model saved in the ./trained-byt5-small directory.

import torch
from transformers import T5ForConditionalGeneration, ByT5Tokenizer

# Load the trained model and tokenizer
model_dir = "./trained-byt5-small"
tokenizer = ByT5Tokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)

# Move model to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

model.eval()

prompt = "¿Cómo implementar un sistema solar en una escuela primaria?"

# Tokenize the input text
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    max_length=512,
    truncation=True
).to(device)

# Generate outputs
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=512,
        num_return_sequences=1,
        do_sample=True,
        temperature=0.5,
        top_k=2000,
        top_p=0.95,
        repetition_penalty=1.2,
        early_stopping=True
    )

# Decode and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated Text: {generated_text}")

Output:

Generated Text: Para implementar un sistema solar en una escuela primaria, se puede comenzar por educar a los estudiantes sobre los planetas y sus características. Luego, se pueden realizar actividades prácticas como construir maquetas del sistema solar, organizar excursiones a planetarios o utilizar software educativo interactivo. Además, es importante fomentar la curiosidad y el interés de los alumnos mediante proyectos de investigación y presentaciones sobre diferentes aspectos del espacio.

Limitations and Ethical Considerations

  1. Bias and Fairness:

    • The model's outputs are influenced by the training data. If the data contains biases, the model may inadvertently reproduce them. Users should be cautious and review generated content for fairness and neutrality.
  2. Domain Specificity:

    • Trained on specific prompt-content pairs from an internal database, the model may perform best within similar contexts. Its performance might degrade when applied to highly specialized or unfamiliar domains.
  3. Quality and Reliability:

    • While the model aims to generate coherent and relevant text, it does not verify factual accuracy. Users should validate the generated content, especially in critical applications.
  4. Data Privacy:

    • Ensure that any data used with this model complies with relevant privacy laws and regulations. The training data should not contain sensitive or personal information unless appropriate consent has been obtained.
  5. Misuse Potential:

    • Like any generative model, it can be used to create misleading or harmful content. Implement safeguards to prevent and mitigate misuse.

Intended Users

  • Developers building Spanish-language content generation tools.
  • Content Creators seeking automated assistance in generating written material.
  • Researchers exploring text generation and natural language processing in Spanish.
  • Educators developing tools for language learning and educational content creation.
  • Businesses integrating conversational agents or chatbots that generate Spanish text.
