---
license: apache-2.0
base_model:
- unsloth/SmolLM2-1.7B-Instruct
pipeline_tag: text-generation
tags:
- text-to-image-evaluation
- faithfulness
- lora
- tifa
- unsloth
- flexible-structure
language: en
---
# SmolLM2-1.7B-Instruct-TIFA-Random
## Model Description
SmolLM2-1.7B-Instruct-TIFA-Random is a fine-tuned version of [unsloth/SmolLM2-1.7B-Instruct](https://huggingface.co/unsloth/SmolLM2-1.7B-Instruct) specifically trained for **TIFA (Text-to-Image Faithfulness Assessment)** with flexible question generation. Unlike previous structured versions, this model generates diverse, natural evaluation questions without rigid formatting constraints, making it more adaptable for various evaluation scenarios.
**Model Series**: [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) | [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA) | [1.7B-Structured](https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA) | **1.7B-Random**
## Key Innovation: Flexible Structure
This model represents a paradigm shift from rigid question structures to **flexible, natural question generation**:
- **Previous models**: Fixed Q1/Q2/Q3/Q4 structure with predetermined answer types
- **This model**: Dynamic question generation focusing on visual verification without structural constraints
- **Benefit**: More natural, diverse questions that better reflect real-world evaluation needs
## Intended Use
This model generates 4 visual verification questions for text-to-image evaluation, focusing on:
- **Colors, shapes, objects, materials** - Core visual elements
- **Spatial relationships** - Positioning and arrangement
- **Presence/absence verification** - What exists or doesn't exist
- **Mixed question types** - Both yes/no and multiple choice questions
- **Natural diversity** - Questions adapt to description content rather than following templates
## Model Details
- **Base Model**: unsloth/SmolLM2-1.7B-Instruct
- **Model Size**: 1.7B parameters
- **Fine-tuning Method**: LoRA fine-tuning on flexible-structure training data
- **Training Framework**: Transformers + TRL + PEFT + Unsloth
- **License**: apache-2.0
## Training Details
### Advanced Training Configuration
- **Training Method**: Supervised Fine-Tuning with category-balanced validation (see the configuration sketch after this list)
- **Enhanced LoRA Configuration**:
- r: 32
- lora_alpha: 64
- lora_dropout: 0.05
- Target modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
- **Optimized Training Parameters**:
- Epochs: 2
- Learning Rate: 5e-5
- Batch Size: 16
- Gradient Accumulation: 2 steps (effective batch size: 32)
- Max Sequence Length: 1024
- LR Scheduler: Cosine with 3% warmup
- Validation: Category-balanced evaluation every 250 steps
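For reference, here is a minimal sketch of this setup using PEFT and TRL. The hyperparameters are taken from the list above; the dataset contents, output path, and exact library versions are assumptions, and some argument names (noted in comments) differ across TRL/transformers releases:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder two-message dataset; the real 18k-example dataset is not published here.
train_dataset = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": 'Create 4 visual verification questions for this description: "a red bicycle"'},
        {"role": "assistant", "content": "Q1: What color is the bicycle?\nC: blue, red, green, black\nA: red"},
    ]},
])

# LoRA settings exactly as listed above.
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Training hyperparameters as listed above.
training_args = SFTConfig(
    output_dir="smollm2-tifa-random",    # placeholder path
    num_train_epochs=2,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,       # effective batch size: 32
    max_seq_length=1024,                 # renamed to `max_length` in newer TRL releases
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                   # 3% warmup
    eval_strategy="steps",               # `evaluation_strategy` in older transformers
    eval_steps=250,
)

trainer = SFTTrainer(
    model="unsloth/SmolLM2-1.7B-Instruct",
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,          # stand-in; the real run used a category-balanced split
    peft_config=peft_config,
)
trainer.train()
```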
### Enhanced Dataset
- **Size**: 18,000 examples
- **Structure**: Flexible question generation without rigid templates
- **Validation**: Category-balanced split ensuring robust evaluation (see the sketch after this list)
- **Coverage**: Diverse visual elements, materials, spatial relationships, and verification tasks
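One way to realize such a split is stratified sampling on a per-example category label. A minimal sketch follows; the `category` field name, the category values, and the split ratio are assumptions, since the dataset schema is not published:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the 18k examples; the "category" field name is an assumption.
categories = ["color", "shape", "spatial", "presence"]
examples = [
    {"description": f"example {i}", "category": categories[i % len(categories)]}
    for i in range(18_000)
]

# Stratifying on the category label keeps every category proportionally
# represented in the validation set.
train_set, val_set = train_test_split(
    examples,
    test_size=0.05,                                  # assumed split ratio
    stratify=[ex["category"] for ex in examples],
    random_state=42,
)
```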
## Usage
### Installation
```bash
pip install transformers torch accelerate
```
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model_path = "kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random"
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.float16,
trust_remote_code=True,
device_map="auto"
)
# Create pipeline
chat_pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
return_full_text=False,
)
def get_message(description):
system = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.
Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain
Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]
Generate questions that test visual faithfulness between the description and image."""
user_msg = f'Create 4 visual verification questions for this description: "{description}"'
return [
{"role": "system", "content": system},
{"role": "user", "content": user_msg}
]
# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
messages = get_message(description)
output = chat_pipe(
messages,
max_new_tokens=256,
do_sample=False,
)
print(output[0]["generated_text"])
```
### Example Outputs
**For "a lighthouse overlooking the ocean":**
```
Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse
Q2: What body of water is visible?
C: lake, river, ocean, pond
A: ocean
Q3: Is the lighthouse positioned above the water?
C: no, yes
A: yes
Q4: Are there any mountains in the scene?
C: no, yes
A: no
```
## Citation
```bibtex
@misc{smollm2-1-7b-it-tifa-random-2025,
title={SmolLM2-1.7B-Instruct-TIFA-Random: Flexible Question Generation for Text-to-Image Faithfulness Assessment},
author={kawchar85},
year={2025},
url={https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random}
}
```
## Model Series Comparison
| Model | Parameters | Dataset | Structure | Best For |
|-------|------------|---------|-----------|----------|
| [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) | 135M | 5k | Fixed Q1-Q4 | Quick evaluation, resource-constrained |
| [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA) | 360M | 10k | Fixed Q1-Q4 | Balanced performance |
| [1.7B](https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA) | 1.7B | 10k | Fixed Q1-Q4 | Structured evaluation |
| **1.7B-Random** | 1.7B | 18k | **Flexible** | **Research, natural evaluation** |