---
license: apache-2.0
base_model:
- unsloth/SmolLM2-1.7B-Instruct
pipeline_tag: text-generation
tags:
  - text-to-image-evaluation
  - faithfulness
  - lora
  - tifa
  - unsloth
  - flexible-structure
language: en
---

# SmolLM2-1.7B-Instruct-TIFA-Random

## Model Description

SmolLM2-1.7B-Instruct-TIFA-Random is a fine-tuned version of [unsloth/SmolLM2-1.7B-Instruct](https://huggingface.co/unsloth/SmolLM2-1.7B-Instruct) trained specifically for **TIFA (Text-to-Image Faithfulness Assessment)** with flexible question generation. Unlike the earlier structured versions, this model generates diverse, natural evaluation questions without rigid formatting constraints, making it more adaptable to a wide range of evaluation scenarios.

**Model Series**: [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) | [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA) | [1.7B-Structured](https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA) | **1.7B-Random**

## Key Innovation: Flexible Structure

This model represents a paradigm shift from rigid question structures to **flexible, natural question generation**:
- **Previous models**: Fixed Q1/Q2/Q3/Q4 structure with predetermined answer types
- **This model**: Dynamic question generation focusing on visual verification without structural constraints
- **Benefit**: More natural, diverse questions that better reflect real-world evaluation needs

## Intended Use

This model generates 4 visual verification questions for text-to-image evaluation, focusing on:
- **Colors, shapes, objects, materials** - Core visual elements
- **Spatial relationships** - Positioning and arrangement  
- **Presence/absence verification** - What exists or doesn't exist
- **Mixed question types** - Both yes/no and multiple choice questions
- **Natural diversity** - Questions adapt to description content rather than following templates

## Model Details

- **Base Model**: unsloth/SmolLM2-1.7B-Instruct
- **Model Size**: 1.7B parameters
- **Fine-tuning Method**: Enhanced LoRA with flexible structure training
- **Training Framework**: Transformers + TRL + PEFT + Unsloth
- **License**: apache-2.0

## Training Details

### Advanced Training Configuration
- **Training Method**: Supervised Fine-Tuning with category-balanced validation
- **Enhanced LoRA Configuration**:
  - r: 32
  - lora_alpha: 64 
  - lora_dropout: 0.05
  - Target modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
  
- **Optimized Training Parameters**:
  - Epochs: 2 
  - Learning Rate: 5e-5
  - Batch Size: 16
  - Gradient Accumulation: 2 steps (effective batch size: 32)
  - Max Sequence Length: 1024
  - LR Scheduler: Cosine with 3% warmup
  - Validation: Category-balanced evaluation every 250 steps
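
For reference, the hyperparameters listed above map onto a standard PEFT/Transformers setup roughly as sketched below. The original training script is not included with this card, so details such as `output_dir` and the exact evaluation arguments are illustrative assumptions.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings as listed above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Optimizer and schedule settings as listed above (output_dir is illustrative)
training_args = TrainingArguments(
    output_dir="smollm2-1.7b-tifa-random",
    num_train_epochs=2,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size: 32
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    eval_strategy="steps",           # category-balanced validation
    eval_steps=250,
)
```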

### Enhanced Dataset
- **Size**: 18,000 examples
- **Structure**: Flexible question generation without rigid templates
- **Validation**: Category-balanced split ensuring robust evaluation
- **Coverage**: Diverse visual elements, materials, spatial relationships, and verification tasks

## Usage

### Installation

```bash
pip install transformers torch accelerate
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_path = "kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)

# Create pipeline
chat_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
)

def get_message(description):
    system = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.

Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain

Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]

Generate questions that test visual faithfulness between the description and image."""
    
    user_msg = f'Create 4 visual verification questions for this description: "{description}"'
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_msg}
    ]

# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
messages = get_message(description)

output = chat_pipe(
    messages, 
    max_new_tokens=256,
    do_sample=False,
)

print(output[0]["generated_text"])
```
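
Greedy decoding (as above) produces the same questions for a given description on every run. If you want more varied question sets, sampling can be enabled; the decoding values below are illustrative defaults, not settings used during training or for the examples that follow.

```python
# Optional: sample instead of greedy decoding for more varied question sets
varied = chat_pipe(
    messages,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,   # illustrative values
    top_p=0.9,
)
print(varied[0]["generated_text"])
```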

### Example Outputs

**For "a lighthouse overlooking the ocean":**
```
Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse

Q2: What body of water is visible?
C: lake, river, ocean, pond
A: ocean

Q3: Is the lighthouse positioned above the water?
C: no, yes
A: yes

Q4: Are there any mountains in the scene?
C: no, yes
A: no
```
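
Because the model emits questions in the plain `Q[number]: / C: / A:` layout shown above, the output can be parsed into structured records for automated scoring. The helper below is a small sketch (not part of the released code); `output` refers to the result of the Basic Usage snippet.

```python
import re

def parse_questions(text):
    """Split the model's Q/C/A blocks into dicts with question, choices, answer."""
    pattern = re.compile(
        r"Q\d+:\s*(?P<question>.+?)\s*\n"
        r"C:\s*(?P<choices>.+?)\s*\n"
        r"A:\s*(?P<answer>.+?)\s*(?:\n|$)"
    )
    return [
        {
            "question": m.group("question").strip(),
            "choices": [c.strip() for c in m.group("choices").split(",")],
            "answer": m.group("answer").strip(),
        }
        for m in pattern.finditer(text)
    ]

# Example: parse the text generated by the Basic Usage snippet
questions = parse_questions(output[0]["generated_text"])
for q in questions:
    print(q["question"], "->", q["choices"], "| answer:", q["answer"])
```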

## Citation

```bibtex
@misc{smollm2-1-7b-it-tifa-random-2025,
  title={SmolLM2-1.7B-Instruct-TIFA-Random: Flexible Question Generation for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random}
}
```

## Model Series Comparison

| Model | Parameters | Training Examples | Structure | Best For |
|-------|------------|---------|-----------|----------|
| [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) | 135M | 5k | Fixed Q1-Q4 | Quick evaluation, resource-constrained |
| [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA) | 360M | 10k | Fixed Q1-Q4 | Balanced performance |
| [1.7B](https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA) | 1.7B | 10k | Fixed Q1-Q4 | Structured evaluation |
| **1.7B-Random** | 1.7B | 18k | **Flexible** | **Research, natural evaluation** |