kawchar85 committed c311681 (verified · parent f82de8f): Update README.md

Files changed (1): README.md (+187, -3). The previous README held only the front matter `license: apache-2.0`. This commit expands it into the model card below.

---
license: apache-2.0
base_model:
- unsloth/SmolLM2-1.7B-Instruct
pipeline_tag: text-generation
tags:
- text-to-image-evaluation
- faithfulness
- lora
- tifa
- unsloth
- flexible-structure
language: en
---

# SmolLM2-1.7B-Instruct-TIFA-Random

## Model Description

SmolLM2-1.7B-Instruct-TIFA-Random is a fine-tuned version of [unsloth/SmolLM2-1.7B-Instruct](https://huggingface.co/unsloth/SmolLM2-1.7B-Instruct) trained to generate evaluation questions for **TIFA (Text-to-Image Faithfulness evaluation with question Answering)**. Unlike the earlier, structured models in this series, it was trained without a rigid question template. As a result, the questions it generates follow the content of each description instead of fixed slots, which makes it easier to apply to varied descriptions.

**Model Series**: [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) | [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA) | [1.7B-Structured](https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA) | **1.7B-Random**

## Key Innovation: Flexible Structure

This model replaces the rigid question structure of the earlier models with **flexible, natural question generation**:
- **Previous models**: Fixed Q1/Q2/Q3/Q4 structure with predetermined answer types
- **This model**: Questions chosen dynamically for visual verification, without structural constraints
- **Benefit**: More natural and diverse questions that adapt to the content of each description

## Intended Use

This model generates four visual verification questions per image description for text-to-image evaluation, covering:
- **Colors, shapes, objects, and materials**: core visual elements
- **Spatial relationships**: positioning and arrangement
- **Presence/absence verification**: what does or does not appear in the image
- **Mixed question types**: both yes/no and multiple-choice questions
- **Natural diversity**: questions that adapt to the description's content instead of following templates
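
In TIFA, the generated questions are answered by a visual question answering (VQA) model that looks at the generated image, and the faithfulness score is the fraction of questions answered correctly. The sketch below illustrates that scoring step using the lighthouse questions from the Example Outputs section. It assumes a hypothetical `answer_fn(image, question, choices)` callable standing in for a VQA model; this repository does not ship one, and the stub used here is purely illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TifaQuestion:
    question: str
    choices: List[str]
    answer: str


def tifa_score(
    image: object,
    questions: Sequence[TifaQuestion],
    answer_fn: Callable[[object, str, List[str]], str],
) -> float:
    """Fraction of questions whose VQA answer matches the expected answer.

    `answer_fn` is a hypothetical stand-in for a VQA model that picks one
    of `choices` after looking at `image`.
    """
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(
        answer_fn(image, q.question, q.choices).strip().lower() == q.answer.lower()
        for q in questions
    )
    return correct / len(questions)


questions = [
    TifaQuestion("What type of structure is prominently featured?",
                 ["windmill", "lighthouse", "tower", "castle"], "lighthouse"),
    TifaQuestion("What body of water is visible?",
                 ["lake", "river", "ocean", "pond"], "ocean"),
    TifaQuestion("Is the lighthouse positioned above the water?", ["no", "yes"], "yes"),
    TifaQuestion("Are there any mountains in the scene?", ["no", "yes"], "no"),
]


def fake_vqa(image, question, choices):
    """Hypothetical stub: behaves as if the image showed a lighthouse on a lake."""
    canned = {
        "What type of structure is prominently featured?": "lighthouse",
        "What body of water is visible?": "lake",
        "Is the lighthouse positioned above the water?": "yes",
        "Are there any mountains in the scene?": "no",
    }
    return canned[question]


print(tifa_score(None, questions, fake_vqa))  # 0.75
```

Because the stub answers "lake" instead of "ocean", three of four questions pass and the score is 0.75, which is how a single unfaithful detail in an image lowers its TIFA score.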

## Model Details

- **Base Model**: unsloth/SmolLM2-1.7B-Instruct
- **Model Size**: 1.7B parameters
- **Fine-tuning Method**: LoRA (r = 32, lora_alpha = 64; see Training Details), trained on flexible-structure question data
- **Training Framework**: Transformers + TRL + PEFT + Unsloth
- **License**: apache-2.0
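
To put the LoRA settings in perspective, the sketch below estimates how many parameters the adapter adds to the 1.7B base model. LoRA gives each adapted linear layer (d_in → d_out) two low-rank matrices, A (r × d_in) and B (d_out × r), so it adds r · (d_in + d_out) parameters, and the low-rank update is scaled by lora_alpha / r. The rank, alpha, and target modules come from Training Details. The base-model dimensions (hidden size 2048, MLP size 8192, 24 layers, and full multi-head attention so the k/v projections are 2048 → 2048) are assumptions taken from the published SmolLM2-1.7B configuration and should be checked against its `config.json`.

```python
# Assumed SmolLM2-1.7B dimensions; verify against the base model's config.json.
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # intermediate_size (MLP)
NUM_LAYERS = 24      # num_hidden_layers
KV_DIM = 2048        # num_key_value_heads * head_dim (assumed: no grouped-query attention)

# LoRA settings listed under Training Details.
R = 32
LORA_ALPHA = 64

# (d_in, d_out) for each target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in LoRA's A (r x d_in) and B (d_out x r) matrices."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # multiplier applied to the low-rank update B @ A

print(f"{per_layer:,}")  # 1,507,328
print(f"{total:,}")      # 36,175,872
print(scaling)           # 2.0
```

Under these assumptions the adapter trains roughly 36M parameters, about 2% of the base model. With PEFT, `model.print_trainable_parameters()` on the wrapped model reports the exact figure.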

## Training Details

### Training Configuration
- **Training Method**: Supervised fine-tuning (SFT) with a category-balanced validation set
- **LoRA Configuration**:
  - r: 32
  - lora_alpha: 64
  - lora_dropout: 0.05
  - Target modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`

- **Training Parameters**:
  - Epochs: 2
  - Learning rate: 5e-5
  - Batch size: 16
  - Gradient accumulation: 2 steps (effective batch size: 32)
  - Max sequence length: 1024
  - LR scheduler: cosine with 3% warmup
  - Validation: category-balanced evaluation every 250 steps
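
The card does not say how the category-balanced validation split was built. A common way to get such a split is stratification: group the examples by category and hold out the same fraction from each group, so every category appears in validation in proportion to its size. The sketch below assumes each example carries a hypothetical `category` field and uses an arbitrary validation fraction. Neither detail is stated in the card, and this is not necessarily the procedure used for this model.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def category_balanced_split(
    examples: List[Dict],
    val_fraction: float = 0.05,
    seed: int = 42,
) -> Tuple[List[Dict], List[Dict]]:
    """Stratified split: hold out `val_fraction` of each category.

    Assumes every example has a hypothetical "category" key.
    """
    by_category: Dict[str, List[Dict]] = defaultdict(list)
    for ex in examples:
        by_category[ex["category"]].append(ex)

    rng = random.Random(seed)
    train: List[Dict] = []
    val: List[Dict] = []
    for category in sorted(by_category):  # sorted for a reproducible order
        group = by_category[category][:]
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_fraction))  # keep >= 1 per category
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Toy data: three hypothetical categories of unequal size.
data = (
    [{"id": i, "category": "color"} for i in range(600)]
    + [{"id": i, "category": "spatial"} for i in range(600, 900)]
    + [{"id": i, "category": "material"} for i in range(900, 1000)]
)
train, val = category_balanced_split(data, val_fraction=0.1)

counts: Dict[str, int] = defaultdict(int)
for ex in val:
    counts[ex["category"]] += 1
print(len(train), len(val), dict(counts))
# 900 100 {'color': 60, 'material': 10, 'spatial': 30}
```

A plain random 10% split could, by chance, under-sample the small `material` category; stratifying fixes its share at 10 examples.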

### Dataset
- **Size**: 18,000 examples
- **Structure**: Flexible question generation without rigid templates
- **Validation**: Category-balanced split, so every category is represented in the validation set
- **Coverage**: Diverse visual elements, materials, spatial relationships, and verification tasks
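
Combining the dataset size with the training parameters gives a rough picture of the schedule. The sketch below assumes all 18,000 examples go to training (an upper bound, since part of the data is held out for validation) and uses the common linear-warmup plus cosine-decay formula with the stated peak learning rate of 5e-5 and 3% warmup. It approximates the run's step accounting rather than reproducing the trainer's exact bookkeeping.

```python
import math


def cosine_lr_with_warmup(step: int, total_steps: int,
                          peak_lr: float = 5e-5, warmup_ratio: float = 0.03) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to 0 at `total_steps`."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: all 18,000 examples are used for training. This is an upper
# bound, because part of the data is held out for validation.
NUM_EXAMPLES = 18_000
BATCH_SIZE, GRAD_ACCUM, EPOCHS = 16, 2, 2
EVAL_EVERY = 250

effective_batch = BATCH_SIZE * GRAD_ACCUM                    # 32
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)  # 563
total_steps = steps_per_epoch * EPOCHS                       # 1126
warmup_steps = math.ceil(total_steps * 0.03)                 # 34
num_evals = total_steps // EVAL_EVERY                        # 4

# Learning rate at the start, peak, midpoint, and end of training.
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr_with_warmup(step, total_steps):.2e}")
```

At this scale the 3% warmup lasts only a few dozen optimizer steps, and evaluating every 250 steps yields about four validation checkpoints over the two epochs.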

## Usage

### Installation

`accelerate` is required for the `device_map="auto"` argument used in the example below.

```bash
pip install transformers torch accelerate
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_path = "kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# Decoder-only models should be left-padded when prompts are batched
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # float16 assumes a GPU; consider torch.float32 on CPU
    trust_remote_code=True,
    device_map="auto",          # requires the `accelerate` package
)

# Chat-aware text-generation pipeline; it applies the model's chat template
chat_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # return only the newly generated text
)

def get_message(description):
    """Build the system and user chat messages for one image description."""
    system = """\
You are a TIFA (Text-to-Image Faithfulness evaluation with question Answering) question generator. Given an image description, create exactly 4 visual verification questions with multiple choice answers. Each question should test different visual aspects that can be verified by looking at the image.

Guidelines:
- Focus on colors, shapes, objects, materials, spatial relationships, and other visually verifiable elements
- Mix yes/no questions (2 choices: "no", "yes") and multiple choice questions (4 choices)
- Each question should test a DIFFERENT aspect of the description
- Ensure questions can be answered by visual inspection of the image
- Use elements explicitly mentioned in the description
- Include both positive verification (testing presence, answer: "yes") and negative verification (testing absence, answer: "no")
- Make distractors realistic and relevant to the domain

Format each question as:
Q[number]: [question text]
C: [comma-separated choices]
A: [correct answer]

Generate questions that test visual faithfulness between the description and image."""

    user_msg = f'Create 4 visual verification questions for this description: "{description}"'
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_msg},
    ]

# Generate evaluation questions
description = "a lighthouse overlooking the ocean"
messages = get_message(description)

output = chat_pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,  # greedy decoding for reproducible questions
)

print(output[0]["generated_text"])
```
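
The pipeline returns plain text in the `Q[number]: / C: / A:` format requested by the system prompt. Before the questions can be passed to a VQA model, that text has to be parsed into records. The following minimal sketch assumes the model follows the format line by line; `parse_tifa_output` is a hypothetical helper, not part of this repository.

```python
import re
from typing import Dict, List

# Matches question lines such as "Q1: What body of water is visible?"
Q_LINE = re.compile(r"^Q(\d+):\s*(.+)$")


def parse_tifa_output(text: str) -> List[Dict]:
    """Parse Q/C/A blocks into dicts with id, question, choices, and answer."""
    items: List[Dict] = []
    current: Dict = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = Q_LINE.match(line)
        if match:
            current = {"id": int(match.group(1)), "question": match.group(2).strip()}
            items.append(current)
        elif line.startswith("C:") and current:
            current["choices"] = [c.strip() for c in line[2:].split(",") if c.strip()]
        elif line.startswith("A:") and current:
            current["answer"] = line[2:].strip()
    # Drop incomplete blocks, e.g. output cut off by max_new_tokens.
    return [q for q in items if "choices" in q and "answer" in q]


sample = """Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse

Q3: Is the lighthouse positioned above the water?
C: no, yes
A: yes

Q4: Are there"""

for q in parse_tifa_output(sample):
    print(q)
```

Splitting choices on commas assumes no choice itself contains a comma, which holds for the outputs shown in this card but is not guaranteed in general.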

### Example Outputs

**For "a lighthouse overlooking the ocean":**
```
Q1: What type of structure is prominently featured?
C: windmill, lighthouse, tower, castle
A: lighthouse

Q2: What body of water is visible?
C: lake, river, ocean, pond
A: ocean

Q3: Is the lighthouse positioned above the water?
C: no, yes
A: yes

Q4: Are there any mountains in the scene?
C: no, yes
A: no
```
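
The system prompt in Basic Usage sets concrete rules that an output like the one above should satisfy: exactly four questions, yes/no questions with the choices `no, yes`, multiple-choice questions with four choices, an answer drawn from the listed choices, a mix of both question types, and both a positive ("yes") and a negative ("no") check. The checker sketched below flags generations that break these rules before they are used for scoring. It assumes the questions are already parsed into dicts with `question`, `choices`, and `answer` keys, and its rule set is one reading of the prompt, not a validator published with the model.

```python
from typing import Dict, List

YES_NO = ["no", "yes"]


def check_question_set(questions: List[Dict]) -> List[str]:
    """Return a list of rule violations (empty if the set looks valid)."""
    problems: List[str] = []
    if len(questions) != 4:
        problems.append(f"expected 4 questions, got {len(questions)}")

    kinds = set()
    yes_no_answers = set()
    for i, q in enumerate(questions, start=1):
        choices, answer = q["choices"], q["answer"]
        if answer not in choices:
            problems.append(f"Q{i}: answer {answer!r} not among choices")
        if sorted(choices) == YES_NO:  # "no", "yes" in either order
            kinds.add("yes/no")
            yes_no_answers.add(answer)
        elif len(choices) == 4:
            kinds.add("multiple choice")
        else:
            problems.append(f"Q{i}: expected a yes/no pair or 4 choices, got {len(choices)}")

    if kinds != {"yes/no", "multiple choice"}:
        problems.append("question types are not mixed")
    if not {"yes", "no"} <= yes_no_answers:
        problems.append("missing a positive ('yes') or negative ('no') check")
    if len({q["question"] for q in questions}) != len(questions):
        problems.append("duplicate questions")
    return problems


lighthouse = [
    {"question": "What type of structure is prominently featured?",
     "choices": ["windmill", "lighthouse", "tower", "castle"], "answer": "lighthouse"},
    {"question": "What body of water is visible?",
     "choices": ["lake", "river", "ocean", "pond"], "answer": "ocean"},
    {"question": "Is the lighthouse positioned above the water?",
     "choices": ["no", "yes"], "answer": "yes"},
    {"question": "Are there any mountains in the scene?",
     "choices": ["no", "yes"], "answer": "no"},
]
print(check_question_set(lighthouse))  # []

only_yes = [
    {"question": f"Is there object {i}?", "choices": ["no", "yes"], "answer": "yes"}
    for i in range(4)
]
print(check_question_set(only_yes))
```

The lighthouse example passes every check, while a set of four yes/no questions all answered "yes" is flagged both for not mixing question types and for lacking a negative check.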

## Citation

```bibtex
@misc{smollm2-1-7b-it-tifa-random-2025,
  title={SmolLM2-1.7B-Instruct-TIFA-Random: Flexible Question Generation for Text-to-Image Faithfulness Assessment},
  author={kawchar85},
  year={2025},
  url={https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA-Random}
}
```

## Model Series Comparison

| Model | Parameters | Dataset Size | Structure | Best For |
|-------|------------|--------------|-----------|----------|
| [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) | 135M | 5k | Fixed Q1-Q4 | Quick evaluation in resource-constrained settings |
| [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA) | 360M | 10k | Fixed Q1-Q4 | Balanced performance |
| [1.7B-Structured](https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA) | 1.7B | 10k | Fixed Q1-Q4 | Structured evaluation |
| **1.7B-Random** (this model) | 1.7B | 18k | **Flexible** | **Research, natural evaluation** |