Update index by blog content #1
by yassminee - opened

index.html CHANGED (+288 -19)
@@ -1,19 +1,288 @@
- <!
- <html>
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Building a Mixed-Dialect Arabic Dataset for Summarization: MSA and Moroccan Darija</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            max-width: 900px;
            margin: 0 auto;
            padding: 20px;
        }
        h1, h2, h3 {
            color: #333;
        }
        img {
            max-width: 100%;
            height: auto;
            display: block;
            margin: 20px 0;
        }
        pre {
            background-color: #f5f5f5;
            padding: 15px;
            border-radius: 5px;
            overflow-x: auto;
        }
        code {
            font-family: monospace;
        }
        .figure-caption {
            font-style: italic;
            text-align: center;
            margin-top: 5px;
            margin-bottom: 20px;
        }
        a {
            color: #0366d6;
            text-decoration: none;
        }
        a:hover {
            text-decoration: underline;
        }
        .authors {
            font-weight: bold;
            margin-bottom: 20px;
        }
        .date {
            margin-bottom: 30px;
            color: #666;
        }
        .reference-list {
            margin-top: 20px;
        }
    </style>
</head>
<body>
    <h1>Building a Mixed-Dialect Arabic Dataset for Summarization: MSA and Moroccan Darija</h1>

    <div class="authors">Authors: Abir Harrasse, Yassmine ED-DYB</div>
    <div class="date">April 7, 2025</div>

    <p>In this post, we'll walk through how we created a specialized dataset for fine-tuning a small language model to summarize both Modern Standard Arabic (MSA) texts and Moroccan dialectal Arabic (Darija). We'll share the practical challenges we faced, our solutions, and code snippets for the key steps in the process.</p>

    <h2>Introduction</h2>
    <p>Fine-tuning small language models (SLMs) for summarization has gained significant attention, driven by the need for summarization systems that are both efficient and effective. This blog post focuses on our approach to creating a dataset that combines Modern Standard Arabic and dialectal Arabic for a summarization task.</p>

    <p>Our primary goal was to develop a dataset that can be used to fine-tune a model capable of generating high-quality summaries while operating within the constraints of a Google Colab free-tier GPU.</p>

    <h2>Dataset Selection and Preparation</h2>
    <p>For our project, we aimed to create a dataset that enables an SLM to summarize both Modern Standard Arabic (MSA) texts and dialectal Arabic. Given that our dataset consists of only 5,000 documents, we decided to focus specifically on the Moroccan dialect, Darija.</p>

    <p>We constructed our dataset using the following composition:</p>
    <ul>
        <li><strong>Moroccan Dialect (Darija)</strong>: 20% of the total fine-tuning dataset</li>
        <li><strong>Arabic Web Content</strong>: 60% of the dataset</li>
        <li><strong>Arabic Educational Content</strong>: 20% of the dataset</li>
    </ul>

    <p>This distribution was deliberately chosen to reflect real-world usage patterns: web content forms the majority because it is the most common source for summarization tasks, educational content (Wikipedia) provides more structured formal language, and dialectal content ensures the model can handle local variations in Arabic.</p>
+
|
84 |
+
<h3>Darija Samples (20%)</h3>
|
85 |
+
<p>We extracted Darija content from several open-source datasets:</p>
|
86 |
+
<ul>
|
87 |
+
<li>Initially, we explored the No-Arabic-Dialect-Left-Behind dataset by Atlasia, but it's no longer publicly available</li>
|
88 |
+
<li>The Darija_Dataset by JasperV13</li>
|
89 |
+
<li>We initially considered the DarijaStory dataset from MBZUAI-Paris but found it contained inappropriate content for our purposes</li>
|
90 |
+
</ul>
|
91 |
+
|
92 |
+
<p>Our first step was to analyze the dialect distribution in our initial dataset:</p>
|
93 |
+
|
94 |
+
<img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/ydam_HGDBg1xAiI8xKvmA.png" alt="Dialects Distribution">
|
95 |
+
<p class="figure-caption">Figure 1: Dialects' distribution across the No-Dialect-Left-Behind dataset</p>
|
96 |
+
|
97 |
+
<p>Based on this analysis, we decided to focus on Moroccan Darija. After filtering for Moroccan dialect texts, we analyzed the length distribution:</p>
|
98 |
+
|
99 |
+
<img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/5C6eDocY1GgXlUrhjPAyr.png" alt="Moroccan Dialect's Distribution">
|
100 |
+
<p class="figure-caption">Figure 2: Length distribution of Moroccan dialect samples</p>
|
101 |
+
|
102 |
+
<p>Our filtering process included:</p>
|
103 |
+
<ol>
|
104 |
+
<li>Removing any Latin words from the texts to retain only Arabic content</li>
|
105 |
+
<li>Analyzing text length to identify suitable candidates for summarization</li>
|
106 |
+
</ol>
|
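    <p>Both steps are straightforward to express in code. Below is a minimal sketch, assuming the raw Darija texts sit in a <code>datasets.Dataset</code> named <code>darija</code> with an illustrative <code>text</code> column (as in the sketch above):</p>

    <pre><code>import re
import pandas as pd

LATIN_WORD = re.compile(r"[A-Za-z]+")

def remove_latin_words(text):
    # Strip Latin-script tokens so that only Arabic content remains.
    cleaned = LATIN_WORD.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

darija = darija.map(lambda ex: {"text": remove_latin_words(ex["text"])})

# Inspect the character-length distribution to pick summarization candidates.
lengths = pd.Series([len(t) for t in darija["text"]])
print(lengths.describe())</code></pre>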
    <p>When working with the Darija_Dataset from JasperV13, we followed the same approach:</p>

    <img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/sB7B9MwdNBHnFNmasGQw5.png" alt="Length Distribution JasperV13">
    <p class="figure-caption">Figure 3: Length distribution of samples in the initial JasperV13/Darija_Dataset</p>

    <img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/qTgXl3EMXDCP-2qb2UlAY.png" alt="Length Distribution After Filtering">
    <p class="figure-caption">Figure 4: Length distribution of samples after removal of Latin words</p>

    <p>Initially, we selected the 300 longest documents for annotation. However, annotating these long documents proved time-consuming (approximately 4 hours) because we had to split each text into many chunks before feeding it to the annotation model.</p>

    <p>After this experience, we revised our strategy and set a maximum limit of 5,000 characters for all documents in our dataset.</p>
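    <p>In code, this cap is a one-line filter over the same hypothetical <code>text</code> column:</p>

    <pre><code>MAX_CHARS = 5000  # revised cap after the experiment with the 300 longest documents

# Keep only documents short enough to annotate without heavy chunking.
darija = darija.filter(lambda ex: len(ex["text"]) <= MAX_CHARS)</code></pre>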
    <h3>Arabic Web Content Samples (60%)</h3>
    <p>For this major portion of our dataset, we focused exclusively on the Arabic FineWeb2 dataset curated by Ali Elfilali. This dataset has already undergone extensive cleaning, filtering, and deduplication, which saved us significant preprocessing time.</p>

    <p>Our process included:</p>
    <ol>
        <li>Analyzing the length distribution of the dataset</li>
        <li>Filtering out texts containing Latin characters</li>
        <li>Setting a maximum threshold of 5,000 characters, consistent with our approach for Darija</li>
    </ol>

    <img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/cXjbBAt0kgLGsBiyZmdI2.png" alt="FineWeb2 Length Distribution">
    <p class="figure-caption">Figure 5: Length distribution of samples in the initial Arabic FineWeb2 data</p>

    <img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/Y8sz4xwo_cANUOUhu5gVG.png" alt="FineWeb2 After Filtering">
    <p class="figure-caption">Figure 6: Length distribution of samples after removal of Latin words</p>

    <h3>Arabic Educational Content Samples (20%)</h3>
    <p>For this section, we used an Arabic Wikipedia dump curated by Saied Alshahrani. Our process was similar to the previous sections:</p>

    <ol>
        <li>Exploring the length distribution of the data</li>
        <li>Removing texts with Latin words</li>
        <li>Applying the FineWeb2 pipeline to filter out data with excessive n-gram repetition, line repetition, or punctuation repetition (roughly approximated in the sketch after this list)</li>
    </ol>
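    <p>The repetition checks themselves live in the FineWeb2 pipeline; the snippet below is only a rough, hand-rolled approximation of them (duplicate lines, a dominant repeated n-gram, and punctuation-heavy text), with illustrative thresholds and a hypothetical <code>wiki</code> dataset holding the Wikipedia texts:</p>

    <pre><code>import re
from collections import Counter

def too_repetitive(text, n=3, max_dup_line_frac=0.3,
                   max_top_ngram_frac=0.2, max_punct_frac=0.3):
    # Rough approximation of the repetition checks described above;
    # thresholds are illustrative, not the official FineWeb2 values.
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    words = text.split()

    dup_line_frac = 1 - len(set(lines)) / len(lines) if lines else 0.0

    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_ngram_frac = (Counter(ngrams).most_common(1)[0][1] * n / len(words)
                      if ngrams else 0.0)

    punct_frac = len(re.findall(r"[^\w\s]", text)) / max(len(text), 1)

    return (dup_line_frac > max_dup_line_frac
            or top_ngram_frac > max_top_ngram_frac
            or punct_frac > max_punct_frac)

wiki = wiki.filter(lambda ex: not too_repetitive(ex["text"]))</code></pre>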
    <img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/s4phtAZOwH3ugcy9lWIQ-.png" alt="Wikipedia Length Distribution">
    <p class="figure-caption">Figure 7: Length distribution of samples in the initial Arabic Wikipedia data</p>

    <img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/hmshXuVxOGEfLpOpMVshj.png" alt="Wikipedia After Filtering">
    <p class="figure-caption">Figure 8: Length distribution of samples after removal of Latin words</p>

    <h2>Dataset Annotation</h2>

    <h3>Model Selection</h3>
    <p>We chose to perform synthetic annotation, using a large language model to generate summaries for our dataset. The constraints of the free-tier Colab environment limited our model selection.</p>

    <p>In theory, the free-tier Colab limitation meant we could not use models larger than 20B parameters, even under 4-bit quantization. However, in practice, we found that:</p>

    <ul>
        <li>Even for a 13B model like Jais 13B, the quantized version did not perform optimally</li>
        <li>Smaller models, such as AtlasChat 9B, did not achieve comparable performance</li>
    </ul>

    <p>After experimentation, we selected Jais 13B as our annotation model, which provided the best balance between performance and feasibility within our constraints. Generating reliable summaries with this model required approximately 20GB of GPU RAM, which we managed through careful memory management: clearing the CUDA cache and running garbage collection between generations.</p>
+
|
165 |
+
<h3>Annotation Process</h3>
|
166 |
+
<p>For annotation, we used an Alpaca format prompt structured as follows:</p>
|
167 |
+
|
168 |
+
<pre><code>### Instruction: قم بتلخيص المقال التالي بطريقة مختصرة وو��ضحة في أقل من 50 كلمة:
|
169 |
+
|
170 |
+
{text}
|
171 |
+
|
172 |
+
### الملخص:</code></pre>
|
    <p>For longer texts, we implemented chunking with a maximum of 1700 input tokens, given that our model has a context size of 2048 tokens. Here is the key code we used for our annotation process (it assumes the <code>tokenizer</code>, <code>model</code>, and <code>count_tokens</code> helper from the loading sketch above):</p>

    <pre><code>def safe_generate(text, max_input_tokens=1700):
    token_count = count_tokens(text)

    # Truncate inputs that would overflow the model's context window.
    if token_count > max_input_tokens:
        print(f"Input too long ({token_count} tokens). Truncating to {max_input_tokens} tokens.")
        tokens = tokenizer.encode(text)
        text = tokenizer.decode(tokens[:max_input_tokens])

    # Free any leftover GPU memory before generating.
    torch.cuda.empty_cache()
    gc.collect()

    prompt = f"### Instruction: قم بتلخيص المقال التالي بطريقة مختصرة وواضحة في أقل من 50 كلمة:\n\n{text}\n\n### الملخص:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding keeps generation deterministic and memory-friendly.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,
            num_beams=1
        )

    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Keep only the text generated after the summary marker.
    if "### الملخص:" in full_response:
        summary = full_response.split("### الملخص:")[1].strip()
    else:
        summary = full_response.replace(prompt, "").strip()

    # Release tensors and clear the cache before the next document.
    del inputs, outputs
    torch.cuda.empty_cache()
    gc.collect()

    return summary</code></pre>
    <p>For longer documents exceeding the token limit, we implemented a chunking mechanism:</p>

    <pre><code>import time

def process_article(text, max_chunk_tokens=1700):
    # Short documents can be summarized in a single pass.
    if count_tokens(text) <= max_chunk_tokens:
        return safe_generate(text, max_chunk_tokens)

    # Split the token sequence into fixed-size chunks that fit the context window.
    tokens = tokenizer.encode(text)
    chunks = []

    for i in range(0, len(tokens), max_chunk_tokens):
        chunk_tokens = tokens[i:i+max_chunk_tokens]
        chunk_text = tokenizer.decode(chunk_tokens)
        chunks.append(chunk_text)

    print(f"Split into {len(chunks)} chunks")

    # Summarize each chunk separately, then join the partial summaries.
    summaries = []
    for i, chunk in enumerate(chunks):
        result = safe_generate(chunk, max_chunk_tokens)
        summaries.append(result)

        torch.cuda.empty_cache()
        gc.collect()
        time.sleep(2)  # Adding a short delay to prevent GPU OOM errors

    return " ".join(summaries)</code></pre>
    <p>We observed that, due to the quantization of our model, performance decreased as the length of the input text increased, which further justified our chunking approach.</p>

    <p>To ensure quality, we validated the generated summaries as follows (the length and sampling checks are sketched after this list):</p>
    <ol>
        <li>Checking their lengths to ensure they were appropriately concise</li>
        <li>Manually reviewing a random sample to verify they captured the main points of the original text</li>
        <li>Ensuring they maintained the same language variety as the source (MSA or Darija)</li>
    </ol>
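    <p>Here is a minimal sketch of the length check and the review sample, assuming the annotated examples are collected in a pandas DataFrame with hypothetical <code>text</code>, <code>summary</code>, and <code>category</code> columns:</p>

    <pre><code>import pandas as pd

df = pd.DataFrame(annotated_rows)  # hypothetical list of dicts: text, summary, category

# 1. Length check: the prompt asks for summaries of fewer than 50 words.
df["summary_words"] = df["summary"].str.split().str.len()
print(df["summary_words"].describe())
print(f"Over 50 words: {(df['summary_words'] > 50).mean():.1%}")

# 2. Manual review: draw a reproducible random sample from each category.
sample = df.groupby("category").sample(n=5, random_state=42)
for _, row in sample.iterrows():
    print(row["category"], "|", row["summary"][:200])</code></pre>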
    <p>We generated summaries for all 5,000 documents and stored them in our <a href="https://huggingface.co/datasets/abir-hr196/mixed-darija-msa-summarization">mixed-darija-msa-summarization</a> dataset.</p>
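    <p>A minimal sketch of publishing such a dataset to the Hub, assuming the DataFrame <code>df</code> from the validation sketch above and a prior <code>huggingface-cli login</code>:</p>

    <pre><code>from datasets import Dataset

annotated = Dataset.from_pandas(df)  # df from the validation sketch above
annotated.push_to_hub("abir-hr196/mixed-darija-msa-summarization")</code></pre>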
    <h2>Data Splitting</h2>
    <p>To ensure proper structure for fine-tuning, we performed stratified data splitting, allocating 80% of the data for training and 20% for testing, with a fixed random seed to ensure reproducibility.</p>

    <p>Stratifying the data ensures that the proportion of each category (Darija, MSA from the web, MSA from Wikipedia) remains consistent between the train and test sets.</p>
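    <p>With the <code>datasets</code> API, this can be expressed roughly as follows, assuming the <code>category</code> column from the earlier sketches; <code>class_encode_column</code> turns it into a <code>ClassLabel</code> so it can drive stratification:</p>

    <pre><code># Encode the category column as a ClassLabel so it can be used for stratification.
annotated = annotated.class_encode_column("category")

splits = annotated.train_test_split(
    test_size=0.2,                  # 80% train / 20% test
    seed=42,                        # fixed seed for reproducibility
    stratify_by_column="category",  # keep category proportions in both splits
)
train_ds, test_ds = splits["train"], splits["test"]</code></pre>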
    <img src="https://cdn-uploads.huggingface.co/production/uploads/6728ab23f82f515dcc4d2653/uD4Ejo8BWBhUq68XIjDS1.png" alt="Category Distribution">
    <p class="figure-caption">Figure 9: Category distribution in the train and test sets after stratified sampling</p>

    <h2>Challenges and Limitations</h2>
    <p>Throughout the dataset creation process, we encountered several challenges:</p>

    <ol>
        <li><strong>Computational constraints</strong>: Running the 13B-parameter model required careful memory management. We implemented aggressive garbage collection and added delays between processing chunks to prevent GPU out-of-memory errors.</li>
        <li><strong>Text length management</strong>: Many texts in our initial dataset exceeded the model's context window, requiring us to implement chunking. This added complexity to the annotation process and potentially affected summary quality for very long documents.</li>
        <li><strong>Dialect representation</strong>: Finding high-quality, clean Darija text was challenging, as many datasets mixed Arabic script with Latin characters or contained inappropriate content.</li>
        <li><strong>Dataset limitations</strong>: Our 5,000-document dataset, while substantial, may not represent all variations of Arabic dialects. We focused specifically on Moroccan Darija, which limits the model's applicability to other Arabic dialects.</li>
    </ol>

    <h2>Conclusion</h2>
    <p>By carefully selecting, filtering, and annotating our data, we have created a balanced dataset of 5,000 documents combining Moroccan Darija and Modern Standard Arabic texts from both web content and educational sources. This dataset provides a solid foundation for fine-tuning a small language model for Arabic summarization tasks.</p>

    <p>The code and methodologies we have shared should be adaptable to other languages with similar diglossic situations (formal vs. dialectal variants), making this approach valuable beyond Arabic NLP.</p>

    <p>In the next part of this series, we will cover the model selection, fine-tuning process, and evaluation of our summarization model.</p>

    <h2>References</h2>
    <ol class="reference-list">
        <li>Saied Alshahrani. <a href="https://huggingface.co/datasets/SaiedAlshahrani/Arabic_Wikipedia_20230101_bots">Arabic Wikipedia Dataset (2023-01-01)</a></li>
        <li>Atlasia. <a href="https://huggingface.co/datasets/atlasia/No-Arabic-Dialect-Left-Behind">No-Arabic-Dialect-Left-Behind</a></li>
        <li>Ali Elfilali. <a href="https://huggingface.co/datasets/alielfilali01/fineweb-2-arb_Arab">FineWeb2 Arabic Subset</a></li>
        <li>JasperV13. <a href="https://huggingface.co/datasets/JasperV13/Darija_Dataset">Darija Dataset</a></li>
        <li>MBZUAI-Paris. <a href="https://huggingface.co/datasets/MBZUAI-Paris/DarijaStory">DarijaStory Dataset</a></li>
        <li>Guilherme Penedo et al. <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-2">FineWeb2: A sparkling update with 1000s of languages</a></li>
    </ol>
</body>
</html>