jsulz HF Staff committed on
Commit 688cbf7 · 1 Parent(s): da1c032

updating heatmap

Files changed (2)
  1. index.html +9 -25
  2. xorbs.json +0 -0
index.html CHANGED
@@ -3,7 +3,7 @@
  <head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
- <title>Repo-Level Dedupe Visualization</title>
+ <title>XLM-R Dedupe</title>
  <link rel="stylesheet" href="style.css" />
  <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
@@ -12,35 +12,19 @@
  <body>
  <div class="container">
  <div class="header">
- <h1>Visualizing Repo-Level Dedupe</h1>
+ <h1>Visualizing XLM-RoBERTa Finetune Dedupe</h1>
  <p>
- This visualization demonstrates block-level deduplication across all
- models in
- <a
- target="_blank"
- href="https://huggingface.co/bartowski/gemma-2-9b-it-GGUF"
- >bartowski/gemma-2-9b-it-GGUF</a
- >.
+ This heatmap shows deduplication across a family of fine-tuned models based on <a href="https://huggingface.co/papers/1911.02116">XLM-RoBERTa large</a>, a multilingual transformer introduced in 2019 and trained on 100 languages. Each row represents a model repository (which often contains multiple formats, e.g., Safetensors, Keras, PyTorch) derived from the original research. Repository data is chunked into blocks of up to 64MB in Xet's storage layer, and this heatmap visualizes those blocks across models.
  </p>
+
  <p>
- Each row represents a file in the repository grouped into blocks of up
- to 64MB. The color of each block represents the deduplication ratio
- for the block, which is a function of how often the chunks in the
- block are shared between files. The darker the color, the more
- frequently content is shared, the better the overall upload and
- download times for a given file! The deduplication savings here take a
- 191GB repo and cut it down to 97GB, helping to shave a few hours off
- the upload time.
+ The base model is <a href="https://huggingface.co/FacebookAI/xlm-roberta-large"><code>xlm-roberta-large</code></a>, while the others are fine-tuned for specific languages on the CoNLL NER datasets (Dutch, Spanish, English, German). Darker blue regions highlight content shared across models; the more overlap, the more efficient storage and transfer becomes. This level of deduplication leads to faster uploads, quicker iterations, and less friction when scaling experimentation.
  </p>
+
  <p>
- You can read more about chunks, blocks, and the nitty gritty details
- of how we make this all work in our accompanying
- <a
- target="_blank"
- href="https://huggingface.co/blog/from-chunks-to-blocks"
- >blog post</a
- >.
+ XLM-RoBERTa large currently has <a href="https://huggingface.co/models?other=base_model:finetune:FacebookAI/xlm-roberta-large">396 fine-tunes on the Hub</a>. The fine-tunes from the original CoNLL research deduplicate at ~17%, representing a substantial time savings for builders repeatedly pushing new checkpoints and variants.
  </p>
+
  To explore the visualization:
  <ul>
  <li>
@@ -119,7 +103,7 @@
  ],
  field: "dedupe_factor",
  type: "quantitative",
- scale: { scheme: "blues", domain: [0, 10] },
+ scale: { scheme: "blues", domain: [0, 5] },
  },
  opacity: {
  condition: [
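
For context on that last hunk, here is a minimal, self-contained sketch of a Vega-Lite heatmap wired up the way the spec above appears to be. Only the dedupe_factor field, the blues scheme, and the [0, 5] domain come from the diff; the rect mark, the x/y field names, the #vis container, and the vega-embed script are illustrative assumptions, not the actual contents of index.html.

<div id="vis"></div>
<script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
<script>
  // Sketch only: field names other than "dedupe_factor" are hypothetical,
  // as is the assumption that xorbs.json is a flat array of block records.
  const spec = {
    $schema: "https://vega.github.io/schema/vega-lite/v5.json",
    data: { url: "xorbs.json" },
    mark: "rect",
    encoding: {
      // Assumed layout: one row per repository, one column per 64MB block.
      x: { field: "block", type: "ordinal", title: "Block" },
      y: { field: "repo", type: "nominal", title: "Repository" },
      color: {
        field: "dedupe_factor",
        type: "quantitative",
        // The commit narrows the color domain from [0, 10] to [0, 5], so the
        // moderate sharing seen across these fine-tunes still reads as a
        // visibly darker blue.
        scale: { scheme: "blues", domain: [0, 5] },
      },
    },
  };
  vegaEmbed("#vis", spec);
</script>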
xorbs.json CHANGED
The diff for this file is too large to render. See raw diff
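
The removed copy above described a block's deduplication ratio as a function of how often the chunks in the block are shared between files, which is presumably what the dedupe_factor values behind the chart summarize. As a rough, hypothetical illustration of that kind of aggregation (the names and formula are assumptions, not the real Xet pipeline or the xorbs.json schema):

// Hypothetical sketch: average reference count of a block's chunks.
// block.chunks: chunk hashes making up one (up to 64MB) block
// chunkRefCounts: Map from chunk hash -> number of files referencing that chunk
function dedupeFactor(block, chunkRefCounts) {
  if (block.chunks.length === 0) return 0;
  const totalRefs = block.chunks.reduce(
    (sum, hash) => sum + (chunkRefCounts.get(hash) || 1),
    0
  );
  // 1.0 means every chunk is unique to one file; larger values mean more
  // cross-file sharing, which the heatmap renders as darker blue.
  return totalRefs / block.chunks.length;
}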