vrashad committed on
Commit d68ede2 · verified · 1 Parent(s): b4065ec

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,216 +1,8 @@
- ---
- license: cc-by-4.0
- language:
- - az
- base_model:
- - FacebookAI/xlm-roberta-base
- pipeline_tag: token-classification
- tags:
- - personally identifiable information
- - pii
- - ner
- - azerbaijan
- ---
-
-
- # PII NER Azerbaijani v2
-
- **PII NER Azerbaijani** is a second version of fine-tuned Named Entity Recognition (NER) model (First version: <a target="_blank" href="https://huggingface.co/LocalDoc/private_ner_azerbaijani">PII NER Azerbaijani</a>) based on XLM-RoBERTa.
- It is trained on Azerbaijani pii data for classification personally identifiable information such as names, dates of birth, cities, addresses, and phone numbers from text.
-
- ## Model Details
-
- - **Base Model:** XLM-RoBERTa
- - **Training Metrics:**
-
- | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
- |-------|----------------|------------------|-----------|---------|----------|
- | 1 | 0.029100 | 0.025319 | 0.963367 | 0.962449| 0.962907 |
- | 2 | 0.019900 | 0.023291 | 0.964567 | 0.968474| 0.966517 |
- | 3 | 0.015400 | 0.018993 | 0.969536 | 0.967555| 0.968544 |
- | 4 | 0.012700 | 0.017730 | 0.971919 | 0.969768| 0.970842 |
- | 5 | 0.011100 | 0.018095 | 0.973056 | 0.970075| 0.971563 |
-
- - **Test Metrics:**
-
- - **Precision:** 0.9760
- - **Recall:** 0.9732
- - **F1 Score:** 0.9746
-
-
- ## Detailed Test Classification Report
-
- | Entity | Precision | Recall | F1-score | Support |
- |---------------------|-----------|--------|----------|---------|
- | AGE | 0.98 | 0.98 | 0.98 | 509 |
- | BUILDINGNUM | 0.97 | 0.75 | 0.85 | 1285 |
- | CITY | 1.00 | 1.00 | 1.00 | 2100 |
- | CREDITCARDNUMBER | 0.99 | 0.98 | 0.99 | 249 |
- | DATE | 0.85 | 0.92 | 0.88 | 1576 |
- | DRIVERLICENSENUM | 0.98 | 0.98 | 0.98 | 258 |
- | EMAIL | 0.98 | 1.00 | 0.99 | 1485 |
- | GIVENNAME | 0.99 | 1.00 | 0.99 | 9926 |
- | IDCARDNUM | 0.99 | 0.99 | 0.99 | 1174 |
- | PASSPORTNUM | 0.99 | 0.99 | 0.99 | 426 |
- | STREET | 0.94 | 0.98 | 0.96 | 1480 |
- | SURNAME | 1.00 | 1.00 | 1.00 | 3357 |
- | TAXNUM | 0.99 | 1.00 | 0.99 | 240 |
- | TELEPHONENUM | 0.97 | 0.95 | 0.96 | 2175 |
- | TIME | 0.96 | 0.96 | 0.96 | 2216 |
- | ZIPCODE | 0.97 | 0.97 | 0.97 | 520 |
-
-
- ### Averages
-
- | Metric | Precision | Recall | F1-score | Support |
- |---------------|-----------|--------|----------|---------|
- | **Micro avg** | 0.98 | 0.97 | 0.97 | 28976 |
- | **Macro avg** | 0.97 | 0.96 | 0.97 | 28976 |
- | **Weighted avg** | 0.98 | 0.97 | 0.97 | 28976 |
-
-
- ## A list of entities that the model is able to recognize.
-
- ```python
- [
- "AGE",
- "BUILDINGNUM",
- "CITY",
- "CREDITCARDNUMBER",
- "DATE",
- "DRIVERLICENSENUM",
- "EMAIL",
- "GIVENNAME",
- "IDCARDNUM",
- "PASSPORTNUM",
- "STREET",
- "SURNAME",
- "TAXNUM",
- "TELEPHONENUM",
- "TIME",
- "ZIPCODE"
- ]
- ```
-
- ## Usage
-
- To use the model for spell correction:
-
- ```python
- import torch
- from transformers import AutoTokenizer, AutoModelForTokenClassification
-
- model_id = "LocalDoc/private_ner_azerbaijani_v2"
-
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForTokenClassification.from_pretrained(model_id)
-
- test_text = (
-     "Salam, mənim adım Əli Hüseynovdur. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, Nizami küçəsində, 25/31 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir."
- )
-
- inputs = tokenizer(test_text, return_tensors="pt", return_offsets_mapping=True)
-
- offset_mapping = inputs.pop("offset_mapping")
-
- with torch.no_grad():
-     outputs = model(**inputs)
-
- predictions = torch.argmax(outputs.logits, dim=2)
-
- tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
- offset_mapping = offset_mapping[0].tolist()
- predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
- word_ids = inputs.word_ids(batch_index=0)
-
- aggregated = []
- prev_word_id = None
- for idx, word_id in enumerate(word_ids):
-     if word_id is None:
-         continue
-     if word_id != prev_word_id:
-         aggregated.append({
-             "word_id": word_id,
-             "tokens": [tokens[idx]],
-             "offsets": [offset_mapping[idx]],
-             "label": predicted_labels[idx]
-         })
-     else:
-         aggregated[-1]["tokens"].append(tokens[idx])
-         aggregated[-1]["offsets"].append(offset_mapping[idx])
-     prev_word_id = word_id
-
- entities = []
- current_entity = None
- for word in aggregated:
-     if word["label"] == "O":
-         if current_entity is not None:
-             entities.append(current_entity)
-             current_entity = None
-     else:
-         if current_entity is None:
-             current_entity = {
-                 "type": word["label"],
-                 "start": word["offsets"][0][0],
-                 "end": word["offsets"][-1][1]
-             }
-         else:
-             if word["label"] == current_entity["type"]:
-                 current_entity["end"] = word["offsets"][-1][1]
-             else:
-                 entities.append(current_entity)
-                 current_entity = {
-                     "type": word["label"],
-                     "start": word["offsets"][0][0],
-                     "end": word["offsets"][-1][1]
-                 }
- if current_entity is not None:
-     entities.append(current_entity)
-
- for entity in entities:
-     entity["text"] = test_text[entity["start"]:entity["end"]]
-
- for entity in entities:
-     print(entity)
- ```
-
- ```json
- {'type': 'FIRSTNAME', 'start': 18, 'end': 21, 'text': 'Əli'}
- {'type': 'LASTNAME', 'start': 22, 'end': 34, 'text': 'Hüseynovdur.'}
- {'type': 'DOB', 'start': 49, 'end': 64, 'text': '15.05.1990-dır.'}
- {'type': 'STREET', 'start': 81, 'end': 87, 'text': 'Nizami'}
- {'type': 'BUILDINGNUMBER', 'start': 99, 'end': 104, 'text': '25/31'}
- {'type': 'PHONENUMBER', 'start': 141, 'end': 159, 'text': '+994552345678-dir.'}
- ```
-
-
- ## CC BY 4.0 License — What It Allows
-
- The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license allows:
-
- ### ✅ You Can:
- - **Use** the model for any purpose, including commercial use.
- - **Share** it — copy and redistribute in any medium or format.
- - **Adapt** it — remix, transform, and build upon it for any purpose, even commercially.
-
- ### 📝 You Must:
- - **Give appropriate credit** — Attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
- - **Not imply endorsement** — Do not suggest the original author endorses you or your use.
-
- ### ❌ You Cannot:
- - Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).
-
-
- ### Summary:
- You are free to use, modify, and distribute the model — even for commercial purposes — as long as you give proper credit to the original creator.
-
-
- For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY-NC-ND 4.0 license</a>.
-
-
- ## Contact
-
- For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].
+ # Azerbaijani NER Model
+
+ This is a Named Entity Recognition model for Azerbaijani language based on XLM-RoBERTa.
+
+ ## Model details
+ - Base model: xlm-roberta-base
+ - Trained for Named Entity Recognition
+ - Language: Azerbaijani
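The replacement README above is terse; the model's output format is defined by the BIO label set in the `config.json` added in this same commit. As a minimal sketch (not the repo's own code) of how word-level BIO tags are typically merged into entity spans, assuming word-aligned labels and purely illustrative data:

```python
# Hypothetical post-processing sketch: group word-level BIO tags
# (the tag scheme used in this repo's config.json) into entity spans.
# The words and labels below are illustrative, not real model output.

def bio_to_spans(words, labels):
    """Collect (entity_type, text) spans from parallel word/label lists."""
    spans = []
    current_type, current_words = None, []
    for word, label in zip(words, labels):
        if label == "O":
            # Outside any entity: close the span in progress, if any.
            if current_type:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        elif label.startswith("B-") or label[2:] != current_type:
            # A B- tag, or an I- tag whose type changed: start a new span.
            if current_type:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [word]
        else:
            # I- tag continuing the current entity.
            current_words.append(word)
    if current_type:
        spans.append((current_type, " ".join(current_words)))
    return spans

words = ["Əli", "Hüseynov", "Bakı", "şəhərində", "yaşayır"]
labels = ["B-GIVENNAME", "B-SURNAME", "B-CITY", "O", "O"]
print(bio_to_spans(words, labels))
# → [('GIVENNAME', 'Əli'), ('SURNAME', 'Hüseynov'), ('CITY', 'Bakı')]
```

In practice the `transformers` token-classification pipeline performs equivalent aggregation; this sketch only makes the BIO convention behind the config's label table explicit.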
config.json ADDED
@@ -0,0 +1,98 @@
+ {
+ "_name_or_path": "./results/checkpoint-15080",
+ "architectures": [
+ "XLMRobertaForTokenClassification"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "bos_token_id": 0,
+ "classifier_dropout": null,
+ "eos_token_id": 2,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "id2label": {
+ "0": "O",
+ "1": "B-AGE",
+ "2": "B-BUILDINGNUM",
+ "3": "B-CITY",
+ "4": "B-CREDITCARDNUMBER",
+ "5": "B-DATE",
+ "6": "B-DRIVERLICENSENUM",
+ "7": "B-EMAIL",
+ "8": "B-GIVENNAME",
+ "9": "B-IDCARDNUM",
+ "10": "B-PASSPORTNUM",
+ "11": "B-STREET",
+ "12": "B-SURNAME",
+ "13": "B-TAXNUM",
+ "14": "B-TELEPHONENUM",
+ "15": "B-TIME",
+ "16": "B-ZIPCODE",
+ "17": "I-AGE",
+ "18": "I-BUILDINGNUM",
+ "19": "I-CITY",
+ "20": "I-CREDITCARDNUMBER",
+ "21": "I-DATE",
+ "22": "I-DRIVERLICENSENUM",
+ "23": "I-EMAIL",
+ "24": "I-GIVENNAME",
+ "25": "I-IDCARDNUM",
+ "26": "I-PASSPORTNUM",
+ "27": "I-STREET",
+ "28": "I-SURNAME",
+ "29": "I-TAXNUM",
+ "30": "I-TELEPHONENUM",
+ "31": "I-TIME",
+ "32": "I-ZIPCODE"
+ },
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "label2id": {
+ "B-AGE": 1,
+ "B-BUILDINGNUM": 2,
+ "B-CITY": 3,
+ "B-CREDITCARDNUMBER": 4,
+ "B-DATE": 5,
+ "B-DRIVERLICENSENUM": 6,
+ "B-EMAIL": 7,
+ "B-GIVENNAME": 8,
+ "B-IDCARDNUM": 9,
+ "B-PASSPORTNUM": 10,
+ "B-STREET": 11,
+ "B-SURNAME": 12,
+ "B-TAXNUM": 13,
+ "B-TELEPHONENUM": 14,
+ "B-TIME": 15,
+ "B-ZIPCODE": 16,
+ "I-AGE": 17,
+ "I-BUILDINGNUM": 18,
+ "I-CITY": 19,
+ "I-CREDITCARDNUMBER": 20,
+ "I-DATE": 21,
+ "I-DRIVERLICENSENUM": 22,
+ "I-EMAIL": 23,
+ "I-GIVENNAME": 24,
+ "I-IDCARDNUM": 25,
+ "I-PASSPORTNUM": 26,
+ "I-STREET": 27,
+ "I-SURNAME": 28,
+ "I-TAXNUM": 29,
+ "I-TELEPHONENUM": 30,
+ "I-TIME": 31,
+ "I-ZIPCODE": 32,
+ "O": 0
+ },
+ "layer_norm_eps": 1e-05,
+ "max_position_embeddings": 514,
+ "model_type": "xlm-roberta",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "output_past": true,
+ "pad_token_id": 1,
+ "position_embedding_type": "absolute",
+ "torch_dtype": "float32",
+ "transformers_version": "4.38.0",
+ "type_vocab_size": 1,
+ "use_cache": true,
+ "vocab_size": 250002
+ }
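The `id2label` table in this config follows a regular layout: id 0 is `O`, ids 1-16 are the `B-` tags, and ids 17-32 are the matching `I-` tags, in the alphabetical entity order listed. A small sketch (with a hypothetical id sequence, not real model output) that rebuilds the table from that layout and decodes predicted class ids:

```python
# Sketch: reconstruct config.json's id2label layout and use it to
# decode a hypothetical sequence of per-token argmax class ids.

ENTITIES = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER", "DATE",
    "DRIVERLICENSENUM", "EMAIL", "GIVENNAME", "IDCARDNUM",
    "PASSPORTNUM", "STREET", "SURNAME", "TAXNUM", "TELEPHONENUM",
    "TIME", "ZIPCODE",
]

id2label = {0: "O"}
for i, name in enumerate(ENTITIES):
    id2label[1 + i] = f"B-{name}"    # ids 1-16: begin tags
    id2label[17 + i] = f"I-{name}"   # ids 17-32: inside tags

# Illustrative (not real) per-token class ids, e.g. from logits.argmax(-1).
pred_ids = [0, 8, 24, 12, 0]
tags = [id2label[i] for i in pred_ids]
print(tags)  # → ['O', 'B-GIVENNAME', 'I-GIVENNAME', 'B-SURNAME', 'O']
```

When loading the model through `transformers`, the same mapping is available directly as `model.config.id2label` (with string keys in the raw JSON).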
label_mapping.json ADDED
@@ -0,0 +1,90 @@
+ {
+ "id_to_label": {
+ "0": "O",
+ "1": "B-AGE",
+ "2": "B-BUILDINGNUM",
+ "3": "B-CITY",
+ "4": "B-CREDITCARDNUMBER",
+ "5": "B-DATE",
+ "6": "B-DRIVERLICENSENUM",
+ "7": "B-EMAIL",
+ "8": "B-GIVENNAME",
+ "9": "B-IDCARDNUM",
+ "10": "B-PASSPORTNUM",
+ "11": "B-STREET",
+ "12": "B-SURNAME",
+ "13": "B-TAXNUM",
+ "14": "B-TELEPHONENUM",
+ "15": "B-TIME",
+ "16": "B-ZIPCODE",
+ "17": "I-AGE",
+ "18": "I-BUILDINGNUM",
+ "19": "I-CITY",
+ "20": "I-CREDITCARDNUMBER",
+ "21": "I-DATE",
+ "22": "I-DRIVERLICENSENUM",
+ "23": "I-EMAIL",
+ "24": "I-GIVENNAME",
+ "25": "I-IDCARDNUM",
+ "26": "I-PASSPORTNUM",
+ "27": "I-STREET",
+ "28": "I-SURNAME",
+ "29": "I-TAXNUM",
+ "30": "I-TELEPHONENUM",
+ "31": "I-TIME",
+ "32": "I-ZIPCODE"
+ },
+ "label_to_id": {
+ "O": 0,
+ "B-AGE": 1,
+ "B-BUILDINGNUM": 2,
+ "B-CITY": 3,
+ "B-CREDITCARDNUMBER": 4,
+ "B-DATE": 5,
+ "B-DRIVERLICENSENUM": 6,
+ "B-EMAIL": 7,
+ "B-GIVENNAME": 8,
+ "B-IDCARDNUM": 9,
+ "B-PASSPORTNUM": 10,
+ "B-STREET": 11,
+ "B-SURNAME": 12,
+ "B-TAXNUM": 13,
+ "B-TELEPHONENUM": 14,
+ "B-TIME": 15,
+ "B-ZIPCODE": 16,
+ "I-AGE": 17,
+ "I-BUILDINGNUM": 18,
+ "I-CITY": 19,
+ "I-CREDITCARDNUMBER": 20,
+ "I-DATE": 21,
+ "I-DRIVERLICENSENUM": 22,
+ "I-EMAIL": 23,
+ "I-GIVENNAME": 24,
+ "I-IDCARDNUM": 25,
+ "I-PASSPORTNUM": 26,
+ "I-STREET": 27,
+ "I-SURNAME": 28,
+ "I-TAXNUM": 29,
+ "I-TELEPHONENUM": 30,
+ "I-TIME": 31,
+ "I-ZIPCODE": 32
+ },
+ "unique_labels": [
+ "AGE",
+ "BUILDINGNUM",
+ "CITY",
+ "CREDITCARDNUMBER",
+ "DATE",
+ "DRIVERLICENSENUM",
+ "EMAIL",
+ "GIVENNAME",
+ "IDCARDNUM",
+ "PASSPORTNUM",
+ "STREET",
+ "SURNAME",
+ "TAXNUM",
+ "TELEPHONENUM",
+ "TIME",
+ "ZIPCODE"
+ ]
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51b296ad6d7359d18118e4269b6e7d74e8e61d5d2d495935ffb12a7a1f92a835
+ size 1109937788
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "bos_token": "<s>",
+ "cls_token": "<s>",
+ "eos_token": "</s>",
+ "mask_token": {
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<pad>",
+ "sep_token": "</s>",
+ "unk_token": "<unk>"
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f59925fcb90c92b894cb93e51bb9b4a6105c5c249fe54ce1c704420ac39b81af
+ size 17082756
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "250001": {
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "bos_token": "<s>",
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "<s>",
+ "eos_token": "</s>",
+ "mask_token": "<mask>",
+ "model_max_length": 512,
+ "pad_token": "<pad>",
+ "sep_token": "</s>",
+ "tokenizer_class": "XLMRobertaTokenizer",
+ "unk_token": "<unk>"
+ }