vrashad committed on
Commit d68ede2 · verified · 1 Parent(s): b4065ec

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,216 +1,8 @@
- ---
- license: cc-by-4.0
- language:
- - az
- base_model:
- - FacebookAI/xlm-roberta-base
- pipeline_tag: token-classification
- tags:
- - personally identifiable information
- - pii
- - ner
- - azerbaijan
- ---
-
-
- # PII NER Azerbaijani v2
-
- **PII NER Azerbaijani** is a second version of fine-tuned Named Entity Recognition (NER) model (First version: <a target="_blank" href="https://huggingface.co/LocalDoc/private_ner_azerbaijani">PII NER Azerbaijani</a>) based on XLM-RoBERTa.
- It is trained on Azerbaijani pii data for classification personally identifiable information such as names, dates of birth, cities, addresses, and phone numbers from text.
-
- ## Model Details
-
- - **Base Model:** XLM-RoBERTa
- - **Training Metrics:**
-
- | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
- |-------|----------------|------------------|-----------|---------|----------|
- | 1 | 0.029100 | 0.025319 | 0.963367 | 0.962449| 0.962907 |
- | 2 | 0.019900 | 0.023291 | 0.964567 | 0.968474| 0.966517 |
- | 3 | 0.015400 | 0.018993 | 0.969536 | 0.967555| 0.968544 |
- | 4 | 0.012700 | 0.017730 | 0.971919 | 0.969768| 0.970842 |
- | 5 | 0.011100 | 0.018095 | 0.973056 | 0.970075| 0.971563 |
-
- - **Test Metrics:**
-
- - **Precision:** 0.9760
- - **Recall:** 0.9732
- - **F1 Score:** 0.9746
-
-
- ## Detailed Test Classification Report
-
- | Entity | Precision | Recall | F1-score | Support |
- |---------------------|-----------|--------|----------|---------|
- | AGE | 0.98 | 0.98 | 0.98 | 509 |
- | BUILDINGNUM | 0.97 | 0.75 | 0.85 | 1285 |
- | CITY | 1.00 | 1.00 | 1.00 | 2100 |
- | CREDITCARDNUMBER | 0.99 | 0.98 | 0.99 | 249 |
- | DATE | 0.85 | 0.92 | 0.88 | 1576 |
- | DRIVERLICENSENUM | 0.98 | 0.98 | 0.98 | 258 |
- | EMAIL | 0.98 | 1.00 | 0.99 | 1485 |
- | GIVENNAME | 0.99 | 1.00 | 0.99 | 9926 |
- | IDCARDNUM | 0.99 | 0.99 | 0.99 | 1174 |
- | PASSPORTNUM | 0.99 | 0.99 | 0.99 | 426 |
- | STREET | 0.94 | 0.98 | 0.96 | 1480 |
- | SURNAME | 1.00 | 1.00 | 1.00 | 3357 |
- | TAXNUM | 0.99 | 1.00 | 0.99 | 240 |
- | TELEPHONENUM | 0.97 | 0.95 | 0.96 | 2175 |
- | TIME | 0.96 | 0.96 | 0.96 | 2216 |
- | ZIPCODE | 0.97 | 0.97 | 0.97 | 520 |
-
-
- ### Averages
-
- | Metric | Precision | Recall | F1-score | Support |
- |---------------|-----------|--------|----------|---------|
- | **Micro avg** | 0.98 | 0.97 | 0.97 | 28976 |
- | **Macro avg** | 0.97 | 0.96 | 0.97 | 28976 |
- | **Weighted avg** | 0.98 | 0.97 | 0.97 | 28976 |
-
-
- ## A list of entities that the model is able to recognize.
-
- ```python
- [
- "AGE",
- "BUILDINGNUM",
- "CITY",
- "CREDITCARDNUMBER",
- "DATE",
- "DRIVERLICENSENUM",
- "EMAIL",
- "GIVENNAME",
- "IDCARDNUM",
- "PASSPORTNUM",
- "STREET",
- "SURNAME",
- "TAXNUM",
- "TELEPHONENUM",
- "TIME",
- "ZIPCODE"
- ]
- ```
-
- ## Usage
-
- To use the model for spell correction:
-
- ```python
- import torch
- from transformers import AutoTokenizer, AutoModelForTokenClassification
-
- model_id = "LocalDoc/private_ner_azerbaijani_v2"
-
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForTokenClassification.from_pretrained(model_id)
-
- test_text = (
-     "Salam, mənim adım Əli Hüseynovdur. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, Nizami küçəsində, 25/31 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir."
- )
-
- inputs = tokenizer(test_text, return_tensors="pt", return_offsets_mapping=True)
-
- offset_mapping = inputs.pop("offset_mapping")
-
- with torch.no_grad():
-     outputs = model(**inputs)
-
- predictions = torch.argmax(outputs.logits, dim=2)
-
- tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
- offset_mapping = offset_mapping[0].tolist()
- predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
- word_ids = inputs.word_ids(batch_index=0)
-
- aggregated = []
- prev_word_id = None
- for idx, word_id in enumerate(word_ids):
-     if word_id is None:
-         continue
-     if word_id != prev_word_id:
-         aggregated.append({
-             "word_id": word_id,
-             "tokens": [tokens[idx]],
-             "offsets": [offset_mapping[idx]],
-             "label": predicted_labels[idx]
-         })
-     else:
-         aggregated[-1]["tokens"].append(tokens[idx])
-         aggregated[-1]["offsets"].append(offset_mapping[idx])
-     prev_word_id = word_id
-
- entities = []
- current_entity = None
- for word in aggregated:
-     if word["label"] == "O":
-         if current_entity is not None:
-             entities.append(current_entity)
-             current_entity = None
-     else:
-         if current_entity is None:
-             current_entity = {
-                 "type": word["label"],
-                 "start": word["offsets"][0][0],
-                 "end": word["offsets"][-1][1]
-             }
-         else:
-             if word["label"] == current_entity["type"]:
-                 current_entity["end"] = word["offsets"][-1][1]
-             else:
-                 entities.append(current_entity)
-                 current_entity = {
-                     "type": word["label"],
-                     "start": word["offsets"][0][0],
-                     "end": word["offsets"][-1][1]
-                 }
- if current_entity is not None:
-     entities.append(current_entity)
-
- for entity in entities:
-     entity["text"] = test_text[entity["start"]:entity["end"]]
-
- for entity in entities:
-     print(entity)
- ```
-
- ```json
- {'type': 'FIRSTNAME', 'start': 18, 'end': 21, 'text': 'Əli'}
- {'type': 'LASTNAME', 'start': 22, 'end': 34, 'text': 'Hüseynovdur.'}
- {'type': 'DOB', 'start': 49, 'end': 64, 'text': '15.05.1990-dır.'}
- {'type': 'STREET', 'start': 81, 'end': 87, 'text': 'Nizami'}
- {'type': 'BUILDINGNUMBER', 'start': 99, 'end': 104, 'text': '25/31'}
- {'type': 'PHONENUMBER', 'start': 141, 'end': 159, 'text': '+994552345678-dir.'}
- ```
-
-
- ## CC BY 4.0 License — What It Allows
-
- The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license allows:
-
- ### ✅ You Can:
- - **Use** the model for any purpose, including commercial use.
- - **Share** it — copy and redistribute in any medium or format.
- - **Adapt** it — remix, transform, and build upon it for any purpose, even commercially.
-
- ### 📝 You Must:
- - **Give appropriate credit** — Attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
- - **Not imply endorsement** — Do not suggest the original author endorses you or your use.
-
- ### ❌ You Cannot:
- - Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).
-
-
- ### Summary:
- You are free to use, modify, and distribute the model — even for commercial purposes — as long as you give proper credit to the original creator.
-
-
- For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY-NC-ND 4.0 license</a>.
-
-
- ## Contact
-
- For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].
+ # Azerbaijani NER Model
+
+ This is a Named Entity Recognition model for Azerbaijani language based on XLM-RoBERTa.
+
+ ## Model details
+ - Base model: xlm-roberta-base
+ - Trained for Named Entity Recognition
+ - Language: Azerbaijani
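The replacement README above is terse; the model's output format is defined by the BIO label set in the `config.json` added in this same commit. As a minimal sketch (not the repo's own code) of how word-level BIO tags are typically merged into entity spans, assuming word-aligned labels and purely illustrative data:

```python
# Hypothetical post-processing sketch: group word-level BIO tags
# (the tag scheme used in this repo's config.json) into entity spans.
# The words and labels below are illustrative, not real model output.

def bio_to_spans(words, labels):
    """Collect (entity_type, text) spans from parallel word/label lists."""
    spans = []
    current_type, current_words = None, []
    for word, label in zip(words, labels):
        if label == "O":
            # Outside any entity: close the span in progress, if any.
            if current_type:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        elif label.startswith("B-") or label[2:] != current_type:
            # A B- tag, or an I- tag whose type changed: start a new span.
            if current_type:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [word]
        else:
            # I- tag continuing the current entity.
            current_words.append(word)
    if current_type:
        spans.append((current_type, " ".join(current_words)))
    return spans

words = ["Əli", "Hüseynov", "Bakı", "şəhərində", "yaşayır"]
labels = ["B-GIVENNAME", "B-SURNAME", "B-CITY", "O", "O"]
print(bio_to_spans(words, labels))
# → [('GIVENNAME', 'Əli'), ('SURNAME', 'Hüseynov'), ('CITY', 'Bakı')]
```

In practice the `transformers` token-classification pipeline performs equivalent aggregation; this sketch only makes the BIO convention behind the config's label table explicit.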
config.json ADDED
@@ -0,0 +1,98 @@
+ {
+ "_name_or_path": "./results/checkpoint-15080",
+ "architectures": [
+ "XLMRobertaForTokenClassification"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "bos_token_id": 0,
+ "classifier_dropout": null,
+ "eos_token_id": 2,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "id2label": {
+ "0": "O",
+ "1": "B-AGE",
+ "2": "B-BUILDINGNUM",
+ "3": "B-CITY",
+ "4": "B-CREDITCARDNUMBER",
+ "5": "B-DATE",
+ "6": "B-DRIVERLICENSENUM",
+ "7": "B-EMAIL",
+ "8": "B-GIVENNAME",
+ "9": "B-IDCARDNUM",
+ "10": "B-PASSPORTNUM",
+ "11": "B-STREET",
+ "12": "B-SURNAME",
+ "13": "B-TAXNUM",
+ "14": "B-TELEPHONENUM",
+ "15": "B-TIME",
+ "16": "B-ZIPCODE",
+ "17": "I-AGE",
+ "18": "I-BUILDINGNUM",
+ "19": "I-CITY",
+ "20": "I-CREDITCARDNUMBER",
+ "21": "I-DATE",
+ "22": "I-DRIVERLICENSENUM",
+ "23": "I-EMAIL",
+ "24": "I-GIVENNAME",
+ "25": "I-IDCARDNUM",
+ "26": "I-PASSPORTNUM",
+ "27": "I-STREET",
+ "28": "I-SURNAME",
+ "29": "I-TAXNUM",
+ "30": "I-TELEPHONENUM",
+ "31": "I-TIME",
+ "32": "I-ZIPCODE"
+ },
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "label2id": {
+ "B-AGE": 1,
+ "B-BUILDINGNUM": 2,
+ "B-CITY": 3,
+ "B-CREDITCARDNUMBER": 4,
+ "B-DATE": 5,
+ "B-DRIVERLICENSENUM": 6,
+ "B-EMAIL": 7,
+ "B-GIVENNAME": 8,
+ "B-IDCARDNUM": 9,
+ "B-PASSPORTNUM": 10,
+ "B-STREET": 11,
+ "B-SURNAME": 12,
+ "B-TAXNUM": 13,
+ "B-TELEPHONENUM": 14,
+ "B-TIME": 15,
+ "B-ZIPCODE": 16,
+ "I-AGE": 17,
+ "I-BUILDINGNUM": 18,
+ "I-CITY": 19,
+ "I-CREDITCARDNUMBER": 20,
+ "I-DATE": 21,
+ "I-DRIVERLICENSENUM": 22,
+ "I-EMAIL": 23,
+ "I-GIVENNAME": 24,
+ "I-IDCARDNUM": 25,
+ "I-PASSPORTNUM": 26,
+ "I-STREET": 27,
+ "I-SURNAME": 28,
+ "I-TAXNUM": 29,
+ "I-TELEPHONENUM": 30,
+ "I-TIME": 31,
+ "I-ZIPCODE": 32,
+ "O": 0
+ },
+ "layer_norm_eps": 1e-05,
+ "max_position_embeddings": 514,
+ "model_type": "xlm-roberta",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "output_past": true,
+ "pad_token_id": 1,
+ "position_embedding_type": "absolute",
+ "torch_dtype": "float32",
+ "transformers_version": "4.38.0",
+ "type_vocab_size": 1,
+ "use_cache": true,
+ "vocab_size": 250002
+ }
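The `id2label` table in this config follows a regular layout: id 0 is `O`, ids 1-16 are the `B-` tags, and ids 17-32 are the matching `I-` tags, in the alphabetical entity order listed. A small sketch (with a hypothetical id sequence, not real model output) that rebuilds the table from that layout and decodes predicted class ids:

```python
# Sketch: reconstruct config.json's id2label layout and use it to
# decode a hypothetical sequence of per-token argmax class ids.

ENTITIES = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER", "DATE",
    "DRIVERLICENSENUM", "EMAIL", "GIVENNAME", "IDCARDNUM",
    "PASSPORTNUM", "STREET", "SURNAME", "TAXNUM", "TELEPHONENUM",
    "TIME", "ZIPCODE",
]

id2label = {0: "O"}
for i, name in enumerate(ENTITIES):
    id2label[1 + i] = f"B-{name}"    # ids 1-16: begin tags
    id2label[17 + i] = f"I-{name}"   # ids 17-32: inside tags

# Illustrative (not real) per-token class ids, e.g. from logits.argmax(-1).
pred_ids = [0, 8, 24, 12, 0]
tags = [id2label[i] for i in pred_ids]
print(tags)  # → ['O', 'B-GIVENNAME', 'I-GIVENNAME', 'B-SURNAME', 'O']
```

When loading the model through `transformers`, the same mapping is available directly as `model.config.id2label` (with string keys in the raw JSON).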
label_mapping.json ADDED
@@ -0,0 +1,90 @@
+ {
+ "id_to_label": {
+ "0": "O",
+ "1": "B-AGE",
+ "2": "B-BUILDINGNUM",
+ "3": "B-CITY",
+ "4": "B-CREDITCARDNUMBER",
+ "5": "B-DATE",
+ "6": "B-DRIVERLICENSENUM",
+ "7": "B-EMAIL",
+ "8": "B-GIVENNAME",
+ "9": "B-IDCARDNUM",
+ "10": "B-PASSPORTNUM",
+ "11": "B-STREET",
+ "12": "B-SURNAME",
+ "13": "B-TAXNUM",
+ "14": "B-TELEPHONENUM",
+ "15": "B-TIME",
+ "16": "B-ZIPCODE",
+ "17": "I-AGE",
+ "18": "I-BUILDINGNUM",
+ "19": "I-CITY",
+ "20": "I-CREDITCARDNUMBER",
+ "21": "I-DATE",
+ "22": "I-DRIVERLICENSENUM",
+ "23": "I-EMAIL",
+ "24": "I-GIVENNAME",
+ "25": "I-IDCARDNUM",
+ "26": "I-PASSPORTNUM",
+ "27": "I-STREET",
+ "28": "I-SURNAME",
+ "29": "I-TAXNUM",
+ "30": "I-TELEPHONENUM",
+ "31": "I-TIME",
+ "32": "I-ZIPCODE"
+ },
+ "label_to_id": {
+ "O": 0,
+ "B-AGE": 1,
+ "B-BUILDINGNUM": 2,
+ "B-CITY": 3,
+ "B-CREDITCARDNUMBER": 4,
+ "B-DATE": 5,
+ "B-DRIVERLICENSENUM": 6,
+ "B-EMAIL": 7,
+ "B-GIVENNAME": 8,
+ "B-IDCARDNUM": 9,
+ "B-PASSPORTNUM": 10,
+ "B-STREET": 11,
+ "B-SURNAME": 12,
+ "B-TAXNUM": 13,
+ "B-TELEPHONENUM": 14,
+ "B-TIME": 15,
+ "B-ZIPCODE": 16,
+ "I-AGE": 17,
+ "I-BUILDINGNUM": 18,
+ "I-CITY": 19,
+ "I-CREDITCARDNUMBER": 20,
+ "I-DATE": 21,
+ "I-DRIVERLICENSENUM": 22,
+ "I-EMAIL": 23,
+ "I-GIVENNAME": 24,
+ "I-IDCARDNUM": 25,
+ "I-PASSPORTNUM": 26,
+ "I-STREET": 27,
+ "I-SURNAME": 28,
+ "I-TAXNUM": 29,
+ "I-TELEPHONENUM": 30,
+ "I-TIME": 31,
+ "I-ZIPCODE": 32
+ },
+ "unique_labels": [
+ "AGE",
+ "BUILDINGNUM",
+ "CITY",
+ "CREDITCARDNUMBER",
+ "DATE",
+ "DRIVERLICENSENUM",
+ "EMAIL",
+ "GIVENNAME",
+ "IDCARDNUM",
+ "PASSPORTNUM",
+ "STREET",
+ "SURNAME",
+ "TAXNUM",
+ "TELEPHONENUM",
+ "TIME",
+ "ZIPCODE"
+ ]
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51b296ad6d7359d18118e4269b6e7d74e8e61d5d2d495935ffb12a7a1f92a835
+ size 1109937788
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "bos_token": "<s>",
+ "cls_token": "<s>",
+ "eos_token": "</s>",
+ "mask_token": {
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<pad>",
+ "sep_token": "</s>",
+ "unk_token": "<unk>"
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f59925fcb90c92b894cb93e51bb9b4a6105c5c249fe54ce1c704420ac39b81af
+ size 17082756
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "250001": {
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "bos_token": "<s>",
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "<s>",
+ "eos_token": "</s>",
+ "mask_token": "<mask>",
+ "model_max_length": 512,
+ "pad_token": "<pad>",
+ "sep_token": "</s>",
+ "tokenizer_class": "XLMRobertaTokenizer",
+ "unk_token": "<unk>"
+ }