Upload folder using huggingface_hub


Files changed (10)

README.md +199 -0
config.json +52 -0
merges.txt +0 -0
model.safetensors +3 -0
special_tokens_map.json +15 -0
test_results.json +89 -0
tokenizer.json +0 -0
tokenizer_config.json +58 -0
training_args.bin +3 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+base_model: roberta-base
+language:
+  - en
+license: apache-2.0
+tags:
+  - text
+  - token-classification
+  - named-entity-recognition
+  - encoder-only
+  - roberta
+  - fine-tuned
+  - domain-specific
+metrics:
+  - seqeval
+model-index:
+  - name: roberta-base-group-mention-detector-uk-manifestos
+    results:
+    - task:
+        type: token-classification
+        name: Token classification
+      dataset:
+        type: custom
+        name: custom human-labeled sequence annotation dataset (see model card details)
+      metrics:
+        - type: seqeval
+          name: social group (seqeval)
+          value: 0.7129859387923904
+        - type: seqeval
+          name: political group (seqeval)
+          value: 0.9230769230769231
+        - type: seqeval
+          name: political institution (seqeval)
+          value: 0.711779448621554
+        - type: seqeval
+          name: organization, public institution, or collective actor (seqeval)
+          value: 0.6354009077155824
+        - type: seqeval
+          name: implicit social group reference (seqeval)
+          value: 0.6906077348066298
+---
+# roberta-base-group-mention-detector-uk-manifestos
+<!-- Provide a quick summary of what the model is/does. -->
+[roberta-base](https://huggingface.co/roberta-base) model fine-tuned for social group mention detection in political texts.
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+Token classification model for (social) group mention detection based on [Licht & Sczepanski (2025)](https://doi.org/10.31219/osf.io/ufb96).
+This token classification model has been fine-tuned on human sequence annotations of sentences from British parties' election manifestos for the following entity types:
+- social group
+- implicit social group reference
+- political group
+- political institution
+- organization, public institution, or collective actor
+Please refer to [Licht & Sczepanski (2025)](https://doi.org/10.31219/osf.io/ufb96) for details.
+- **Developed by:** Hauke Licht
+- **Model type:** roberta
+- **Language(s) (NLP):** ['en']
+- **License:** apache-2.0
+- **Finetuned from model:** roberta-base
+- **Funded by:** *Center for Comparative and International Studies* of the ETH Zurich and the University of Zurich, and the *Deutsche Forschungsgemeinschaft* (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2126/1 – 390838866
+### Model Sources
+<!-- Provide the basic links for the model. -->
+- **Repository:** https://github.com/haukelicht/group_mention_detection/release/
+- **Paper:** https://doi.org/10.31219/osf.io/ufb96
+- **Demo:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+- Evaluation of the classifier on held-out data shows that it makes mistakes (see section *Results*).
+- The model has been fine-tuned only on human-annotated sentences sampled from British parties' election manifestos. Applying the classifier in other domains can lead to higher error rates than those reported in section *Results* below.
+- The data used to fine-tune the model come from human annotators. Human annotators can be biased, and factors like gender and social background can affect their annotation judgments. This may lead to bias in the detection of specific social groups.
+#### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+- Users who want to apply the model outside its training data domain (British parties' election programs) should evaluate its performance in the target data.
+- Users who want to apply the model outside its training data domain (British parties' election programs) should continue fine-tuning this model on labeled data from the target domain.
+### How to Get Started with the Model
+Use the code below to get started with the model.
+```python
+from transformers import pipeline
+
+model_id = "haukelicht/roberta-base-group-mention-detector-uk-manifestos"
+# aggregation_strategy="simple" groups token-level predictions into entity spans
+classifier = pipeline(task="ner", model=model_id, aggregation_strategy="simple")
+
+text = "Our party fights for the deprived and the vulnerable in our country."
+annotations = classifier(text)
+print(annotations)
+
+# get annotations' character start and end indexes
+locations = [(anno["start"], anno["end"]) for anno in annotations]
+print(locations)
+
+# index the source text using the first annotation as an example
+if locations:
+    start, end = locations[0]
+    print(text[start:end])
+```
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+The train, dev, and test splits used for model fine-tuning and evaluation are available on GitHub: https://github.com/haukelicht/group_mention_detection/release/splits
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Training Hyperparameters
+- epochs: 6
+- learning rate: 5e-05
+- batch size: 16
+- weight decay: 0.3
+- warmup ratio: 0.1
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+The train, dev, and test splits used for model fine-tuning and evaluation are available on GitHub: https://github.com/haukelicht/group_mention_detection/release/splits
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+- seq-eval F1: strict sequence labeling evaluation metric per the CoNLL-2000 shared task, computed with https://github.com/chakki-works/seqeval
+- "soft" seq-eval F1: a more lenient sequence labeling evaluation metric that reports span-level average performance summarized across examples, per https://github.com/haukelicht/soft-seqeval
+- sentence-level F1: binary measure of detection performance that treats a sentence as a positive example/prediction if it contains at least one entity of the given type
+### Results
+|                                                  type |  seq-eval F1  |  soft seq-eval F1   |  sentence-level F1   |
+|-------------------------------------------------------|---------------|---------------------|----------------------|
+|                                          social group |     0.713     |        0.766        |        0.933         |
+|                                       political group |     0.923     |        0.937        |        0.991         |
+|                                 political institution |     0.712     |        0.723        |        0.951         |
+| organization, public institution, or collective actor |     0.635     |        0.605        |        0.932         |
+|                       implicit social group reference |     0.691     |        0.593        |        0.950         |
+## Citation
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+Licht, H., & Sczepanski, R. (2025). Detecting Group Mentions in Political Rhetoric: A Supervised Learning Approach. Forthcoming in *British Journal of Political Science*. Preprint available at [OSF](https://doi.org/10.31219/osf.io/ufb96).
+## More Information
+https://github.com/haukelicht/group_mention_detection/release
+## Model Card Contact
+hauke.licht@uibk.ac.at
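
Note on the README's *Metrics* section: the sentence-level F1 is described only in prose, so a small worked example may help. The sketch below is one reading of that description, not the author's evaluation code. The function names and the toy tag sequences are hypothetical, and the reported scores may be aggregated differently.

```python
from typing import List, Sequence


def sentence_has_type(tags: Sequence[str], entity_type: str) -> bool:
    """True if any B-/I- tag in the sentence carries the given entity type."""
    # Split on the first "-" only: type names may contain spaces and commas.
    return any(tag.split("-", 1)[1] == entity_type for tag in tags if tag != "O")


def sentence_level_f1(
    gold: List[Sequence[str]], pred: List[Sequence[str]], entity_type: str
) -> float:
    """Binary F1 where each sentence counts as one positive or negative example."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_pos = sentence_has_type(gold_tags, entity_type)
        pred_pos = sentence_has_type(pred_tags, entity_type)
        tp += int(gold_pos and pred_pos)
        fp += int(pred_pos and not gold_pos)
        fn += int(gold_pos and not pred_pos)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: four sentences with word-level tags.
gold = [
    ["B-social group", "I-social group", "O"],  # positive
    ["O", "O"],                                 # negative
    ["O", "B-social group"],                    # positive
    ["O", "O"],                                 # negative
]
pred = [
    ["O", "B-social group", "O"],  # wrong span boundaries, but still a hit
    ["B-social group", "O"],       # false positive
    ["O", "O"],                    # false negative
    ["O", "O"],                    # true negative
]

score = sentence_level_f1(gold, pred, "social group")
print(score)
```

In the first toy sentence the predicted span boundaries are wrong, which a strict span-level metric would penalize, yet the sentence still counts as a correct detection here. This is consistent with the sentence-level column in the results table being the highest of the three.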

config.json ADDED

	@@ -0,0 +1,52 @@

+{
+  "architectures": [
+    "RobertaForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "O",
+    "1": "I-social group",
+    "2": "I-political group",
+    "3": "I-political institution",
+    "4": "I-organization, public institution, or collective actor",
+    "5": "I-implicit social group reference",
+    "6": "B-social group",
+    "7": "B-political group",
+    "8": "B-political institution",
+    "9": "B-organization, public institution, or collective actor",
+    "10": "B-implicit social group reference"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "B-implicit social group reference": 10,
+    "B-organization, public institution, or collective actor": 9,
+    "B-political group": 7,
+    "B-political institution": 8,
+    "B-social group": 6,
+    "I-implicit social group reference": 5,
+    "I-organization, public institution, or collective actor": 4,
+    "I-political group": 2,
+    "I-political institution": 3,
+    "I-social group": 1,
+    "O": 0
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.51.3",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 50265
+}
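
The `id2label` mapping above can be used to turn per-word label ids into typed spans. The following is a minimal sketch under stated assumptions: `decode_spans` is a hypothetical helper, the example id sequence is made up rather than real model output, and an `I-` tag without a preceding `B-` is leniently treated as the start of a new span.

```python
from typing import List, Optional, Tuple

# Copied from the id2label block in config.json, with integer keys.
ID2LABEL = {
    0: "O",
    1: "I-social group",
    2: "I-political group",
    3: "I-political institution",
    4: "I-organization, public institution, or collective actor",
    5: "I-implicit social group reference",
    6: "B-social group",
    7: "B-political group",
    8: "B-political institution",
    9: "B-organization, public institution, or collective actor",
    10: "B-implicit social group reference",
}


def decode_spans(label_ids: List[int]) -> List[Tuple[str, int, int]]:
    """Group word-level label ids into (entity type, start, end) spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    current: Optional[list] = None  # [entity type, start, end]
    for i, label_id in enumerate(label_ids):
        label = ID2LABEL[label_id]
        if label == "O":
            if current is not None:
                spans.append(tuple(current))
                current = None
            continue
        # Split on the first "-" only: type names contain spaces and commas.
        prefix, entity_type = label.split("-", 1)
        if prefix == "B" or current is None or current[0] != entity_type:
            if current is not None:
                spans.append(tuple(current))
            current = [entity_type, i, i + 1]
        else:
            current[2] = i + 1  # extend the open span
    if current is not None:
        spans.append(tuple(current))
    return spans


# Hypothetical ids for the words: Our party fights for the deprived and the vulnerable
example_ids = [0, 0, 0, 0, 6, 1, 0, 6, 1]
print(decode_spans(example_ids))
```

Because the entity type names themselves contain commas and spaces, splitting on the first hyphen only is the safe way to separate the `B`/`I` prefix from the type.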

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8ed76ec30828f1858cdc3b0732e69813a4dcb5ef280c07bc173f1fb678895589
+size 496277924

special_tokens_map.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "unk_token": "<unk>"
+}

test_results.json ADDED

	@@ -0,0 +1,89 @@

+{
+  "test_loss": 0.23216606676578522,
+  "test_seqeval-macro_f1": 0.734770190602616,
+  "test_seqeval-macro_precision": 0.710234593555044,
+  "test_seqeval-macro_recall": 0.761660613652209,
+  "test_seqeval-micro_f1": 0.730981256890849,
+  "test_seqeval-micro_precision": 0.7041954328199681,
+  "test_seqeval-micro_recall": 0.7598853868194843,
+  "test_seqeval-social group_f1": 0.7129859387923904,
+  "test_seqeval-social group_precision": 0.6798107255520505,
+  "test_seqeval-social group_recall": 0.7495652173913043,
+  "test_seqeval-political group_f1": 0.9230769230769231,
+  "test_seqeval-political group_precision": 0.9139072847682119,
+  "test_seqeval-political group_recall": 0.9324324324324325,
+  "test_seqeval-organization, public institution, or collective actor_f1": 0.6354009077155824,
+  "test_seqeval-organization, public institution, or collective actor_precision": 0.5982905982905983,
+  "test_seqeval-organization, public institution, or collective actor_recall": 0.6774193548387096,
+  "test_seqeval-political institution_f1": 0.711779448621554,
+  "test_seqeval-political institution_precision": 0.6977886977886978,
+  "test_seqeval-political institution_recall": 0.7263427109974424,
+  "test_seqeval-implicit social group reference_f1": 0.6906077348066298,
+  "test_seqeval-implicit social group reference_precision": 0.6613756613756614,
+  "test_seqeval-implicit social group reference_recall": 0.7225433526011561,
+  "test_softseqeval-macro_f1": 0.7249429563665621,
+  "test_softseqeval-macro_precision": 0.7351167276186958,
+  "test_softseqeval-macro_recall": 0.7272877579870108,
+  "test_softseqeval-micro_f1": 0.8150108141222834,
+  "test_softseqeval-micro_precision": 0.8276780140092342,
+  "test_softseqeval-micro_recall": 0.8213445621409171,
+  "test_softseqeval-social group_f1": 0.7663127580837202,
+  "test_softseqeval-social group_precision": 0.7806733266733267,
+  "test_softseqeval-social group_recall": 0.7762571424302194,
+  "test_softseqeval-political group_f1": 0.9366854822737176,
+  "test_softseqeval-political group_precision": 0.9389978213507625,
+  "test_softseqeval-political group_recall": 0.9393246187363836,
+  "test_softseqeval-organization, public institution, or collective actor_f1": 0.6052204342608383,
+  "test_softseqeval-organization, public institution, or collective actor_precision": 0.622174122174122,
+  "test_softseqeval-organization, public institution, or collective actor_recall": 0.600578403078403,
+  "test_softseqeval-political institution_f1": 0.7231932152815058,
+  "test_softseqeval-political institution_precision": 0.7340427818983123,
+  "test_softseqeval-political institution_recall": 0.7262908022501702,
+  "test_softseqeval-implicit social group reference_f1": 0.593302891933029,
+  "test_softseqeval-implicit social group reference_precision": 0.5996955859969558,
+  "test_softseqeval-implicit social group reference_recall": 0.5939878234398781,
+  "test_doclevel-micro_precision": 0.9473684210526315,
+  "test_doclevel-micro_recall": 0.9473684210526315,
+  "test_doclevel-micro_f1": 0.9473684210526315,
+  "test_doclevel-social group_precision": 0.9330143540669856,
+  "test_doclevel-social group_recall": 0.9330143540669856,
+  "test_doclevel-social group_f1": 0.9330143540669856,
+  "test_doclevel-political group_precision": 0.9911141490088858,
+  "test_doclevel-political group_recall": 0.9911141490088858,
+  "test_doclevel-political group_f1": 0.9911141490088858,
+  "test_doclevel-organization, public institution, or collective actor_precision": 0.9316473000683527,
+  "test_doclevel-organization, public institution, or collective actor_recall": 0.9316473000683527,
+  "test_doclevel-organization, public institution, or collective actor_f1": 0.9316473000683527,
+  "test_doclevel-political institution_precision": 0.950786056049214,
+  "test_doclevel-political institution_recall": 0.950786056049214,
+  "test_doclevel-political institution_f1": 0.950786056049214,
+  "test_doclevel-implicit social group reference_precision": 0.9501025290498974,
+  "test_doclevel-implicit social group reference_recall": 0.9501025290498974,
+  "test_doclevel-implicit social group reference_f1": 0.9501025290498974,
+  "test_wordlevel-accuracy": 0.9565914819785903,
+  "test_wordlevel-macro_f1": 0.835518946077031,
+  "test_wordlevel-macro_precision": 0.826862620530972,
+  "test_wordlevel-macro_recall": 0.8452987076293729,
+  "test_wordlevel-O_f1": 0.9787854680456113,
+  "test_wordlevel-O_precision": 0.9808290942221547,
+  "test_wordlevel-O_recall": 0.9767503402389234,
+  "test_wordlevel-social group_f1": 0.8289473684210527,
+  "test_wordlevel-social group_precision": 0.8054474708171206,
+  "test_wordlevel-social group_recall": 0.8538597525044196,
+  "test_wordlevel-political group_f1": 0.9562563580874873,
+  "test_wordlevel-political group_precision": 0.9475806451612904,
+  "test_wordlevel-political group_recall": 0.9650924024640657,
+  "test_wordlevel-organization, public institution, or collective actor_f1": 0.7248968363136176,
+  "test_wordlevel-organization, public institution, or collective actor_precision": 0.7140921409214093,
+  "test_wordlevel-organization, public institution, or collective actor_recall": 0.7360335195530726,
+  "test_wordlevel-political institution_f1": 0.8176855895196506,
+  "test_wordlevel-political institution_precision": 0.8406285072951739,
+  "test_wordlevel-political institution_recall": 0.79596174282678,
+  "test_wordlevel-implicit social group reference_f1": 0.7065420560747664,
+  "test_wordlevel-implicit social group reference_precision": 0.6725978647686833,
+  "test_wordlevel-implicit social group reference_recall": 0.7440944881889764,
+  "test_runtime": 6.5587,
+  "test_samples_per_second": 223.061,
+  "test_steps_per_second": 7.014,
+  "epoch": 6.0
+}
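
As a quick sanity check on the strict seqeval numbers above, each per-type F1 can be recomputed as the harmonic mean of its precision and recall, and the macro F1 as the unweighted mean of the per-type F1 scores. The sketch below copies the values from the `test_seqeval-*` keys. The helper name `f1_score` is hypothetical, and the averaging convention is inferred from the numbers, not taken from the evaluation code.

```python
from statistics import mean


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F1) copied from the test_seqeval-* keys.
SEQEVAL = {
    "social group": (0.6798107255520505, 0.7495652173913043, 0.7129859387923904),
    "political group": (0.9139072847682119, 0.9324324324324325, 0.9230769230769231),
    "organization, public institution, or collective actor": (
        0.5982905982905983, 0.6774193548387096, 0.6354009077155824),
    "political institution": (0.6977886977886978, 0.7263427109974424, 0.711779448621554),
    "implicit social group reference": (
        0.6613756613756614, 0.7225433526011561, 0.6906077348066298),
}

# Recompute each per-type F1 and compare with the reported value.
recomputed = {etype: f1_score(p, r) for etype, (p, r, _) in SEQEVAL.items()}
max_abs_diff = max(abs(recomputed[etype] - SEQEVAL[etype][2]) for etype in SEQEVAL)

# Macro F1 as the unweighted mean of the reported per-type F1 scores.
macro_f1 = mean(reported for _, _, reported in SEQEVAL.values())

print(f"largest |recomputed - reported| F1 gap: {max_abs_diff:.2e}")
print(f"macro F1 from per-type F1 scores: {macro_f1:.6f}")
```

The recomputed macro value agrees with `test_seqeval-macro_f1` to within floating-point error, which suggests the macro score averages per-type F1 values and is not derived from the macro precision and recall.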

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,58 @@

+{
+  "add_prefix_space": true,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50264": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "RobertaTokenizer",
+  "trim_offsets": true,
+  "unk_token": "<unk>"
+}

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9ad307dab848d137b796c404bf75f895a963afd5f5a296b4cc3932da8c6b942f
+size 5777

vocab.json ADDED

The diff for this file is too large to render. See raw diff