---
license: mit
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---

# Classification of sentiment towards AI in snippets of news-related text

This model classifies the sentiment of a text snippet towards AI as "Positive" (2), "Negative" (0), or "Neutral/Mixed" (1).
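
For a quick spot check, you can also call the model directly through the `transformers` pipeline. A minimal sketch (the example sentence is made up; if the hosted config returns generic `LABEL_0`/`LABEL_1`/`LABEL_2` names, map them to the 0/1/2 codes above):

```python
from transformers import pipeline

# downloads the model from the Hugging Face Hub on first use
classifier = pipeline(
    "text-classification",
    model="cja5553/AI-perception-sentiments-roberta-large",
)

# returns the top label and its score for the snippet
print(classifier("AI systems can help doctors detect tumors earlier."))
```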

## Training Dataset

This model was fine-tuned on the [Long-Term Trends of Public Perception of Artificial Intelligence (AI)](https://ojs.aaai.org/index.php/aaai/article/view/10635) dataset, with [`FacebookAI/roberta-large`](https://huggingface.co/FacebookAI/roberta-large) serving as the base model.

The Long-Term Trends of Public Perception of Artificial Intelligence (AI) dataset captures nearly 30 years of public perception of AI. Annotators labeled 5,685 paragraphs extracted from AI-related New York Times (NYT) articles spanning 1986 to 2016. The annotated perceptions were condensed into three classes: "Positive", "Negative", and "Neutral/Mixed".

2,000 data points were sampled through continuous-time series clustering and used to fine-tune the model, mimicking the annotation and fine-tuning constraints typical of policy-related studies. (For the rationale behind this sampling strategy, see the paper cited below.)

## How to use the model

Here is some code to get you started with the model: it loads the model and tokenizer explicitly and classifies the sentiment of every row in a pandas DataFrame.

```python
import torch
import pandas as pd
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline


def classify_tweets(df, text_col, model, tokenizer):
    df[text_col] = df[text_col].astype(str)
    device = 0 if torch.cuda.is_available() else -1  # use GPU if available
    classifier = pipeline(
        "text-classification",
        model=model,
        tokenizer=tokenizer,
        device=device,
        truncation=True,       # ensures inputs don't exceed the model's max length
        max_length=512,        # RoBERTa's maximum sequence length
        padding="max_length",  # pads all inputs to the same length
    )

    probs, pred_labels = [], []
    for text in tqdm(df[text_col]):
        # top_k=None returns the scores for all three labels
        preds = classifier(text, top_k=None)
        label_scores = {entry["label"]: entry["score"] for entry in preds}
        probs.append(label_scores)
        pred_labels.append(max(label_scores, key=label_scores.get))

    df["predicted_label"] = pred_labels
    df["label_probabilities"] = probs
    return df


model_name = "cja5553/AI-perception-sentiments-roberta-large"
label2id = {"negative": 0, "neutral/mixed": 1, "positive": 2}
id2label = {v: k for k, v in label2id.items()}
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label2id),
    label2id=label2id,
    id2label=id2label,
).to("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)

df = pd.read_csv("your_data.csv")  # replace with your own DataFrame
text_col = "text"  # change the text column name accordingly
df_with_classification = classify_tweets(df, text_col, model, tokenizer)
```
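
If you need the integer codes from the header above (0, 1, 2) rather than label strings, the predictions can be mapped back through the `label2id` dictionary defined in the snippet; for example:

```python
# map "negative"/"neutral/mixed"/"positive" to 0/1/2
df_with_classification["predicted_id"] = df_with_classification["predicted_label"].map(label2id)
```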

## Citation

If you find this model useful, please cite the following paper:

```
@inproceedings{alba-etal-2025-towards,
  author    = {Alba, Charles and Warner, Benjamin C and Saxena, Akshar and Huang, Jiaxin and An, Ruopeng},
  title     = {Towards Robust Sentiment Analysis of Temporally-Sensitive Policy-Related Online Text},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 4: Student Research Workshop},
  year      = {2025},
  url       = {https://aclanthology.org/2025.acl-srw.70/}
}
```

## Code

The code used to train these models is available on GitHub at [github.com/cja5553/ctscams](https://github.com/cja5553/ctscams).

## Questions?

Contact me at alba@wustl.edu