---
license: openrail
---

# StackOverflow-RoBERTa-base for Sentiment Analysis on Software Engineering Texts

This is a RoBERTa-base model for sentiment analysis on software engineering texts. It was further fine-tuned from [cardiffnlp/twitter-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment) on the [StackOverflow4423](https://arxiv.org/abs/1709.02984) dataset. A demo is available [here](https://huggingface.co/spaces/Cloudy1225/stackoverflow-sentiment-analysis).

## Example of Pipeline


```python
from transformers import pipeline

MODEL = 'Cloudy1225/stackoverflow-roberta-base-sentiment'
sentiment_task = pipeline(task="sentiment-analysis", model=MODEL)
sentiment_task(["Excellent, happy to help!",
                "This can probably be done using JavaScript.",
                "Yes, but it's tricky, since datetime parsing in SQL is a pain in the neck."])
```

    [{'label': 'positive', 'score': 0.9997847676277161},
     {'label': 'neutral', 'score': 0.999783456325531},
     {'label': 'negative', 'score': 0.9996368885040283}]
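
The pipeline returns one dictionary per input, in input order. As a minimal sketch of post-processing, the snippet below pairs each input with its predicted label and drops low-confidence results. It uses the output above as literal data, so it runs without downloading the model. The helper name `confident_labels` and the `threshold` value are illustrative assumptions, not part of the model or the `transformers` API.

```python
# Inputs and results copied from the pipeline example above.
texts = ["Excellent, happy to help!",
         "This can probably be done using JavaScript.",
         "Yes, but it's tricky, since datetime parsing in SQL is a pain in the neck."]
results = [{'label': 'positive', 'score': 0.9997847676277161},
           {'label': 'neutral', 'score': 0.999783456325531},
           {'label': 'negative', 'score': 0.9996368885040283}]

def confident_labels(texts, results, threshold=0.9):
    """Pair each text with its label, keeping only scores >= threshold (hypothetical helper)."""
    return [(text, r['label'])
            for text, r in zip(texts, results)
            if r['score'] >= threshold]

for text, label in confident_labels(texts, results):
    print(f"{label:>8}: {text}")
```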



## Example of Classification


```python
from scipy.special import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def preprocess(text):
    """Preprocess text (username and link placeholders)"""
    new_text = []
    for t in text.split(' '):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return ' '.join(new_text).strip()

MODEL = 'Cloudy1225/stackoverflow-roberta-base-sentiment'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

text = "Excellent, happy to help!"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output[0] holds the logits with shape (batch, 3); take the first (only) example
scores = output[0][0].detach().numpy()
# Convert logits to probabilities over the three classes
scores = softmax(scores)
print("negative", scores[0])
print("neutral", scores[1])
print("positive", scores[2])
```

    negative 0.00015578205
    neutral 5.9470447e-05
    positive 0.99978495
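
To make the two steps around the model call easier to see, here is a small sketch that runs without downloading the model. It repeats the `preprocess` function from above on a sample input containing a mention and a link, then picks the most likely class from the probabilities printed above. The `LABELS` list simply mirrors the index order of the `print` statements in the example, and `top_label` is a hypothetical helper introduced for illustration.

```python
def preprocess(text):
    """Replace @mentions with '@user' and links with 'http' (same logic as above)."""
    new_text = []
    for t in text.split(' '):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return ' '.join(new_text).strip()

# Assumed index order, taken from the print statements in the example above.
LABELS = ['negative', 'neutral', 'positive']

def top_label(scores):
    """Return (label, probability) for the highest-scoring class (hypothetical helper)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best], float(scores[best])

print(preprocess("@alice see https://example.com/docs for details"))
# prints: @user see http for details

print(top_label([0.00015578205, 5.9470447e-05, 0.99978495]))
# prints: ('positive', 0.99978495)
```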



## Acknowledgments

This project was developed as part of the **Software Engineering and Computing III** course at the Software Institute, Nanjing University, in Spring 2023. For more on sentiment analysis of software engineering texts, see the following paper:

```
@inproceedings{sun2022incorporating,
  title={Incorporating Pre-trained Transformer Models into TextCNN for Sentiment Analysis on Software Engineering Texts},
  author={Sun, Kexin and Shi, Xiaobo and Gao, Hui and Kuang, Hongyu and Ma, Xiaoxing and Rong, Guoping and Shao, Dong and Zhao, Zheng and Zhang, He},
  booktitle={Proceedings of the 13th Asia-Pacific Symposium on Internetware},
  pages={127--136},
  year={2022}
}
```