---
license: mit
license_name: joke
license_link: LICENSE
datasets:
- ConquestAce/spotify-songs
pipeline_tag: audio-classification
tags:
- music
- spotify
- machine-learning
- music-prediction
- data-science
- regression
- classification
- popularity-analysis
---

# 🎵 Spotify Song Popularity Prediction

Predict the popularity of a song based on its audio features and estimate potential Spotify royalties.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

---

## 📖 Project Overview

This project explores machine learning models to predict the popularity of songs using publicly available features such as danceability, energy, tempo, and valence. It also demonstrates a prototype pricing tool that estimates potential Spotify revenue based on predicted popularity.

Despite the challenges in accurately forecasting popularity due to time-evolving factors, our models show that **minimum popularity and expected revenue can be estimated** using machine learning techniques.

---

## 📊 Dataset

- **Source**:  
  - Spotify Web API
  - Original Dataset (~114,000 songs) expanded to **~2 million songs**
- **Features**:  
  - Acoustic features (energy, danceability, valence, etc.)
  - Target variable: `popularity` (integer from 0–100)

---
 
## 🔬 Methods

- **Data Cleaning and Preparation**:
  - Removed zero-popularity entries, duplicates (~8% of rows), and outliers
  - Standardized genres using clustering
- **Exploratory Data Analysis (EDA)**:
  - Analyzed distributions, correlations, and cumulative trends
- **Modeling**:
  - Linear Regression, Ridge Regression
  - Decision Tree, Random Forest, AdaBoost (best recall: 86% on popular songs)
  - XGBoost (binning) and Neural Networks
- **Revenue Estimation**:
  - Quadratic regression fit between predicted popularity and play counts
  - Prototype pricing tool predicting Spotify revenue for songs
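
The cleaning step listed above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the helper name `clean_songs`, the column names, and the 1.5 × IQR outlier rule are all assumptions, since the README does not say how outliers were detected.

```python
import pandas as pd


def clean_songs(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Drop zero-popularity rows, exact duplicate rows, and IQR outliers.

    Hypothetical helper: the 1.5 * IQR rule is an assumed outlier criterion.
    """
    out = df[df["popularity"] > 0].drop_duplicates()
    for col in feature_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = out[out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return out.reset_index(drop=True)


# Tiny synthetic frame for illustration only (not real Spotify data).
songs = pd.DataFrame({
    "track": ["a", "b", "b", "c", "d", "e", "f"],
    "tempo": [120.0, 118.0, 118.0, 122.0, 900.0, 119.0, 121.0],
    "popularity": [40, 55, 55, 0, 30, 62, 48],
})

cleaned = clean_songs(songs, ["tempo"])
print(cleaned)
```

In this toy frame, track `c` is dropped for zero popularity, the second `b` row as a duplicate, and track `d` as a tempo outlier.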

---

## 🏆 Results

| Model                   | Highlights                                            |
|-------------------------|-------------------------------------------------------|
| Linear/Ridge Regression | Poor fit due to complex, noisy data                   |
| Random Forest           | Best overall stability (recall on popular songs)      |
| AdaBoost (weighted)     | **Best performance**: 86% recall for popular songs    |
| Neural Networks         | Struggled because the "popularity" target is unstable |

- Predicted revenue for a song with **popularity 55**: **\$357,000 CAD**.
- Pricing tool demonstrated practical viability despite prediction limitations.
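
To show how a recall figure for the "popular" class can be measured with a weighted AdaBoost model, here is a hedged sketch on synthetic data. The features, the rule that labels a song as popular, and the class weight of 5 are assumptions made for illustration; the README does not specify how the original weighting was done, and this sketch will not reproduce the 86% figure.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for audio features; label 1 marks a minority "popular" class.
X = rng.normal(size=(2000, 5))
noise = rng.normal(scale=0.5, size=2000)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 1.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Assumed weighting scheme: popular songs count five times as much during fitting.
weights = np.where(y_train == 1, 5.0, 1.0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)

# Recall on the popular class: the share of truly popular songs the model finds.
recall = recall_score(y_test, clf.predict(X_test))
print(f"Recall on popular songs: {recall:.2f}")
```

Raising the weight on the minority class usually trades precision for recall, which fits a use case where missing a popular song costs more than a false alarm.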

---

## 📈 Example

Predicting a song’s revenue based on its feature vector:

```python
# Example (simplified)
predicted_popularity = model.predict(features)
predicted_revenue = pricing_function(predicted_popularity)
```
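
A self-contained version of the snippet above might look like the sketch below. It assumes that `pricing_function` wraps a quadratic fit from popularity to play count, followed by a flat per-stream payout. The data points and the payout rate `PAYOUT_PER_STREAM` are invented for illustration and are not the project's values.

```python
import numpy as np

# Hypothetical (popularity, play count) pairs, for illustration only.
popularity = np.array([10, 20, 30, 40, 50, 60], dtype=float)
plays = np.array([1e4, 4e4, 9e4, 1.6e5, 2.5e5, 3.6e5])

# Quadratic fit: plays ~ a * popularity**2 + b * popularity + c
coeffs = np.polyfit(popularity, plays, deg=2)

PAYOUT_PER_STREAM = 0.004  # assumed rate per play, not a published figure


def pricing_function(predicted_popularity: float) -> float:
    """Estimate revenue as fitted play count times an assumed per-stream payout."""
    estimated_plays = np.polyval(coeffs, predicted_popularity)
    return float(max(estimated_plays, 0.0)) * PAYOUT_PER_STREAM


predicted_popularity = 55.0  # would come from model.predict(features)
predicted_revenue = pricing_function(predicted_popularity)
print(f"Estimated revenue: {predicted_revenue:,.2f}")
```

Clamping the fitted play count at zero keeps the estimate non-negative when the quadratic dips below zero at low popularity.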

---

## 🚀 How to Run

```bash
# Clone this repo
git clone https://huggingface.co/username/spotify-popularity-prediction

# Install dependencies
pip install -r requirements.txt

# Train or evaluate models
python train_models.py
python evaluate_models.py

# Predict song revenue
python pricing_tool.py
```

(The scripts can be adapted for different model types: AdaBoost, Random Forest, Neural Net.)

---

## 🤔 Limitations

- Song features alone are **not sufficient** for high-accuracy predictions.
- "Popularity" is a **time-dependent** and **dynamic** metric.
- Genre diversity (>5000 unique genres) complicated modeling.

---

## 🧠 Future Work

- Predict **play count** directly instead of popularity.
- Fine-tune **XGBoost** and **deep neural networks** on larger datasets.
- Integrate **time-evolution models** for dynamic popularity changes.
- Improve genre classification with unsupervised learning (e.g., genre embeddings).

---

# `popularity_predictor.pth`

This neural network model is extremely weak. I was not good at data science when I made this.

## Iterations
**null**: 

<details>
<summary><b>Trained for 500 epochs on 2.1 million songs from the Spotify database</b></summary>
  
```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd


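# Split the data into features and target variable.
# Assumes `df` (a pandas DataFrame) and `numerical_features` (a list of column
# names with 'popularity' last) were defined during earlier data preparation.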
X = df[numerical_features[:-1]].values  # all except popularity
y = df['popularity'].values

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train).view(-1, 1)  # shape to (N, 1)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test).view(-1, 1)

# Define the neural network model
class PopularityPredictor(nn.Module):
    def __init__(self):
        super(PopularityPredictor, self).__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        # The final layer maps to one popularity value, matching the (N, 1) target
        return self.fc4(x)

# Create an instance of the model
model = PopularityPredictor()

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    
    # Backward pass and optimization
    loss.backward()
    optimizer.step()
    
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

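# Evaluate the model on the held-out test set
model.eval()
with torch.no_grad():
    predicted = model(X_test_tensor)
    test_loss = criterion(predicted, y_test_tensor)
    print(f'Test MSE: {test_loss.item():.4f}')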
```

</details>
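
The training script computes test-set predictions; one way to turn them into readable error figures is sketched below. The helper `regression_metrics` is hypothetical, and the tensors are stand-in values rather than real model output. The shape check is there because `nn.MSELoss` broadcasts mismatched shapes instead of failing, which can hide a wrong output dimension.

```python
import torch


def regression_metrics(predicted: torch.Tensor, target: torch.Tensor) -> dict:
    """Return MAE and RMSE in popularity points (hypothetical helper)."""
    if predicted.shape != target.shape:
        raise ValueError(
            f"shape mismatch: {tuple(predicted.shape)} vs {tuple(target.shape)}"
        )
    err = predicted - target
    return {
        "mae": err.abs().mean().item(),
        "rmse": err.pow(2).mean().sqrt().item(),
    }


# Stand-in tensors shaped (N, 1), like `predicted` and `y_test_tensor` in the script.
predicted = torch.tensor([[50.0], [30.0], [70.0]])
y_test_tensor = torch.tensor([[40.0], [30.0], [90.0]])

metrics = regression_metrics(predicted, y_test_tensor)
print(metrics)
```

Because popularity is on a 0–100 scale, MAE and RMSE read directly as "points of popularity off", which is easier to interpret than raw MSE.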

## 📚 Citation

If you use this project, please cite:

```bibtex
@misc{bhuiyan2024spotify,
  title={Spotify Song Popularity Prediction},
  author={Ashiful Bhuiyan and Blanca Fernández Méndez and Nazanin Ghelichi and Pavle Curcin},
  year={2024},
  institution={York University},
}
```

---

## 🧑‍💻 Authors

- Ashiful Bhuiyan
- Blanca Elvira Fernández Méndez
- Nazanin Ghelichi
- Pavle Curcin

---
## 📄 License

This project is licensed under the [MIT License](LICENSE).

## 🏷 Tags
`#spotify` `#machine-learning` `#music-prediction` `#data-science` `#regression` `#classification` `#popularity-analysis`