---
license: mit
license_name: joke
license_link: LICENSE
datasets:
- ConquestAce/spotify-songs
pipeline_tag: audio-classification
tags:
- music
- spotify
- machine-learning
- music-prediction
- data-science
- regression
- classification
- popularity-analysis
---
# 🎵 Spotify Song Popularity Prediction
Predict the popularity of a song based on its audio features and estimate potential Spotify royalties.
---
## 📖 Project Overview
This project explores machine learning models to predict the popularity of songs using publicly available features such as danceability, energy, tempo, and valence. It also demonstrates a prototype pricing tool that estimates potential Spotify revenue based on predicted popularity.
Popularity is hard to forecast accurately because it changes over time. Even so, our models show that a song's **minimum popularity and expected revenue can be estimated** using machine learning techniques.
---
## 📊 Dataset
- **Source**:
  - Spotify Web API
  - Original dataset (~114,000 songs) expanded to **~2 million songs**
- **Features**:
  - Acoustic features (energy, danceability, valence, etc.)
  - Target variable: `popularity` (integer from 0–100)
---
## 🔬 Methods
- **Data Cleaning and Preparation**:
  - Removed zero-popularity entries, duplicates (~8% of rows), and outliers
  - Standardized genres using clustering
- **Exploratory Data Analysis (EDA)**:
  - Analyzed distributions, correlations, and cumulative trends
- **Modeling**:
  - Linear Regression, Ridge Regression
  - Decision Tree, Random Forest, AdaBoost (best recall: 86% on popular songs)
  - XGBoost (binning) and Neural Networks
- **Revenue Estimation**:
  - Quadratic regression fit between predicted popularity and play counts
  - Prototype pricing tool predicting Spotify revenue for songs
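
The cleaning steps listed above can be sketched with pandas. This is a minimal illustration, not the project's actual pipeline: the column names (`track_id`, `popularity`, `tempo`), the tiny in-memory frame, and the 1.5×IQR outlier rule are all assumptions made for the example.

```python
import pandas as pd


def clean_songs(df: pd.DataFrame, outlier_col: str = "tempo") -> pd.DataFrame:
    """Drop zero-popularity rows, duplicate tracks, and IQR outliers (assumed rule)."""
    # 1. Remove songs whose popularity is zero.
    df = df[df["popularity"] > 0]
    # 2. Remove duplicate tracks, keeping the first occurrence.
    df = df.drop_duplicates(subset="track_id")
    # 3. Remove outliers in one numeric column using the 1.5 * IQR rule.
    q1, q3 = df[outlier_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df[outlier_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[in_range].reset_index(drop=True)


# Hypothetical sample rows, made up for illustration.
songs = pd.DataFrame({
    "track_id": ["a", "b", "b", "c", "d", "e", "f"],
    "popularity": [40, 55, 55, 0, 62, 30, 48],
    "tempo": [120.0, 118.0, 118.0, 100.0, 122.0, 119.0, 400.0],
})
cleaned = clean_songs(songs)
print(cleaned)
```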
---
## 🏆 Results
| Model | Highlights |
|-------------------------|----------------------------------------------|
| Linear/Ridge Regression | Poor fit due to complex, noisy data |
| Random Forest | Best overall stability (recall on popular songs) |
| AdaBoost (weighted) | **Best performance**: 86% recall for popular songs |
| Neural Networks | Struggled because the "popularity" target is unstable |
- Predicted revenue for a song with **popularity 55** ≈ **\$357,000 CAD**.
- Pricing tool demonstrated practical viability despite prediction limitations.
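
As a rough illustration of how recall on popular songs might be measured for a weighted AdaBoost classifier, here is a sketch using scikit-learn. Everything specific in it is an assumption: the data is synthetic, the "popular" threshold of 65 is arbitrary, and the inverse-frequency sample weights are one possible reading of "weighted", not necessarily the scheme used in this project.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for audio features and a popularity score in [0, 100].
X = rng.normal(size=(2000, 4))
popularity = np.clip(50 + 15 * X[:, 0] + rng.normal(scale=10, size=2000), 0, 100)

# Binary target: 1 = "popular". The threshold is an assumed value.
y = (popularity >= 65).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Inverse-frequency sample weights so the rarer popular class counts more.
pos_frac = y_train.mean()
weights = np.where(y_train == 1, 1.0 / pos_frac, 1.0 / (1.0 - pos_frac))

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)

# Recall on the popular class: share of truly popular songs that were found.
recall_popular = recall_score(y_test, clf.predict(X_test), pos_label=1)
print(f"Recall on popular class: {recall_popular:.2f}")
```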
---
## 📈 Example
Predicting a song’s revenue based on its feature vector:
```python
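# Example (simplified): `model`, `features`, and `pricing_function` are placeholders
# for a trained regressor, a 2-D feature array, and the fitted revenue curve.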
predicted_popularity = model.predict(features)
predicted_revenue = pricing_function(predicted_popularity)
```
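
One way a `pricing_function` like the one above could be built is a quadratic fit from popularity to play counts, followed by a per-play payout. The sketch below is hypothetical throughout: the data points are synthetic, `RATE_PER_STREAM` is an invented value rather than a real Spotify rate, and it is not meant to reproduce the revenue figure reported in the results.

```python
import numpy as np

# Synthetic (popularity, play count) pairs, made up for illustration only.
popularity = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
plays = 1_000.0 * popularity**2 + 5_000.0 * popularity + 20_000.0

# Degree-2 least-squares fit: plays ~ a * p**2 + b * p + c
coeffs = np.polyfit(popularity, plays, deg=2)
plays_from_popularity = np.poly1d(coeffs)

RATE_PER_STREAM = 0.004  # hypothetical payout per play, not an official rate


def pricing_function(predicted_popularity: float) -> float:
    """Estimate revenue from a predicted popularity score (0 to 100)."""
    p = float(np.clip(predicted_popularity, 0, 100))
    estimated_plays = max(float(plays_from_popularity(p)), 0.0)
    return estimated_plays * RATE_PER_STREAM


print(pricing_function(55.0))
```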
---
## 🚀 How to Run
```bash
# Clone this repo
git clone https://huggingface.co/username/spotify-popularity-prediction
# Install dependencies
pip install -r requirements.txt
# Train or evaluate models
python train_models.py
python evaluate_models.py
# Predict song revenue
python pricing_tool.py
```
The scripts can be adapted to different model types (AdaBoost, Random Forest, neural network).
---
## 🤔 Limitations
- Song features alone are **not sufficient** for high-accuracy predictions.
- "Popularity" is a **time-dependent** and **dynamic** metric.
- Genre diversity (>5000 unique genres) complicated modeling.
---
## 🧠 Future Work
- Predict **play count** directly instead of popularity.
- Fine-tune **XGBoost** and **deep neural networks** on larger datasets.
- Integrate **time-evolution models** for dynamic popularity changes.
- Improve genre classification with unsupervised learning (e.g., genre embeddings).
---
# `popularity_predictor.pth`
This neural network model is extremely weak. I was not good at data science when I made this.
## Iterations
**null**:
<details>
<summary><b>Trained for 500 epochs on 2.1 million songs from the Spotify database</b></summary>

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# `df` (a pandas DataFrame of songs) and `numerical_features` (a list of column
# names ending with 'popularity') are assumed to be defined earlier.

# Split the data into features and target variable
X = df[numerical_features[:-1]].values  # all except popularity
y = df['popularity'].values

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train).view(-1, 1)  # shape to (N, 1)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test).view(-1, 1)


# Define the neural network model
class PopularityPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        # Final layer maps to one regression output so the shape matches y: (N, 1)
        return self.fc4(x)


# Create an instance of the model
model = PopularityPredictor()

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model (full-batch gradient descent: one update per epoch)
num_epochs = 100  # note: the summary above mentions 500 epochs
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
model.eval()
with torch.no_grad():
    predicted = model(X_test_tensor)
```

</details>
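
The listing above stops once `predicted` is computed. A minimal sketch of how such predictions could be scored against the held-out targets follows. The helper name `regression_report` and the demo tensors are hypothetical, and the choice of RMSE, MAE, and R² is an assumption rather than the metrics used in this project.

```python
import torch


def regression_report(predicted: torch.Tensor, target: torch.Tensor) -> dict:
    """Return RMSE, MAE, and R^2 for two tensors holding the same number of values."""
    predicted = predicted.view(-1).float()
    target = target.view(-1).float()
    errors = predicted - target
    rmse = torch.sqrt(torch.mean(errors ** 2)).item()
    mae = torch.mean(torch.abs(errors)).item()
    ss_res = torch.sum(errors ** 2)                    # residual sum of squares
    ss_tot = torch.sum((target - target.mean()) ** 2)  # total sum of squares
    r2 = (1.0 - ss_res / ss_tot).item()
    return {"rmse": rmse, "mae": mae, "r2": r2}


# Made-up values standing in for `predicted` and `y_test_tensor`.
demo_pred = torch.tensor([[50.0], [30.0], [70.0], [10.0]])
demo_true = torch.tensor([[55.0], [25.0], [70.0], [20.0]])
report = regression_report(demo_pred, demo_true)
print(report)
```

With the training listing's variables in scope, the call would be `regression_report(predicted, y_test_tensor)`.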
## 📚 Citation
If you use this project, please cite:
```bibtex
@misc{bhuiyan2024spotify,
  title={Spotify Song Popularity Prediction},
  author={Ashiful Bhuiyan and Blanca Fernández Méndez and Nazanin Ghelichi and Pavle Curcin},
  year={2024},
  institution={York University},
}
```
---
## 🧑‍💻 Authors
- Ashiful Bhuiyan
- Blanca Elvira Fernández Méndez
- Nazanin Ghelichi
- Pavle Curcin
---
## 📄 License
This project is licensed under the [MIT License](LICENSE).
## 🏷 Tags
`#spotify` `#machine-learning` `#music-prediction` `#data-science` `#regression` `#classification` `#popularity-analysis`