---
license: mit
license_name: joke
license_link: LICENSE
datasets:
- ConquestAce/spotify-songs
pipeline_tag: audio-classification
tags:
- music
- spotify
- machine-learning
- music-prediction
- data-science
- regression
- classification
- popularity-analysis
---

# 🎵 Spotify Song Popularity Prediction

Predict the popularity of a song based on its audio features and estimate potential Spotify royalties.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

---

## 📖 Project Overview

This project explores machine learning models to predict the popularity of songs using publicly available features such as danceability, energy, tempo, and valence. It also demonstrates a prototype pricing tool that estimates potential Spotify revenue based on predicted popularity.

Despite the challenges in accurately forecasting popularity due to time-evolving factors, our models show that **minimum popularity and expected revenue can be estimated** using machine learning techniques.

---

## 📊 Dataset

- **Source**:
  - Spotify Web API
  - Original Dataset (~114,000 songs) expanded to **~2 million songs**
- **Features**:
  - Acoustic features (energy, danceability, valence, etc.)
  - Target variable: `popularity` (integer from 0–100)

---

## 🔬 Methods

- **Data Cleaning and Preparation**:
  - Removed zero-popularity entries, duplicates (~8% of rows), and outliers
  - Standardized genres using clustering
- **Exploratory Data Analysis (EDA)**:
  - Analyzed distributions, correlations, and cumulative trends
- **Modeling**:
  - Linear Regression, Ridge Regression
  - Decision Tree, Random Forest, AdaBoost (best recall: 86% on popular songs)
  - XGBoost (binning) and Neural Networks
- **Revenue Estimation**:
  - Quadratic regression fit between predicted popularity and play counts
  - Prototype pricing tool predicting Spotify revenue for songs

---

## 🏆 Results

| Model                   | Highlights                                             |
|-------------------------|--------------------------------------------------------|
| Linear/Ridge Regression | Poor fit due to complex, noisy data                    |
| Random Forest           | Best overall stability (recall on popular songs)       |
| AdaBoost (weighted)     | **Best performance**: 86% recall for popular songs     |
| Neural Networks         | Struggled because the "popularity" target is unstable  |

- Predicted revenue for a song with **popularity 55** ≈ **\$357,000 CAD**.
- Pricing tool demonstrated practical viability despite prediction limitations.

---

## 📈 Example

Predicting a song's revenue based on its feature vector:

```python
# Example (simplified)
predicted_popularity = model.predict(features)
predicted_revenue = pricing_function(predicted_popularity)
```

---

## 🚀 How to Run

```bash
# Clone this repo
git clone https://huggingface.co/username/spotify-popularity-prediction
cd spotify-popularity-prediction

# Install dependencies
pip install -r requirements.txt

# Train or evaluate models
python train_models.py
python evaluate_models.py

# Predict song revenue
python pricing_tool.py
```

(The scripts can be adapted to different model types: AdaBoost, Random Forest, Neural Net.)

---

## 🤔 Limitations

- Song features alone are **not sufficient** for high-accuracy predictions.
- "Popularity" is a **time-dependent** and **dynamic** metric.
- Genre diversity (>5000 unique genres) complicated modeling.

---

## 🧠 Future Work

- Predict **play count** directly instead of popularity.
- Fine-tune **XGBoost** and **deep neural networks** on larger datasets.
- Integrate **time-evolution models** for dynamic popularity changes.
- Improve genre classification with unsupervised learning (e.g., genre embeddings).

---

# `popularity_predictor.pth`

This neural network model is extremely weak. I was not good at data science when I made this.

## Iterations

**null**:
Trained for 500 epochs on data for 2.1 million songs from the Spotify database.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# `df` (a pandas DataFrame) and `numerical_features` (a list of column names
# ending with 'popularity') are assumed to be defined earlier.

# Split the data into features and target variable
X = df[numerical_features[:-1]].values  # all except popularity
y = df['popularity'].values

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train).view(-1, 1)  # shape to (N, 1)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test).view(-1, 1)

# Define the neural network model
class PopularityPredictor(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        # Final layer maps to one value per song, matching the (N, 1) targets
        return self.fc4(x)

# Create an instance of the model
model = PopularityPredictor(X_train.shape[1])

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model (full-batch: one optimizer step per epoch)
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
model.eval()
with torch.no_grad():
    predicted = model(X_test_tensor)
```
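
The listing above ends by computing `predicted` on the test split without scoring it. Below is a minimal sketch of how such predictions could be scored, assuming they have first been converted to plain Python lists (for example with `predicted.squeeze(1).tolist()`). The helper name `regression_metrics` and the sample values are hypothetical and are not part of the original project.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Per-song prediction error on the 0-100 popularity scale
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Hypothetical values for illustration only, not results from this project.
y_true = [50.0, 20.0, 80.0, 35.0]
y_pred = [40.0, 30.0, 70.0, 45.0]
mae, rmse = regression_metrics(y_true, y_pred)
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

Both metrics are in popularity points, so they are easier to interpret than the raw MSE printed during training.
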
## 📚 Citation

If you use this project, please cite:

```bibtex
@misc{bhuiyan2024spotify,
  title={Spotify Song Popularity Prediction},
  author={Ashiful Bhuiyan and Blanca Fernández Méndez and Nazanin Ghelichi and Pavle Curcin},
  year={2024},
  institution={York University},
}
```

---

## 🧑‍💻 Authors

- Ashiful Bhuiyan
- Blanca Elvira Fernández Méndez
- Nazanin Ghelichi
- Pavle Curcin

---

## 📄 License

This project is licensed under the [MIT License](LICENSE).

## 🏷 Tags

`#spotify` `#machine-learning` `#music-prediction` `#data-science` `#regression` `#classification` `#popularity-analysis`