Financial Sentiment Classifier using SBERT
Extended Description
Overview
In the realm of financial decision-making, timely and accurate information is crucial. One of the key aspects of understanding the market and its behavior is analyzing the sentiment of financial news, articles, and social media posts. Positive or negative sentiment in financial news can have significant impacts on stock prices, investment decisions, and even market trends.
This model, the Financial Sentiment Classifier using SBERT, is designed to classify the sentiment of financial news headlines into three categories: positive, negative, and neutral. The model uses Sentence-BERT (SBERT), a transformer-based model that produces dense, semantically rich sentence embeddings. These embeddings are then passed to a RandomForestClassifier, trained on labeled historical financial news headlines, which predicts the sentiment of each input.
SBERT is well suited to this task because it is fine-tuned specifically for sentence-level semantic understanding. It captures subtle nuances in language and context, such as market sentiment expressed in financial reports or commentary on stock performance, which makes it a strong fit for the financial domain.
Intended Use
This model can be employed in a variety of financial applications, including but not limited to:
- Automating sentiment analysis workflows: Automatically categorize financial headlines from news sources, social media, or corporate press releases.
- Market prediction: Use sentiment data to predict market movements, informing stock trading decisions.
- Investor sentiment monitoring: Track sentiment over time to gauge how the market or the public perceives a particular financial entity or event.
- Financial news aggregation: Classify news articles in real-time for news aggregation platforms to filter positive, neutral, or negative content.
This model's flexibility makes it adaptable for real-time applications, including automated trading systems, financial monitoring tools, and market sentiment analysis platforms.
Model Details
The model comprises two major components:
Sentence-BERT (SBERT): This is a specialized variant of the BERT (Bidirectional Encoder Representations from Transformers) model, designed to produce high-quality sentence embeddings. Unlike traditional BERT, which works on token-level representations, SBERT generates fixed-size embeddings for entire sentences or documents. This ability makes it a powerful tool for understanding the meaning of financial statements and market-relevant news.
RandomForestClassifier: Once the sentences are transformed into embeddings using SBERT, the model uses a RandomForestClassifier to perform sentiment classification. The RandomForest model is a robust machine learning algorithm that combines the predictions of multiple decision trees to deliver accurate results. In this case, the classifier predicts the sentiment of the sentence based on the embeddings generated by SBERT.
The sentiment classification system is trained using a labeled dataset of financial news headlines, where each headline has been annotated with a sentiment label (positive, negative, or neutral). The model is fine-tuned to recognize patterns that relate to sentiment in financial language.
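To make the pipeline concrete, below is a minimal training sketch. It assumes the SBERT checkpoint all-MiniLM-L6-v2 (the card does not name the actual checkpoint), and the three-headline dataset with its labels is purely illustrative, not the real training data.

```python
# Hedged sketch of the pipeline: SBERT embeddings -> RandomForestClassifier.
# The checkpoint name and the toy dataset are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

sbert = SentenceTransformer("all-MiniLM-L6-v2")

# Toy labeled headlines; the real label-to-integer mapping used in training
# is described in the Data & Preprocessing section.
train_headlines = [
    "Shares surge after record annual profit",
    "Company issues profit warning amid weak demand",
    "Board schedules annual general meeting for May",
]
train_labels = [0, 1, 2]

# Encode each headline into a fixed-size dense vector.
X_train = sbert.encode(train_headlines)

# Defaults are used here; see the Hyperparameters section below for the
# configuration reported for this model.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, train_labels)

# Predict the sentiment of an unseen headline.
X_new = sbert.encode(["Quarterly revenue falls short of analyst estimates"])
print(clf.predict(X_new))
```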
Data & Preprocessing
This model was trained on a custom dataset consisting of financial news headlines, though it can be fine-tuned with your own data to improve performance for specific use cases.
The preprocessing steps included:
- Tokenizing the text: Breaking each headline into individual words or tokens.
- Converting text to embeddings: Each headline was passed through the SBERT model, generating a dense vector (embedding) that captures the semantic meaning of the sentence.
- Label Encoding: The sentiment labels (positive, negative, neutral) were encoded into numeric values (0, 1, 2) for the classifier to process.
Additionally, when fine-tuning the model on your own data, preprocessing consists of converting the new financial headlines into SBERT embeddings and feeding them to the RandomForest model.
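As a hedged illustration of the label-encoding step, scikit-learn's LabelEncoder can produce the numeric labels; note that the exact encoder and label-to-integer mapping used during training are not documented here, so this is only one common way to do it.

```python
# Illustrative label encoding; the utility actually used during training
# is not documented, LabelEncoder is simply a common choice.
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "neutral", "positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# LabelEncoder assigns integers alphabetically
# ("negative" -> 0, "neutral" -> 1, "positive" -> 2), which may differ
# from the 0/1/2 mapping used when this model was trained.
print(list(encoder.classes_), encoded)
```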
Model Evaluation
On the test data, the model achieves an accuracy of 61% and an F1-score of 0.61. This is not optimal, but it is acceptable given the simplicity of the pipeline and the limited amount of data the model was trained on.
Hyperparameters:
- Number of Estimators (n_estimators): 200
- Max Depth (max_depth): 20
- Min Samples Split (min_samples_split): 5
- Min Samples Leaf (min_samples_leaf): 1
- Random State (random_state): 42
- Max Features (max_features): 'sqrt' (default value for RandomForest)
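For reference, this configuration corresponds to the following scikit-learn instantiation (assuming a standard RandomForestClassifier):

```python
# RandomForestClassifier configured with the hyperparameters reported above.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=1,
    random_state=42,
    max_features="sqrt",  # default for classification in recent scikit-learn versions
)
```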
Classification Report:
| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| 0     | 0.66      | 0.52   | 0.58     |
| 1     | 0.62      | 0.80   | 0.70     |
| 2     | 0.55      | 0.52   | 0.54     |

- Overall Accuracy: 0.61
- Macro Average: 0.61 (Precision, Recall, F1-Score)
- Weighted Average: 0.61 (Precision, Recall, F1-Score)
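These figures can be reproduced with scikit-learn's reporting utilities. The sketch below uses toy labels and predictions purely to show the report format; they are not the actual test data behind the numbers above.

```python
# Computing evaluation metrics with scikit-learn's built-in utilities.
from sklearn.metrics import accuracy_score, classification_report

# Toy ground-truth labels and predictions, for illustration only.
y_test = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 2, 2]

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=2))
```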
Usage
To use the model, first install the necessary dependencies:
```bash
pip install sentence-transformers scikit-learn
```
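Then a minimal inference sketch looks like the following. The SBERT checkpoint name and the rf_sentiment.joblib file name are placeholders, since the card does not specify which checkpoint was used or how the trained classifier is stored.

```python
# Hedged usage sketch: load an SBERT encoder and a previously trained
# RandomForestClassifier, then classify new headlines.
# "all-MiniLM-L6-v2" and "rf_sentiment.joblib" are placeholder names.
import joblib
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")
clf = joblib.load("rf_sentiment.joblib")  # hypothetical path to the saved classifier

headlines = [
    "Shares jump after the company raises its full-year guidance",
    "Lender reports unexpected quarterly loss",
]

# Encode the headlines and predict integer sentiment labels (0 / 1 / 2).
embeddings = sbert.encode(headlines)
predictions = clf.predict(embeddings)
print(predictions)
```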