MindHeal Assistant
A multi-approach emotional support conversation system using the Emotional Support Conversations (esconv) dataset.[1]
Github Repo
https://github.com/DukeAIPI540Spring2025Meowth/nlp-demo
Overview
This project implements three approaches to emotional support conversations:
- Naive Approach: Using a foundation model without special prompting, RAG, or finetuning
- Traditional ML Approach: Hidden Markov Model (HMM)
- Deep Learning Approach: Finetuned Llama-3.2-3B-Instruct model
Live Demo
The application is deployed on Digital Ocean: https://mindheal-assistant-7kfky.ondigitalocean.app/
Novelty and Contribution
- Unlike existing chatbot-based emotional support systems, our approach integrates three distinct methodologies to compare and contrast effectiveness.
- We introduce an evaluation framework that uses an LLM as a judge to provide structured scoring for emotional support conversations.
- Our HMM-based emotion tracking model enhances structured dialogue generation in a way that hasn't been widely explored for emotional support systems.
Dataset
We used the esconv dataset, a crowd-sourced collection of emotional support conversations between therapists and patients.
Ethical Considerations on the Dataset
The esconv dataset consists of anonymized conversations between therapists and patients. While it provides a valuable resource for studying emotional support strategies, several ethical considerations must be addressed:
- Bias and Representation: Since the dataset is anonymized, we do not have demographic information on the participants. This means we cannot ensure that it represents diverse populations across gender, race, socioeconomic status, or cultural backgrounds.
- Therapeutic Quality: The dataset captures a range of therapist responses, but without knowing the professional qualifications of the individuals involved, we cannot verify whether all responses align with best practices in mental health support.
- Potential for Misuse: As the dataset is used to train AI models, there is a risk that models may generate responses that appear empathetic but lack true understanding, which could be harmful in real-world mental health applications.
- Limitations in Crisis Scenarios: The dataset does not include structured intervention for crisis situations such as imminent self-harm or suicide. Therefore, models trained on this dataset should not be relied upon for urgent mental health support.
We acknowledge these challenges and emphasize that MindHeal Assistant is an educational tool rather than a replacement for professional mental health services. We encourage future work on datasets that include more structured, clinically verified responses while ensuring inclusivity and representation.
Technical Details
Fine-tuning
- Used torchtune with Low Rank Adaptation (LoRA) recipe for Llama-3.2-3B-Instruct
- Used the LoRA configuration (3B_lora_single_device.yaml), copied and adapted with the `tune copy` command
- Training performed on Google Colab with an A100 GPU
- Fine-tuned for 5 epochs (~4-5 minutes per epoch)
- Converted to GGUF format using llama-cpp
- Applied quantization so the model can run on CPUs
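To illustrate the last two steps, here is a minimal sketch of loading the quantized GGUF model for CPU inference with llama-cpp-python; the file path, generation parameters, and system prompt are assumptions for illustration, not the deployed configuration.

```python
# Sketch: CPU inference with the quantized GGUF model via llama-cpp-python.
# The model path and parameters below are assumptions, not the deployed config.
from llama_cpp import Llama

llm = Llama(
    model_path="meowth-nlp-demo-0.1_llama-3.2-3b-instruct_q5_k_m.gguf",  # assumed local path
    n_ctx=2048,    # context window
    n_threads=4,   # CPU threads
)

messages = [
    {"role": "system", "content": "You are a supportive, empathetic listener."},  # assumed prompt
    {"role": "user", "content": "I've been feeling overwhelmed at work lately."},
]
out = llm.create_chat_completion(messages=messages, max_tokens=256, temperature=0.7)
print(out["choices"][0]["message"]["content"])
```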
Hidden Markov Model (HMM)
- Combines HMM for emotion state tracking with ML classifiers for emotion and problem detection
- Uses TF-IDF vectorization with MultinomialNB for emotion classification
- Employs RandomForest classifier for problem type categorization
- Implements transition matrices between emotional states based on therapeutic progression
- Maintains a library of response templates for different strategies (Question, Reflection, Suggestion, Information, Reassurance)
- Response selection determined by current emotional state and conversation context
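A minimal sketch of the core pieces described above, using scikit-learn and NumPy; the emotion labels, state names, transition probabilities, and training snippets are illustrative placeholders rather than the actual trained model.

```python
# Sketch: TF-IDF + MultinomialNB emotion classifier plus a simple transition
# matrix that advances the tracked emotional state each turn.
# Labels, states, and probabilities are placeholders, not the trained model.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data; the real classifier is trained on esconv utterances.
texts = ["I feel so anxious about my job", "I'm devastated after the breakup"]
emotions = ["anxiety", "sadness"]

emotion_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
emotion_clf.fit(texts, emotions)

states = ["distressed", "exploring", "stabilizing"]
# Rows: current state, columns: next state (assumed therapeutic progression).
transitions = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])

def next_state(current: str) -> str:
    """Sample the next emotional state from the transition matrix."""
    idx = states.index(current)
    return np.random.choice(states, p=transitions[idx])

print(emotion_clf.predict(["I can't stop worrying"]))  # e.g. ['anxiety']
print(next_state("distressed"))
```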
Evaluation (LLM-as-a-judge)
- Implements a criteria-based evaluation framework using an LLM as a judge
- Evaluates responses based on five key metrics:
  - Technical Accuracy (1-5): Application of proper therapeutic techniques
  - Structural Adherence (1-5): Following the ABCDE model in responses
  - Empathetic Tone (1-5): Level of emotional validation vs. robotic phrasing
  - Intervention Depth (1-5): Quality of follow-up questioning
  - Clinical Safety (1-5): Detection of risk factors and implementation of proper protocols
- Compares performance across all three approaches (naive, traditional, and deep learning)
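As a rough sketch of the judging step, the snippet below scores a single reply on the five criteria; the judge model, prompt wording, and JSON parsing are assumptions for illustration (the OpenAI Python client is used here, but any capable chat model could serve as the judge).

```python
# Sketch: LLM-as-a-judge scoring of one assistant reply on the five criteria.
# Judge model, prompt wording, and output parsing are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CRITERIA = ["technical_accuracy", "structural_adherence", "empathetic_tone",
            "intervention_depth", "clinical_safety"]

def judge(user_message: str, assistant_reply: str) -> dict:
    prompt = (
        "Score the assistant's emotional-support reply on each criterion "
        f"from 1 to 5: {', '.join(CRITERIA)}. "
        "Return only a JSON object mapping criterion to score.\n\n"
        f"User: {user_message}\nAssistant: {assistant_reply}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# Example: scores = judge("I feel hopeless lately.", "That sounds really heavy. ...")
```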
Results and Conclusion
| Metric | Naive | ML | NN |
|---|---|---|---|
| Technical Accuracy | 3.515 | 2.465 | 2.445 |
| Structural Adherence | 1.78 | 1.085 | 1.12 |
| Empathetic Tone | 4.275 | 3.275 | 3.45 |
| Intervention Depth | 2.475 | 1.66 | 1.66 |
| Clinical Safety | 2.865 | 2.055 | 2.12 |
Explanation of Results:
- Naive Approach performed best in technical accuracy and empathetic tone, likely due to the foundation model's general-purpose conversational ability.
- ML (HMM-based) and NN struggled with technical accuracy, potentially due to difficulty in mapping structured techniques to responses.
- Structural adherence was low across all methods, with the ML and NN approaches scoring slightly lower than the naive approach.
- Empathy scores were highest for the naive approach, but this could be due to a lack of structured emotional support strategies.
- Clinical safety scores were relatively low, indicating that no approach was fully adept at risk detection for sensitive topics like suicide intervention.
Presentation
For a detailed project overview, refer to our presentation.
Citation
[1] Liu et al. (2021). Towards Emotional Support Dialog Systems. ACL.