BoardGPT: RAG-Powered Surgery Oral Board Simulator

This repository contains the code for a Retrieval-Augmented Generation (RAG) system designed to simulate surgery oral board examinations, providing an interactive training tool for surgical residents.

Introduction

Surgery oral board exams are a critical step in certification, evaluating a resident's clinical reasoning and decision-making under pressure. Unlike written exams that test recall, oral boards assess the ability to handle dynamic, unfolding clinical scenarios. Current Large Language Models (LLMs) often struggle with the niche knowledge, complex reasoning, and specific interaction style required for these exams. Furthermore, creating high-quality, specialized training data is resource-intensive. This project addresses these challenges by implementing a RAG pipeline that leverages expert-level case material to simulate realistic oral board interactions, providing targeted feedback on user responses. Our results show that the RAG approach successfully retrieves relevant clinical scenarios, forming the basis for an effective simulation.

Data Source ("Training Data")

The knowledge base for this RAG system is built from 98 high-quality simulated oral board scenarios created by surgical education experts. The data were sourced from reliable educational materials; each scenario consists of a clinical presentation followed by a series of questions and model answers that build on that presentation.

  • Preprocessing: The oral board scenarios were manually cleaned and structured. The cleaned dialogues were then processed from .docx files into a structured Pandas DataFrame in which each row represents a single question-answer turn within a specific clinical case (identified by case_id and clinical_presentation). See src/data_processing.py for details; an illustrative sketch of the resulting structure follows this list.
  • Knowledge Base Construction: Unlike traditional model fine-tuning, RAG uses the entire processed dataset as its knowledge source. Therefore, a standard train/validation/test split was not performed. The 98 processed cases form the complete knowledge base from which the retriever component draws information during the simulation.
  • Data Availability: The raw .docx transcript files are not included in this repository to respect copyright and privacy. However, the processed data structure and methodology are detailed here, and access to the original source may be available through appropriate channels or upon reasonable request for verification purposes.
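The sketch below illustrates the kind of row-per-turn DataFrame that preprocessing targets. The case_id and clinical_presentation columns are named in this README; the turn, question, and answer columns and all row values are illustrative assumptions rather than the exact schema produced by src/data_processing.py.

import pandas as pd

# Illustrative rows only; columns beyond case_id and clinical_presentation
# (turn, question, answer) and all values are assumptions for demonstration.
rows = [
    {
        "case_id": "87A",
        "clinical_presentation": "18-month-old boy with intermittent abdominal pain",
        "turn": 1,
        "question": "What are your initial steps in evaluating this patient?",
        "answer": "Obtain vital signs and perform a focused history and physical exam.",
    },
    {
        "case_id": "87A",
        "clinical_presentation": "18-month-old boy with intermittent abdominal pain",
        "turn": 2,
        "question": "The patient is tachycardic; what imaging would you order?",
        "answer": "Abdominal ultrasound to look for a target sign.",
    },
]
qa_df = pd.DataFrame(rows)  # one row per question-answer turn, grouped by case_id
print(qa_df.head())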

RAG System Setup ("Training Method")

This project utilizes a Retrieval-Augmented Generation (RAG) approach rather than fine-tuning a base LLM. This decision was driven by the limited availability of large-scale, structured datasets specific to the oral board format and the desire to ground the simulation firmly in expert-validated case material.

The RAG pipeline consists of several key components implemented in the src/ directory:

  1. ClinicalCaseProcessor: Takes the structured DataFrame of Q&A turns and processes it into a format suitable for retrieval. It groups turns by case and generates a semantic embedding for a summary of each case using the all-MiniLM-L6-v2 sentence-transformer model via the sentence-transformers library. This model was chosen for its balance of performance and efficiency in capturing semantic meaning for retrieval. The processed data, including embeddings, is saved as a Hugging Face Dataset.
  2. ClinicalCaseRetriever: Takes a user's query (e.g., "pediatric appendicitis") and uses the same all-MiniLM-L6-v2 model to generate a query embedding. It calculates the cosine similarity between the query embedding and the pre-computed case embeddings to find and return the most relevant clinical case(s) from the knowledge base (a minimal sketch of this embed-and-rank flow follows this list).
  3. AnswerEvaluator: Employs a separate LLM (meta-llama/Llama-3.2-3B-Instruct) acting as a judge. Given the user's response, the expected ("ground truth") answer from the retrieved case, and the clinical context, it evaluates the user's answer based on a predefined rubric (Correct/Partially Correct/Incorrect) and provides textual feedback. See the "Prompt Format" section for details.
  4. OralExamSimulator: Orchestrates the entire process. It uses the ClinicalCaseRetriever to select a case based on user input, presents questions from the case sequentially, collects the user's answers, passes the answer and expected answer to the AnswerEvaluator, and relays the feedback to the user.
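The following sketch shows the core embed-and-rank idea behind the processor and retriever, using the sentence-transformers utilities named above. The case summaries and variable names are hypothetical, and the real components also group Q&A turns and persist a Hugging Face Dataset, which is omitted here.

from sentence_transformers import SentenceTransformer, util

# Same embedding model the pipeline uses for case summaries and queries
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical one-line summaries standing in for the 98 processed cases
case_summaries = [
    "Pediatric intussusception presenting with intermittent abdominal pain",
    "Penetrating neck trauma with suspected vascular injury",
    "Esophageal perforation after upper endoscopy",
]
case_embeddings = model.encode(case_summaries, convert_to_tensor=True)

# Embed the user's query with the same model and rank cases by cosine similarity
query = "bowel intussusception in a child"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, case_embeddings)[0]
best_idx = int(scores.argmax())
print(f"Best match: {case_summaries[best_idx]} (similarity {float(scores[best_idx]):.4f})")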

Evaluation

Evaluation focused primarily on the effectiveness of the retrieval component, as the quality of the simulation hinges on retrieving the correct, relevant clinical case. We also performed qualitative comparisons of interaction quality using RAG versus purely synthetic generation. To quantitatively assess the retriever, we created a benchmark task using 5 representative clinical queries (e.g., "appendix inflammation in a child", "perforation of the esophagus") with known corresponding "gold standard" case IDs from our knowledge base. We then employed standard information retrieval metrics to evaluate the ClinicalCaseRetriever's performance in returning the correct case within the top results:

  • Hit Rate@5: Was the correct case ID among the top 5 retrieved results?

  • Mean Reciprocal Rank (MRR): On average, how high up in the ranking was the correct case ID? (1.0 means it was always ranked first).

  • Normalized Discounted Cumulative Gain (NDCG@5): Measures ranking quality, rewarding the correct case for appearing higher in the top-5 list, with the retriever's similarity scores used as graded relevance.

    Metric    Score
    ------    -----
    Hit@5     1.0
    MRR       1.0
    NDCG@5    1.0

These results indicate that the retrieval component is highly effective at identifying and prioritizing the correct clinical case from the knowledge base given a user's natural-language query. Qualitative analysis also suggested that simulations grounded in retrieved cases felt noticeably more realistic and clinically relevant than scenarios generated from scratch by an LLM without retrieval.
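The benchmark metrics above can be computed in a few lines of code. The sketch below is a hedged reconstruction, assuming one gold case ID per query and binary relevance for NDCG (the actual evaluation uses similarity scores as relevance); the query strings, case IDs, and rankings are invented for illustration.

import numpy as np

def hit_at_k(ranked_ids, gold_id, k=5):
    # 1.0 if the gold case ID appears among the top-k retrieved results
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    # 1/rank of the gold case ID, or 0.0 if it was never retrieved
    for rank, case_id in enumerate(ranked_ids, start=1):
        if case_id == gold_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gold_id, k=5):
    # Binary-relevance NDCG@k; ideal DCG is 1.0 because exactly one case is relevant
    for rank, case_id in enumerate(ranked_ids[:k], start=1):
        if case_id == gold_id:
            return 1.0 / np.log2(rank + 1)
    return 0.0

# Hypothetical benchmark: query -> (gold case ID, retriever's ranked case IDs)
benchmark = {
    "appendix inflammation in a child": ("12B", ["12B", "07C", "33A", "19D", "02E"]),
    "perforation of the esophagus": ("45F", ["45F", "11A", "29B", "08C", "17D"]),
}
hits = [hit_at_k(ranked, gold) for gold, ranked in benchmark.values()]
rrs = [reciprocal_rank(ranked, gold) for gold, ranked in benchmark.values()]
ndcgs = [ndcg_at_k(ranked, gold) for gold, ranked in benchmark.values()]
print(f"Hit@5: {np.mean(hits):.2f} | MRR: {np.mean(rrs):.2f} | NDCG@5: {np.mean(ndcgs):.2f}")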

Usage and Intended Uses

This repository provides the building blocks for simulating a surgery oral board exam. The primary intended use is as an educational tool for surgical residents preparing for their board examinations.

Workflow:

  1. Setup: Ensure all requirements from requirements.txt are installed. Set up Hugging Face authentication (e.g., via a .env file) if needed for gated model downloads; a minimal sketch appears after this list.
  2. Load/Process Data: Run the data loading and preprocessing steps (demonstrated in notebooks/demo.ipynb) to create the processed_clinical_cases dataset with embeddings, if not already present.
  3. Run Simulation: Use the OralExamSimulator class.
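For step 1, authentication can be handled once per session. This is a minimal sketch, assuming the token is stored in .env under a variable such as HF_TOKEN (the variable name is an assumption, not something this repository prescribes).

import os
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()                        # read key=value pairs from .env into the environment
login(token=os.getenv("HF_TOKEN"))   # authenticate for gated model downloads (e.g., Llama 3.2)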

Example Code:

import os
import sys
# Add project root to path if running from notebooks/
project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
if project_root not in sys.path:
    sys.path.append(project_root)

from src.retriever import ClinicalCaseRetriever
from src.evaluator import AnswerEvaluator
from src.simulator import OralExamSimulator

# --- Configuration ---
PROCESSED_DATA_PATH = "./processed_clinical_cases" # Adjust path as needed
EVALUATOR_MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
EMBEDDING_MODEL_ID = "all-MiniLM-L6-v2"

# --- Initialize Components ---
# Assumes processed data exists and HF login is handled
retriever = ClinicalCaseRetriever(dataset_path=PROCESSED_DATA_PATH, model_name=EMBEDDING_MODEL_ID)
evaluator = AnswerEvaluator(model_id=EVALUATOR_MODEL_ID)
simulator = OralExamSimulator(retriever, evaluator)

# --- Start Simulation ---
query = "injury to the neck vessel"
case_info = simulator.start_new_case(clinical_query=query)

current_question = case_info.get('current_question')
print(f"Examiner: {current_question}")

# --- Get User Input & Process Turn 1 ---
user_answer = input("Your Answer: ")
result = simulator.process_user_response(user_answer)

# --- Turn 2: Get Next Question in Scenario ---
next_q_text = result.get('next_question')
print(next_q_text)

# --- Get User Input & Process Turn 2 ---
user_answer = input("Your Answer: ")
result = simulator.process_user_response(user_answer)

# Continues until all questions in scenario are asked...
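The two turns above generalize naturally into a loop that can be run in place of the turn-by-turn calls. The sketch below makes assumptions about the simulator's return values: it presumes process_user_response keeps returning a next_question key until the case is exhausted and that feedback is available under a feedback key (the feedback key name is not confirmed by this README).

# --- Hypothetical full-case loop (return-value keys partly assumed) ---
result = {"next_question": case_info.get("current_question")}
while result.get("next_question"):
    print(f"Examiner: {result['next_question']}")
    user_answer = input("Your Answer: ")
    result = simulator.process_user_response(user_answer)
    print(result.get("feedback", ""))   # 'feedback' key is an assumption
print("Case complete!")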

Refer to notebooks/demo.ipynb for a more detailed, interactive demonstration.

Prompt Format (Answer Evaluator)

A key component of this RAG system is the automated feedback. This is generated by the AnswerEvaluator module, which uses the following structured prompt internally when querying the meta-llama/Llama-3.2-3B-Instruct model to assess the user's response:

<s>[INST] You are acting as an expert examiner... Use the grading rubric below...

[RUBRIC]
- Correct: ...
- Partially Correct: ...
- Incorrect: ...

Clinical context: {clinical_context}

Here is the model answer that contains the key points expected from the resident:
{expected_answer}

Now, here is the resident’s actual response:
{user_answer}

Evaluate the resident’s response based **only** on the expected answer above...

Focus your evaluation on:
1. Which key points were mentioned vs. missed
...

Start your output with:
ASSESSMENT: [Correct / Partially Correct / Incorrect]
Then write 1-2 clear, specific sentences explaining...

[EXAMPLE 1]
Expected answer: ...
Resident’s response: ...
ASSESSMENT: Partially Correct
...

[EXAMPLE 2]
Expected answer: ...
Resident’s response: ...
ASSESSMENT: Correct
...

[/INST]</s>
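To make the flow concrete, the sketch below shows one way such a prompt could be filled in and sent to the judge model via the transformers text-generation pipeline. The abbreviated template, the sample strings, and the generation parameters are illustrative assumptions; they are not the exact prompt or settings in src/evaluator.py.

from transformers import pipeline

# Abbreviated stand-in for the full prompt template shown above
PROMPT_TEMPLATE = (
    "You are acting as an expert examiner... Use the grading rubric below...\n"
    "Clinical context: {clinical_context}\n"
    "Here is the model answer that contains the key points expected from the resident:\n"
    "{expected_answer}\n"
    "Now, here is the resident's actual response:\n"
    "{user_answer}\n"
    "Start your output with:\nASSESSMENT: [Correct / Partially Correct / Incorrect]\n"
)

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")
prompt = PROMPT_TEMPLATE.format(
    clinical_context="18-month-old boy with intermittent abdominal pain",
    expected_answer="Obtain vital signs, take a focused history, and examine the abdomen.",
    user_answer="I would check vitals and examine the abdomen.",
)
output = generator(prompt, max_new_tokens=256, do_sample=False)
print(output[0]["generated_text"])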

Expected Output Format

The simulation follows an interactive loop. Here is an example of the expected interaction flow and output format a user would experience when running the simulation (e.g., via the notebooks/demo.ipynb notebook):

  1. Case selection:
  • User provides a topic query (e.g., bowel intussusception)
Starting simulation for query: 'bowel intussusception in a child'
  • System retrieves the relevant case and presents the first question.
Case Started: Intussusception Pediatrics (ID: 87A) | Similarity: 0.7767
Total Questions: 12

--- Question 1 ---
You're called to the emergency department to evaluate an 18-month-old boy with an eight-hour history of intermittent intense abdominal pain...
  2. Turn 1:
  • User inputs their answer.
➡️ Your Turn (Question 1/12)
   Your Answer: I'd obtain vital signs and perform a history and physical, focusing on the abdomen and doing a rectal exam.
  • System processes the answer and provides feedback, followed by the next question.
⏳ Processing User Answer...
------------------------------------------------------------
📝 Feedback:
> ASSESSMENT: Partially Correct
> The resident mentioned obtaining vital signs and performing a history and physical examination... but omitted focusing on the patient's history and birth history mentioned in the expected answer.

❓ Next Question (2/12)

The patient's tachycardic, the rest of the vital signs are normal. Your exam reveals a toddler in the fetal position...
------------------------------------------------------------
  3. Subsequent Turns:
  • The process repeats: user provides an answer to the current question, the system provides feedback and the next question.
  4. Case Completion:
  • After the user answers the final question, the system provides feedback and indicates the case is complete.
⏳ Processing User Answer...
------------------------------------------------------------
📝 Feedback:
> ASSESSMENT: Correct
> [Feedback for the final answer...]

🏁 Case Complete!
------------------------------------------------------------

(Note: Feedback generated by the 3B-parameter model sometimes deviates slightly from this exact format, as noted in the Limitations section.)

Limitations

  • Knowledge Base Scope: The system's ability to simulate scenarios is currently limited to the 98 cases. It cannot handle queries for topics not covered in this dataset.
  • Evaluator Consistency: The meta-llama/Llama-3.2-3B-Instruct model used for answer evaluation, while capable, sometimes struggles with strict adherence to the output format (e.g., occasionally omitting the ASSESSMENT: prefix) or provides feedback that could be more nuanced. Using a larger evaluation model or further prompt refinement could improve consistency; a simple parsing fallback for non-conforming outputs is sketched below.
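The helper below is a hypothetical illustration of such a fallback, not code from this repository: it first looks for the ASSESSMENT: prefix and, failing that, searches the feedback text for one of the rubric labels.

import re

def parse_assessment(feedback: str) -> str:
    # Preferred path: the evaluator followed the prompt and emitted the prefix
    match = re.search(r"ASSESSMENT:\s*(Correct|Partially Correct|Incorrect)", feedback, re.IGNORECASE)
    if match:
        return match.group(1).title()
    # Fallback: look for a rubric label anywhere in the text. Check
    # "Partially Correct" before "Correct" so it is not shadowed.
    lowered = feedback.lower()
    for label in ("Partially Correct", "Incorrect", "Correct"):
        if label.lower() in lowered:
            return label
    return "Unparsed"  # caller decides how to handle unparseable feedback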