Contributing to TheDataGuy Chat
Thank you for your interest in contributing to TheDataGuy Chat! This document provides guidelines and instructions for contributing to this repository.
Project Overview
TheDataGuy Chat is a Q&A chatbot powered by content from the TheDataGuy blog. It uses RAG (Retrieval-Augmented Generation) to answer questions about topics such as RAGAS, RAG evaluation, building research agents, metric-driven development, and data science best practices.
Development Environment Setup
Prerequisites
- Python 3.13 or higher
- uv for Python package management
- Docker (optional, for containerized development)
- OpenAI API key
Local Setup
Clone the repository:
git clone https://github.com/mafzaal/lets-talk.git
cd lets-talk
Create a .env file with the necessary environment variables:
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l

# Vector Database Creation Configuration (optional)
FORCE_RECREATE=False     # Whether to force recreation of the vector store
OUTPUT_DIR=./stats       # Directory to save stats and artifacts
USE_CHUNKING=True        # Whether to split documents into chunks
SHOULD_SAVE_STATS=True   # Whether to save statistics about the documents
Install dependencies:
uv sync
Build the vector store:
./scripts/build-vector-store.sh
Run the application:
chainlit run py-src/app.py --host 0.0.0.0 --port 7860
Using Docker
Build the Docker image:
docker build -t lets-talk .
Run the container:
docker run -p 7860:7860 --env-file ./.env lets-talk
Project Structure
lets-talk/
├── data/                  # Raw blog post content
├── py-src/                # Python source code
│   ├── lets_talk/         # Core application modules
│   │   ├── agent.py       # Agent implementation
│   │   ├── config.py      # Configuration settings
│   │   ├── models.py      # Data models
│   │   ├── prompts.py     # LLM prompt templates
│   │   ├── rag.py         # RAG implementation
│   │   ├── rss_tool.py    # RSS feed integration
│   │   ├── tools.py       # Tool implementations
│   │   └── utils/         # Utility functions
│   ├── app.py             # Main application entry point
│   ├── pipeline.py        # Data processing pipeline
│   └── notebooks/         # Jupyter notebooks for analysis
├── db/                    # Vector database storage
├── evals/                 # Evaluation datasets and results
└── scripts/               # Utility scripts
Adding New Blog Posts
When new blog posts are published on TheDataGuy.pro, follow these steps to add them to the chat application:
- Add the markdown content to the data/ directory in a new folder named after the post slug
- Run the vector store update script:
python py-src/pipeline.py --force-recreate
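The first step above (a new folder under data/ named after the post slug) could be scripted roughly as follows. This is only a sketch: the `slugify` rule, the `add_post` helper, and the `index.md` file name are hypothetical and may not match how existing posts in data/ are laid out, so check a current post folder before relying on it.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def add_post(data_dir: Path, title: str, markdown: str) -> Path:
    """Write a post's markdown into data_dir/<slug>/ and return the file path."""
    post_dir = data_dir / slugify(title)
    post_dir.mkdir(parents=True, exist_ok=True)
    post_file = post_dir / "index.md"  # assumed file name, not confirmed by the project
    post_file.write_text(markdown, encoding="utf-8")
    return post_file
```

For example, `add_post(Path("data"), "Building Research Agents", text)` would write to data/building-research-agents/index.md, after which the pipeline command above rebuilds the vector store.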
Workflow
- Fork the repository on GitHub
- Clone your fork to your local machine
- Create a new branch for your feature or bug fix
- Make your changes
- Run the tests to ensure everything works
- Commit your changes with clear, descriptive commit messages
- Push your branch to your fork on GitHub
- Submit a Pull Request to the main repository
Code Style
- Follow PEP 8 style guidelines for Python code
- Use meaningful variable and function names
- Add docstrings to all functions and classes
- Include type hints where appropriate
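As an illustration of these guidelines, the sketch below shows one function written with a descriptive name, type hints, and a docstring. The `chunk_text` helper is hypothetical and is not taken from the project's code; it only demonstrates the expected style.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into character chunks that overlap their neighbours.

    Args:
        text: The text to split.
        chunk_size: Maximum number of characters per chunk.
        overlap: Number of characters shared between consecutive chunks.

    Returns:
        The chunks in order; an empty list when text is empty.

    Raises:
        ValueError: If chunk_size is not positive or overlap is out of range.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap  # distance between the starts of two chunks
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```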
Testing
- Write tests for new features and bug fixes
- Ensure all tests pass before submitting a Pull Request
- Use the Ragas evaluation framework to test RAG performance
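A test for a new feature might look like the sketch below, written as plain `test_*` functions with bare asserts so that a runner such as pytest can collect them. The `keyword_recall` function is a deliberately simple, hypothetical stand-in for a retrieval check; it is not part of Ragas and does not replace a Ragas evaluation.

```python
def keyword_recall(contexts: list[str], keywords: list[str]) -> float:
    """Return the fraction of keywords found in at least one retrieved context."""
    if not keywords:
        return 1.0  # nothing was expected, so nothing is missing
    joined = " ".join(contexts).lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in joined)
    return hits / len(keywords)


def test_all_keywords_found() -> None:
    contexts = ["Ragas provides metrics for RAG evaluation."]
    assert keyword_recall(contexts, ["ragas", "metrics"]) == 1.0


def test_missing_keyword_lowers_score() -> None:
    contexts = ["Ragas provides metrics for RAG evaluation."]
    assert keyword_recall(contexts, ["ragas", "agents"]) == 0.5


def test_no_expected_keywords() -> None:
    assert keyword_recall([], []) == 1.0
```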
Documentation
- Update relevant documentation when making changes
- Add docstrings to all functions, classes, and modules
- Keep the README and other documentation up to date
License
By contributing to this project, you agree that your contributions will be licensed under the same license as the project (MIT License).
Contact
If you have any questions or need further clarification, please reach out to the project maintainer via the contact form.