lets_talk / CONTRIBUTING.md
mafzaal's picture
Add vector database creation configuration and update related scripts
a092eef

Contributing to TheDataGuy Chat

Thank you for your interest in contributing to the TheDataGuy Chat project! This document provides guidelines and instructions for contributing to this repository.

Project Overview

TheDataGuy Chat is a Q&A chatbot powered by the content from TheDataGuy blog. It uses RAG (Retrieval Augmented Generation) to provide informative answers about topics such as RAGAS, RAG evaluation, building research agents, metric-driven development, and data science best practices.

Development Environment Setup

Prerequisites

  • Python 3.13 or higher
  • uv for Python package management
  • Docker (optional, for containerized development)
  • OpenAI API key

Local Setup

  1. Clone the repository:

    git clone https://github.com/mafzaal/lets-talk.git
    cd lets-talk
    
  2. Create a .env file with the necessary environment variables:

    OPENAI_API_KEY=your_openai_api_key
    VECTOR_STORAGE_PATH=./db/vector_store_tdg
    LLM_MODEL=gpt-4o-mini
    EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
    
    # Vector Database Creation Configuration (optional)
    FORCE_RECREATE=False      # Whether to force recreation of the vector store
    OUTPUT_DIR=./stats        # Directory to save stats and artifacts
    USE_CHUNKING=True         # Whether to split documents into chunks
    SHOULD_SAVE_STATS=True    # Whether to save statistics about the documents
    
  3. Install dependencies:

    uv init && uv sync
    
  4. Build the vector store:

    ./scripts/build-vector-store.sh
    
  5. Run the application:

    chainlit run py-src/app.py --host 0.0.0.0 --port 7860
    

Using Docker

  1. Build the Docker image:

    docker build -t lets-talk .
    
  2. Run the container:

    docker run -p 7860:7860 --env-file ./.env lets-talk
    

Project Structure

lets-talk/
β”œβ”€β”€ data/                  # Raw blog post content
β”œβ”€β”€ py-src/                # Python source code
β”‚   β”œβ”€β”€ lets_talk/         # Core application modules
β”‚   β”‚   β”œβ”€β”€ agent.py       # Agent implementation
β”‚   β”‚   β”œβ”€β”€ config.py      # Configuration settings
β”‚   β”‚   β”œβ”€β”€ models.py      # Data models
β”‚   β”‚   β”œβ”€β”€ prompts.py     # LLM prompt templates
β”‚   β”‚   β”œβ”€β”€ rag.py         # RAG implementation
β”‚   β”‚   β”œβ”€β”€ rss_tool.py    # RSS feed integration
β”‚   β”‚   β”œβ”€β”€ tools.py       # Tool implementations
β”‚   β”‚   └── utils/         # Utility functions
β”‚   β”œβ”€β”€ app.py             # Main application entry point
β”‚   β”œβ”€β”€ pipeline.py        # Data processing pipeline
β”‚   └── notebooks/         # Jupyter notebooks for analysis
β”œβ”€β”€ db/                    # Vector database storage
β”œβ”€β”€ evals/                 # Evaluation datasets and results
└── scripts/               # Utility scripts

Adding New Blog Posts

When new blog posts are published on TheDataGuy.pro, follow these steps to add them to the chat application:

  1. Add the markdown content to the data/ directory in a new folder named after the post slug
  2. Run the vector store update script:
    python py-src/pipeline.py --force-recreate
    

Workflow

  1. Fork the repository on GitHub
  2. Clone your fork to your local machine
  3. Create a new branch for your feature or bug fix
  4. Make your changes
  5. Run the tests to ensure everything works
  6. Commit your changes with clear, descriptive commit messages
  7. Push your branch to your fork on GitHub
  8. Submit a Pull Request to the main repository

Code Style

  • Follow PEP 8 style guidelines for Python code
  • Use meaningful variable and function names
  • Add docstrings to all functions and classes
  • Include type hints where appropriate

Testing

  • Write tests for new features and bug fixes
  • Ensure all tests pass before submitting a Pull Request
  • Use the Ragas evaluation framework to test RAG performance

Documentation

  • Update relevant documentation when making changes
  • Add docstrings to all functions, classes, and modules
  • Keep the README and other documentation up to date

License

By contributing to this project, you agree that your contributions will be licensed under the same license as the project (MIT License).

Contact

If you have any questions or need further clarification, please reach out to the project maintainer at contact form.