# AI Language Monitor - System Architecture

This diagram shows the complete data flow from model discovery through evaluation to frontend visualization.

```mermaid
flowchart TD
    %% Model Sources
    A1["important_models<br/>Static Curated List"] --> D[load_models]
    A2["get_historical_popular_models<br/>Web Scraping - Top 20"] --> D
    A3["get_current_popular_models<br/>Web Scraping - Top 10"] --> D
    A4["blocklist<br/>Exclusions"] --> D

    %% Model Processing
    D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>~40-50 models"]
    E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
    F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
    G --> H["Enriched Model DataFrame"]
    H --> |Save| I[models.json]

    %% Language Data
    J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]

    %% Task Registry
    L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions"]
    M --> M1["translation_from/to<br/>BLEU + ChrF"]
    M --> M2["classification<br/>Accuracy"]
    M --> M3["mmlu<br/>Accuracy"]
    M --> M4["arc<br/>Accuracy"]
    M --> M5["truthfulqa<br/>Accuracy"]
    M --> M6["mgsm<br/>Accuracy"]

    %% Evaluation Pipeline
    H --> |"models ID"| N["main.py evaluate"]
    K --> |"languages bcp_47"| N
    L --> |"tasks.items"| N
    N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model × Language × Task"]
    O --> |"10 samples each"| P["Evaluation Execution"]

    %% Task Execution
    P --> Q1[translate_and_evaluate]
    P --> Q2[classify_and_evaluate]
    P --> Q3[mmlu_and_evaluate]
    P --> Q4[arc_and_evaluate]
    P --> Q5[truthfulqa_and_evaluate]
    P --> Q6[mgsm_and_evaluate]

    %% API Calls
    Q1 --> |"complete() API"| R["OpenRouter<br/>Model Inference"]
    Q2 --> |"complete() API"| R
    Q3 --> |"complete() API"| R
    Q4 --> |"complete() API"| R
    Q5 --> |"complete() API"| R
    Q6 --> |"complete() API"| R

    %% Results Processing
    R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task"]
    S --> |Save| T[results.json]

    %% Backend & Frontend
    T --> |Read| U[backend.py]
    I --> |Read| U
    U --> |make_model_table| V["Model Rankings"]
    U --> |make_country_table| W["Country Aggregation"]
    U --> |"API Endpoint"| X["FastAPI /api/data"]
    X --> |"JSON Response"| Y["Frontend React App"]

    %% UI Components
    Y --> Z1["WorldMap.js<br/>Country Visualization"]
    Y --> Z2["ModelTable.js<br/>Model Rankings"]
    Y --> Z3["LanguageTable.js<br/>Language Coverage"]
    Y --> Z4["DatasetTable.js<br/>Task Performance"]

    %% Data Sources
    subgraph DS ["Data Sources"]
        DS1["Flores-200<br/>Translation Sentences"]
        DS2["MMLU/AfriMMLU<br/>Knowledge QA"]
        DS3["ARC<br/>Science Reasoning"]
        DS4["TruthfulQA<br/>Truthfulness"]
        DS5["MGSM<br/>Math Problems"]
    end
    DS1 --> Q1
    DS2 --> Q3
    DS3 --> Q4
    DS4 --> Q5
    DS5 --> Q6

    %% Styling
    classDef modelSource fill:#e1f5fe
    classDef evaluation fill:#f3e5f5
    classDef api fill:#fff3e0
    classDef storage fill:#e8f5e8
    classDef frontend fill:#fce4ec

    class A1,A2,A3,A4 modelSource
    class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
    class R,F,G,X api
    class T,I storage
    class Y,Z1,Z2,Z3,Z4 frontend
```

## Architecture Components

### 🔵 Model Discovery (Blue)

- **Static Curated Models**: Handpicked important models for comprehensive evaluation
- **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
- **Quality Control**: Blocklist for problematic or incompatible models
- **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs

### 🟣 Evaluation Pipeline (Purple)

- **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM
- **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations
- **Sample-based**: 10 evaluations per combination for statistical reliability
- **Unified API**: All tasks use OpenRouter's `complete()` function for consistency
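
The combinatorial approach can be pictured as a cross product that is filtered by each model's supported tasks and then expanded into ten sample-level jobs. This is a sketch under assumed data shapes (`model["tasks"]`, a `tasks` dict, `n_samples = 10`), not the actual `evaluate` code in main.py.

```python
# Sketch of the combinatorial setup: every supported model x language x task pair
# is expanded into 10 sample-level jobs. The data shapes here are assumptions.
from itertools import product

n_samples = 10  # evaluations per model x language x task combination

models = [{"id": "openai/gpt-4o", "tasks": {"translation_from", "mmlu", "mgsm"}}]
languages = ["en", "sw", "hi"]                      # BCP-47 codes, e.g. from languages.py
tasks = {"translation_from": None, "mmlu": None, "mgsm": None, "arc": None}

def valid_combinations():
    """Yield one job per (model, language, task, sample), filtered by model.tasks."""
    for model, bcp_47, task_name in product(models, languages, tasks):
        if task_name not in model["tasks"]:          # skip tasks the model is not evaluated on
            continue
        for sample_index in range(n_samples):
            yield model["id"], bcp_47, task_name, sample_index

print(sum(1 for _ in valid_combinations()))  # 1 model x 3 languages x 3 supported tasks x 10 = 90
```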
Knowledge QA"] DS3["ARC
Science Reasoning"] DS4["TruthfulQA
Truthfulness"] DS5["MGSM
Math Problems"] end DS1 --> Q1 DS2 --> Q3 DS3 --> Q4 DS4 --> Q5 DS5 --> Q6 %% Styling classDef modelSource fill:#e1f5fe classDef evaluation fill:#f3e5f5 classDef api fill:#fff3e0 classDef storage fill:#e8f5e8 classDef frontend fill:#fce4ec class A1,A2,A3,A4 modelSource class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation class R,F,G,X api class T,I storage class Y,Z1,Z2,Z3,Z4 frontend ``` ## Architecture Components ### 🔵 Model Discovery (Blue) - **Static Curated Models**: Handpicked important models for comprehensive evaluation - **Dynamic Popular Models**: Real-time discovery of trending models via web scraping - **Quality Control**: Blocklist for problematic or incompatible models - **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs ### 🟣 Evaluation Pipeline (Purple) - **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM - **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations - **Sample-based**: 10 evaluations per combination for statistical reliability - **Unified API**: All tasks use OpenRouter's `complete()` function for consistency ### 🟠 API Integration (Orange) - **OpenRouter**: Primary model inference API for all language model tasks - **HuggingFace**: Model metadata and open-source model information - **Google Translate**: Specialized translation API for comparison baseline ### 🟢 Data Storage (Green) - **results.json**: Aggregated evaluation scores and metrics - **models.json**: Dynamic model list with metadata - **languages.json**: Language information with population data ### 🟡 Frontend Visualization (Pink) - **WorldMap**: Interactive country-level language proficiency visualization - **ModelTable**: Ranked model performance leaderboard - **LanguageTable**: Language coverage and speaker statistics - **DatasetTable**: Task-specific performance breakdowns ## Data Flow Summary 1. **Model Discovery**: Combine curated + trending models → enrich with metadata 2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations 3. **Task Execution**: Run evaluations using appropriate datasets and APIs 4. **Result Processing**: Aggregate scores and save to JSON files 5. **Backend Serving**: FastAPI serves processed data via REST API 6. **Frontend Display**: React app visualizes data through interactive components This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface.