# AI Language Monitor - System Architecture
This diagram shows the complete data flow from model discovery through evaluation to frontend visualization.
```mermaid
flowchart TD
%% Model Sources
A1["important_models
Static Curated List"] --> D[load_models]
A2["get_historical_popular_models
Web Scraping - Top 20"] --> D
A3["get_current_popular_models
Web Scraping - Top 10"] --> D
A4["blocklist
Exclusions"] --> D
%% Model Processing
D --> |"Combine & Dedupe"| E["Dynamic Model List
~40-50 models"]
E --> |get_or_metadata| F["OpenRouter API
Model Metadata"]
F --> |get_hf_metadata| G["HuggingFace API
Model Details"]
G --> H["Enriched Model DataFrame"]
H --> |Save| I[models.json]
%% Language Data
J["languages.py
BCP-47 + Population"] --> K["Top 100 Languages"]
%% Task Registry
L["tasks.py
7 Evaluation Tasks"] --> M["Task Functions"]
M --> M1["translation_from/to
BLEU + ChrF"]
M --> M2["classification
Accuracy"]
M --> M3["mmlu
Accuracy"]
M --> M4["arc
Accuracy"]
M --> M5["truthfulqa
Accuracy"]
M --> M6["mgsm
Accuracy"]
%% Evaluation Pipeline
H --> |"models.id"| N["main.py evaluate"]
K --> |"languages.bcp_47"| N
L --> |"tasks.items"| N
N --> |"Filter by model.tasks"| O["Valid Combinations
Model × Language × Task"]
O --> |"10 samples each"| P["Evaluation Execution"]
%% Task Execution
P --> Q1[translate_and_evaluate]
P --> Q2[classify_and_evaluate]
P --> Q3[mmlu_and_evaluate]
P --> Q4[arc_and_evaluate]
P --> Q5[truthfulqa_and_evaluate]
P --> Q6[mgsm_and_evaluate]
%% API Calls
Q1 --> |"complete() API"| R["OpenRouter
Model Inference"]
Q2 --> |"complete() API"| R
Q3 --> |"complete() API"| R
Q4 --> |"complete() API"| R
Q5 --> |"complete() API"| R
Q6 --> |"complete() API"| R
%% Results Processing
R --> |Scores| S["Result Aggregation
Mean by model+lang+task"]
S --> |Save| T[results.json]
%% Backend & Frontend
T --> |Read| U[backend.py]
I --> |Read| U
U --> |make_model_table| V["Model Rankings"]
U --> |make_country_table| W["Country Aggregation"]
U --> |"API Endpoint"| X["FastAPI /api/data"]
X --> |"JSON Response"| Y["Frontend React App"]
%% UI Components
Y --> Z1["WorldMap.js
Country Visualization"]
Y --> Z2["ModelTable.js
Model Rankings"]
Y --> Z3["LanguageTable.js
Language Coverage"]
Y --> Z4["DatasetTable.js
Task Performance"]
%% Data Sources
subgraph DS ["Data Sources"]
DS1["Flores-200
Translation Sentences"]
DS2["MMLU/AfriMMLU
Knowledge QA"]
DS3["ARC
Science Reasoning"]
DS4["TruthfulQA
Truthfulness"]
DS5["MGSM
Math Problems"]
end
DS1 --> Q1
DS2 --> Q3
DS3 --> Q4
DS4 --> Q5
DS5 --> Q6
%% Styling
classDef modelSource fill:#e1f5fe
classDef evaluation fill:#f3e5f5
classDef api fill:#fff3e0
classDef storage fill:#e8f5e8
classDef frontend fill:#fce4ec
class A1,A2,A3,A4 modelSource
class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
class R,F,G,X api
class T,I storage
class Y,Z1,Z2,Z3,Z4 frontend
```
## Architecture Components
### 🔵 Model Discovery (Blue)
- **Static Curated Models**: Handpicked important models for comprehensive evaluation
- **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
- **Quality Control**: Blocklist for problematic or incompatible models
- **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs
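The combine-and-dedupe step at the top of the diagram can be sketched as follows. This is an illustrative stand-in for `load_models`, not the repository's actual implementation; the model IDs and list contents below are invented for the example.

```python
# Hypothetical sketch of the model-list combination step (load_models).
# All names and model IDs here are illustrative assumptions.

IMPORTANT_MODELS = ["meta-llama/llama-3.1-70b-instruct", "openai/gpt-4o"]
BLOCKLIST = {"example/broken-model"}

def load_models(curated, historical, current, blocklist):
    """Combine sources in priority order, dedupe, and drop blocked IDs."""
    seen, merged = set(), []
    for model_id in curated + historical + current:
        if model_id in seen or model_id in blocklist:
            continue
        seen.add(model_id)
        merged.append(model_id)
    return merged

models = load_models(
    IMPORTANT_MODELS,
    ["openai/gpt-4o", "example/broken-model"],  # scraped top-20 (stub)
    ["mistralai/mistral-large"],                # scraped top-10 (stub)
    BLOCKLIST,
)
# duplicates and blocked entries removed; curated order is preserved
```

Keeping the curated list first means handpicked models always survive deduplication, and scraped entries only extend the tail.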
### 🟣 Evaluation Pipeline (Purple)
- **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM
- **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations
- **Sample-based**: 10 evaluations per combination for statistical reliability
- **Unified API**: All tasks use OpenRouter's `complete()` function for consistency
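The combinatorial grid above can be sketched with `itertools.product`. The record shape (a `tasks` field listing what each model supports) is an assumption for illustration, not the repo's actual schema.

```python
from itertools import product

# Illustrative sketch of the Model × Language × Task grid; field names
# and the toy data are assumptions, not the project's real schema.
models = [{"id": "m1", "tasks": ["translation_from", "mmlu"]},
          {"id": "m2", "tasks": ["mmlu"]}]
languages = ["en", "sw", "hi"]
tasks = ["translation_from", "mmlu"]
N_SAMPLES = 10

combinations = [
    (m["id"], lang, task, sample)
    for m, lang, task in product(models, languages, tasks)
    if task in m["tasks"]              # filter by model.tasks
    for sample in range(N_SAMPLES)     # 10 samples per valid combination
]
# m1 supports 2 tasks, m2 supports 1 → (2 + 1) × 3 languages × 10 = 90
```

Filtering before sampling keeps unsupported Model × Task pairs out of the queue entirely, rather than failing at inference time.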
### 🟠 API Integration (Orange)
- **OpenRouter**: Primary model inference API for all language model tasks
- **HuggingFace**: Model metadata and open-source model information
- **Google Translate**: Specialized translation API used as a comparison baseline
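Since OpenRouter exposes an OpenAI-compatible chat-completions endpoint, the shared `complete()` helper used by all six `*_and_evaluate` functions plausibly boils down to one request shape. The sketch below builds that request with the standard library; the helper's real signature and error handling will differ.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model_id, prompt, api_key):
    """Build the chat-completion request shared by every evaluation task
    (a sketch of the unified complete() helper, not the real code)."""
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Sending it requires a valid key:
# resp = urllib.request.urlopen(build_request("openai/gpt-4o", "...", key))
```

Because every task funnels through the same request shape, adding a new task only means writing a new prompt builder and scorer.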
### 🟢 Data Storage (Green)
- **results.json**: Aggregated evaluation scores and metrics
- **models.json**: Dynamic model list with metadata
- **languages.json**: Language information with population data
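The aggregation feeding `results.json` ("mean by model+lang+task" in the diagram) can be sketched with the standard library; the raw-score record shape and the toy scores below are assumptions.

```python
import json
from collections import defaultdict
from statistics import mean

# Sketch of result aggregation: mean score per (model, language, task).
# Record fields and values are illustrative, not the project's schema.
raw = [
    {"model": "m1", "bcp_47": "sw", "task": "mmlu", "score": 0.6},
    {"model": "m1", "bcp_47": "sw", "task": "mmlu", "score": 0.8},
    {"model": "m1", "bcp_47": "sw", "task": "mgsm", "score": 0.4},
]

grouped = defaultdict(list)
for r in raw:
    grouped[(r["model"], r["bcp_47"], r["task"])].append(r["score"])

results = [
    {"model": m, "bcp_47": l, "task": t, "score": mean(scores)}
    for (m, l, t), scores in sorted(grouped.items())
]
# with open("results.json", "w") as f:
#     json.dump(results, f, indent=2)
```

Aggregating before saving keeps `results.json` small and lets the backend serve precomputed means instead of raw per-sample scores.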
### 🩷 Frontend Visualization (Pink)
- **WorldMap**: Interactive country-level language proficiency visualization
- **ModelTable**: Ranked model performance leaderboard
- **LanguageTable**: Language coverage and speaker statistics
- **DatasetTable**: Task-specific performance breakdowns
## Data Flow Summary
1. **Model Discovery**: Combine curated + trending models → enrich with metadata
2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations
3. **Task Execution**: Run evaluations using appropriate datasets and APIs
4. **Result Processing**: Aggregate scores and save to JSON files
5. **Backend Serving**: FastAPI serves processed data via REST API
6. **Frontend Display**: React app visualizes data through interactive components
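The ranking step in the serving path (`make_model_table` in the diagram) can be sketched as: average each model's aggregated scores, then sort descending. The field names and scores below are assumptions; the real `backend.py` produces a richer table.

```python
from collections import defaultdict
from statistics import mean

def make_model_table(results):
    """Rank models by mean score across all language/task results
    (a sketch of backend.py's make_model_table, not the real code)."""
    per_model = defaultdict(list)
    for row in results:
        per_model[row["model"]].append(row["score"])
    table = [{"model": m, "average": mean(s)} for m, s in per_model.items()]
    return sorted(table, key=lambda r: r["average"], reverse=True)

results = [
    {"model": "m1", "score": 0.9},
    {"model": "m1", "score": 0.7},
    {"model": "m2", "score": 0.85},
]
# m2 (mean 0.85) ranks above m1 (mean 0.80)
```

The same group-and-reduce pattern underlies `make_country_table`, just keyed by country instead of model.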
This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface.