# AI Language Monitor - System Architecture
This diagram shows the complete data flow from model discovery through evaluation to frontend visualization.
```mermaid
flowchart TD
%% Model Sources
A1["important_models<br/>Static Curated List"] --> D[load_models]
A2["get_historical_popular_models<br/>Web Scraping - Top 20"] --> D
A3["get_current_popular_models<br/>Web Scraping - Top 10"] --> D
A4["blocklist<br/>Exclusions"] --> D
%% Model Processing
D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>~40-50 models"]
E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
G --> H["Enriched Model DataFrame"]
H --> |Save| I[models.json]
%% Language Data
J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]
%% Task Registry
L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions"]
M --> M1["translation_from/to<br/>BLEU + ChrF"]
M --> M2["classification<br/>Accuracy"]
M --> M3["mmlu<br/>Accuracy"]
M --> M4["arc<br/>Accuracy"]
M --> M5["truthfulqa<br/>Accuracy"]
M --> M6["mgsm<br/>Accuracy"]
%% Evaluation Pipeline
H --> |"models ID"| N["main.py evaluate"]
K --> |"languages bcp_47"| N
L --> |"tasks.items"| N
N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model Γ Language Γ Task"]
O --> |"10 samples each"| P["Evaluation Execution"]
%% Task Execution
P --> Q1[translate_and_evaluate]
P --> Q2[classify_and_evaluate]
P --> Q3[mmlu_and_evaluate]
P --> Q4[arc_and_evaluate]
P --> Q5[truthfulqa_and_evaluate]
P --> Q6[mgsm_and_evaluate]
%% API Calls
Q1 --> |"complete() API"| R["OpenRouter<br/>Model Inference"]
Q2 --> |"complete() API"| R
Q3 --> |"complete() API"| R
Q4 --> |"complete() API"| R
Q5 --> |"complete() API"| R
Q6 --> |"complete() API"| R
%% Results Processing
R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task"]
S --> |Save| T[results.json]
%% Backend & Frontend
T --> |Read| U[backend.py]
I --> |Read| U
U --> |make_model_table| V["Model Rankings"]
U --> |make_country_table| W["Country Aggregation"]
U --> |"API Endpoint"| X["FastAPI /api/data"]
X --> |"JSON Response"| Y["Frontend React App"]
%% UI Components
Y --> Z1["WorldMap.js<br/>Country Visualization"]
Y --> Z2["ModelTable.js<br/>Model Rankings"]
Y --> Z3["LanguageTable.js<br/>Language Coverage"]
Y --> Z4["DatasetTable.js<br/>Task Performance"]
%% Data Sources
subgraph DS ["Data Sources"]
DS1["Flores-200<br/>Translation Sentences"]
DS2["MMLU/AfriMMLU<br/>Knowledge QA"]
DS3["ARC<br/>Science Reasoning"]
DS4["TruthfulQA<br/>Truthfulness"]
DS5["MGSM<br/>Math Problems"]
end
DS1 --> Q1
DS2 --> Q3
DS3 --> Q4
DS4 --> Q5
DS5 --> Q6
%% Styling
classDef modelSource fill:#e1f5fe
classDef evaluation fill:#f3e5f5
classDef api fill:#fff3e0
classDef storage fill:#e8f5e8
classDef frontend fill:#fce4ec
class A1,A2,A3,A4 modelSource
class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
class R,F,G,X api
class T,I storage
class Y,Z1,Z2,Z3,Z4 frontend
```
## Architecture Components
### 🔵 Model Discovery (Blue)
- **Static Curated Models**: Handpicked important models for comprehensive evaluation
- **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
- **Quality Control**: Blocklist for problematic or incompatible models
- **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs
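The combine-and-dedupe step is simple in principle. Below is a minimal sketch of what `load_models` and the enrichment step might look like; the three source lists, the blocklist, and the `get_or_metadata`/`get_hf_metadata` helpers come from the diagram, but their exact signatures are assumptions:

```python
import json

def load_models(important_models, historical_popular, current_popular, blocklist):
    """Combine curated and scraped model IDs, drop blocked entries, dedupe.

    All arguments are assumed to be lists of model ID strings.
    """
    combined = important_models + historical_popular + current_popular
    seen, models = set(), []
    for model_id in combined:
        if model_id in blocklist or model_id in seen:
            continue
        seen.add(model_id)
        models.append(model_id)
    return models  # ~40-50 IDs after deduplication, per the diagram

def enrich_and_save(model_ids, get_or_metadata, get_hf_metadata, path="models.json"):
    """Hypothetical enrichment step: OpenRouter metadata, then HuggingFace details."""
    rows = []
    for model_id in model_ids:
        row = get_or_metadata(model_id)        # OpenRouter model metadata
        row.update(get_hf_metadata(model_id))  # HuggingFace model details
        rows.append(row)
    with open(path, "w") as f:
        json.dump(rows, f, indent=2)
```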
### 🟣 Evaluation Pipeline (Purple)
- **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM
- **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations
- **Sample-based**: 10 evaluations per combination for statistical reliability
- **Unified API**: All tasks use OpenRouter's `complete()` function for consistency
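A sketch of the combinatorial setup, assuming each model record carries a `tasks` field and each language a `bcp_47` code, as the diagram's edge labels suggest; the real `main.py` may structure this differently, and each task function is assumed to return a dict of metric scores:

```python
from itertools import product

N_SAMPLES = 10  # "10 samples each", per the pipeline

def valid_combinations(models, languages, tasks):
    """Yield every supported Model x Language x Task triple."""
    for model, lang, (task_name, task_fn) in product(models, languages, tasks.items()):
        if task_name in model["tasks"]:  # the "Filter by model.tasks" step
            yield model, lang, task_name, task_fn

async def evaluate(models, languages, tasks):
    scores = []
    for model, lang, task_name, task_fn in valid_combinations(models, languages, tasks):
        for sample in range(N_SAMPLES):
            score = await task_fn(model["id"], lang["bcp_47"], sample)
            scores.append({"model": model["id"], "bcp_47": lang["bcp_47"],
                           "task": task_name, **score})
    return scores
```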
### 🟠 API Integration (Orange)
- **OpenRouter**: Primary model inference API for all language model tasks
- **HuggingFace**: Model metadata and open-source model information
- **Google Translate**: Specialized translation API used as a comparison baseline
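All task functions funnel through a single `complete()` call. Here is a minimal sketch against OpenRouter's OpenAI-compatible chat completions endpoint; the actual wrapper in the codebase may differ, and `OPENROUTER_API_KEY` is an assumed environment variable:

```python
import os
import httpx

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

async def complete(model: str, prompt: str, temperature: float = 0.0) -> str:
    """One chat completion via OpenRouter; returns the model's text response."""
    async with httpx.AsyncClient(timeout=60) as client:
        response = await client.post(
            OPENROUTER_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": temperature,
            },
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
```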
### 🟢 Data Storage (Green)
- **results.json**: Aggregated evaluation scores and metrics
- **models.json**: Dynamic model list with metadata
- **languages.json**: Language information with population data
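The "Result Aggregation" node reduces raw per-sample scores to a mean per model, language, and task before writing `results.json`. A hedged sketch with pandas; the column names follow the diagram's labels and are assumptions:

```python
import pandas as pd

def aggregate_and_save(raw_scores: list, path: str = "results.json"):
    """Mean score per model + language + task, then persist as JSON records."""
    df = pd.DataFrame(raw_scores)  # expected columns: model, bcp_47, task, metric, score
    aggregated = (
        df.groupby(["model", "bcp_47", "task", "metric"], as_index=False)["score"]
        .mean()
    )
    aggregated.to_json(path, orient="records", indent=2)
```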
### 🟡 Frontend Visualization (Pink)
- **WorldMap**: Interactive country-level language proficiency visualization
- **ModelTable**: Ranked model performance leaderboard
- **LanguageTable**: Language coverage and speaker statistics
- **DatasetTable**: Task-specific performance breakdowns
## Data Flow Summary
1. **Model Discovery**: Combine curated + trending models → enrich with metadata
2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations
3. **Task Execution**: Run evaluations using appropriate datasets and APIs
4. **Result Processing**: Aggregate scores and save to JSON files
5. **Backend Serving**: FastAPI serves processed data via REST API (sketched below)
6. **Frontend Display**: React app visualizes data through interactive components
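For concreteness, step 5 could be as small as the following FastAPI sketch. The `/api/data` path comes from the diagram, while the response shape and the omitted `make_model_table`/`make_country_table` transforms are assumptions:

```python
import json
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/data")
async def get_data():
    """Serve the merged evaluation results and model metadata to the frontend."""
    with open("results.json") as f:
        results = json.load(f)
    with open("models.json") as f:
        models = json.load(f)
    # In the real backend, make_model_table / make_country_table would
    # shape these records into rankings and country aggregates first.
    return {"results": results, "models": models}
```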
This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface.