Commit 29f1683 (verified), committed by davidpomerenke · Parent: 6234f5c

Upload from GitHub Actions: added system architecture overview

Files changed (1): system_architecture_diagram.md (+141, −0)
# AI Language Monitor - System Architecture

This diagram shows the complete data flow from model discovery through evaluation to frontend visualization.
```mermaid
flowchart TD
    %% Model Sources
    A1["important_models<br/>Static Curated List"] --> D[load_models]
    A2["get_historical_popular_models<br/>Web Scraping - Top 20"] --> D
    A3["get_current_popular_models<br/>Web Scraping - Top 10"] --> D
    A4["blocklist<br/>Exclusions"] --> D

    %% Model Processing
    D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>~40-50 models"]
    E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
    F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
    G --> H["Enriched Model DataFrame"]
    H --> |Save| I[models.json]

    %% Language Data
    J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]

    %% Task Registry
    L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions"]
    M --> M1["translation_from/to<br/>BLEU + ChrF"]
    M --> M2["classification<br/>Accuracy"]
    M --> M3["mmlu<br/>Accuracy"]
    M --> M4["arc<br/>Accuracy"]
    M --> M5["truthfulqa<br/>Accuracy"]
    M --> M6["mgsm<br/>Accuracy"]

    %% Evaluation Pipeline
    H --> |"model IDs"| N["main.py evaluate"]
    K --> |"languages bcp_47"| N
    L --> |"tasks.items"| N
    N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model × Language × Task"]
    O --> |"10 samples each"| P["Evaluation Execution"]

    %% Task Execution
    P --> Q1[translate_and_evaluate]
    P --> Q2[classify_and_evaluate]
    P --> Q3[mmlu_and_evaluate]
    P --> Q4[arc_and_evaluate]
    P --> Q5[truthfulqa_and_evaluate]
    P --> Q6[mgsm_and_evaluate]

    %% API Calls
    Q1 --> |"complete() API"| R["OpenRouter<br/>Model Inference"]
    Q2 --> |"complete() API"| R
    Q3 --> |"complete() API"| R
    Q4 --> |"complete() API"| R
    Q5 --> |"complete() API"| R
    Q6 --> |"complete() API"| R

    %% Results Processing
    R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task"]
    S --> |Save| T[results.json]

    %% Backend & Frontend
    T --> |Read| U[backend.py]
    I --> |Read| U
    U --> |make_model_table| V["Model Rankings"]
    U --> |make_country_table| W["Country Aggregation"]
    U --> |"API Endpoint"| X["FastAPI /api/data"]
    X --> |"JSON Response"| Y["Frontend React App"]

    %% UI Components
    Y --> Z1["WorldMap.js<br/>Country Visualization"]
    Y --> Z2["ModelTable.js<br/>Model Rankings"]
    Y --> Z3["LanguageTable.js<br/>Language Coverage"]
    Y --> Z4["DatasetTable.js<br/>Task Performance"]

    %% Data Sources
    subgraph DS ["Data Sources"]
        DS1["Flores-200<br/>Translation Sentences"]
        DS2["MMLU/AfriMMLU<br/>Knowledge QA"]
        DS3["ARC<br/>Science Reasoning"]
        DS4["TruthfulQA<br/>Truthfulness"]
        DS5["MGSM<br/>Math Problems"]
    end

    DS1 --> Q1
    DS2 --> Q3
    DS3 --> Q4
    DS4 --> Q5
    DS5 --> Q6

    %% Styling
    classDef modelSource fill:#e1f5fe
    classDef evaluation fill:#f3e5f5
    classDef api fill:#fff3e0
    classDef storage fill:#e8f5e8
    classDef frontend fill:#fce4ec

    class A1,A2,A3,A4 modelSource
    class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
    class R,F,G,X api
    class T,I storage
    class Y,Z1,Z2,Z3,Z4 frontend
```

## Architecture Components

### 🔵 Model Discovery (Blue)
- **Static Curated Models**: Hand-picked important models for comprehensive evaluation
- **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
- **Quality Control**: A blocklist excludes problematic or incompatible models
- **Metadata Enrichment**: Rich model information from the OpenRouter and HuggingFace APIs

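The combine-and-dedupe step could be sketched roughly as follows. The names `important_models`, `blocklist`, and `load_models` come from the diagram; the list contents and exact merge logic are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch of model discovery: merge curated and scraped lists in
# priority order, keep first occurrence of each ID, drop blocklisted models.
# All model IDs below are made-up placeholders.

important_models = ["openai/gpt-4o", "meta-llama/llama-3-70b-instruct"]
historical_popular = ["openai/gpt-4o", "anthropic/claude-3-opus"]   # web-scraped top 20
current_popular = ["mistralai/mistral-large"]                       # web-scraped top 10
blocklist = ["anthropic/claude-3-opus"]

def load_models() -> list[str]:
    """Combine curated and scraped model IDs, dedupe, and apply the blocklist."""
    seen: set[str] = set()
    combined: list[str] = []
    for model_id in important_models + historical_popular + current_popular:
        if model_id not in seen and model_id not in blocklist:
            seen.add(model_id)
            combined.append(model_id)
    return combined

print(load_models())
```

The deduped list would then be enriched with OpenRouter and HuggingFace metadata before being saved to models.json.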
### 🟣 Evaluation Pipeline (Purple)
- **7 Active Tasks**: Translation (both directions), Classification, MMLU, ARC, TruthfulQA, MGSM
- **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations
- **Sample-Based**: 10 evaluations per combination for statistical reliability
- **Unified API**: All tasks use OpenRouter's `complete()` function for consistency

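The combinatorial setup above might look something like this sketch. The dict shape, the `tasks` field, and `N_SAMPLES` are assumptions; only the filter-by-supported-tasks and 10-samples-per-combination behavior is taken from the description:

```python
# Sketch of generating valid Model x Language x Task combinations,
# each evaluated on a fixed number of samples.
from itertools import product

N_SAMPLES = 10  # samples per combination, as described above

models = [{"id": "model-a", "tasks": {"translation_from", "mmlu"}}]
languages = ["en", "sw"]  # BCP-47 codes
tasks = ["translation_from", "mmlu", "arc"]

combinations = [
    (model["id"], lang, task, sample)
    for model, lang, task in product(models, languages, tasks)
    if task in model["tasks"]       # filter by model.tasks
    for sample in range(N_SAMPLES)  # 10 evaluations each
]

print(len(combinations))  # 1 model x 2 languages x 2 supported tasks x 10 samples = 40
```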
### 🟠 API Integration (Orange)
- **OpenRouter**: Primary model inference API for all language model tasks
- **HuggingFace**: Model metadata and open-source model information
- **Google Translate**: Specialized translation API used as a comparison baseline

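As a rough sketch (not the project's actual code), a unified `complete()` wrapper over OpenRouter's OpenAI-compatible chat completions endpoint could look like this; the `build_request` helper and the environment-variable name are hypothetical, and retries/error handling are omitted:

```python
# Hedged sketch of a shared completion wrapper for all task evaluators.
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Build the JSON payload shared by all task evaluators (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def complete(model: str, prompt: str) -> str:
    """Send one inference request and return the model's text response."""
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Routing every task through one wrapper keeps scoring code independent of the inference backend.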
### 🟢 Data Storage (Green)
- **results.json**: Aggregated evaluation scores and metrics
- **models.json**: Dynamic model list with metadata
- **languages.json**: Language information with population data

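The aggregation that produces results.json (mean score per model + language + task, per the diagram) can be illustrated with a small stdlib-only sketch; the record field names are assumptions:

```python
# Sketch of result aggregation: group raw scores by (model, language, task),
# average them, and save the aggregate to results.json.
import json
from collections import defaultdict
from statistics import mean

raw_results = [  # illustrative per-sample scores
    {"model": "model-a", "bcp_47": "sw", "task": "mmlu", "score": 0.8},
    {"model": "model-a", "bcp_47": "sw", "task": "mmlu", "score": 0.6},
]

grouped: dict[tuple, list[float]] = defaultdict(list)
for r in raw_results:
    grouped[(r["model"], r["bcp_47"], r["task"])].append(r["score"])

aggregated = [
    {"model": m, "bcp_47": lang, "task": t, "score": mean(scores)}
    for (m, lang, t), scores in grouped.items()
]

with open("results.json", "w") as f:
    json.dump(aggregated, f, indent=2)
```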
### 🟡 Frontend Visualization (Pink)
- **WorldMap**: Interactive country-level language proficiency visualization
- **ModelTable**: Ranked model performance leaderboard
- **LanguageTable**: Language coverage and speaker statistics
- **DatasetTable**: Task-specific performance breakdowns

## Data Flow Summary

1. **Model Discovery**: Combine curated + trending models → enrich with metadata
2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations
3. **Task Execution**: Run evaluations using the appropriate datasets and APIs
4. **Result Processing**: Aggregate scores and save to JSON files
5. **Backend Serving**: FastAPI serves processed data via REST API
6. **Frontend Display**: React app visualizes data through interactive components

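As one example of the backend tables built in step 5, the country aggregation behind the world map could weight each country's language scores by speaker population. Only the function name `make_country_table` comes from the diagram; the weighting scheme and data shapes below are assumptions:

```python
# Hypothetical sketch of country-level aggregation: a population-weighted
# mean of per-language scores for each country. All numbers are placeholders.
lang_scores = {"en": 0.9, "sw": 0.5}  # mean score per language

country_langs = {  # speakers per language per country (illustrative)
    "KE": {"sw": 40_000_000, "en": 10_000_000},
}

def make_country_table(scores: dict, countries: dict) -> dict:
    """Aggregate language scores into one score per country, weighted by speakers."""
    table = {}
    for country, langs in countries.items():
        total = sum(langs.values())
        table[country] = sum(
            scores[lang] * pop / total
            for lang, pop in langs.items()
            if lang in scores
        )
    return table

print(make_country_table(lang_scores, country_langs))
```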
This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface.