# Bonzai Leaderboard
Internal tool for tracking agent performance across benchmarks, comparing runs, and analyzing lab statistics.
## What It Does

- **Agent Rankings**: View how agents perform across benchmarks (JINGLE, OSS, TBENCH)
- **Lab Statistics**: See success rates per lab and identify flaky targets
- **Run Comparison**: Compare two runs side-by-side
- **Historical Trends**: Track lab success rates over time
## Architecture

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│   Svelte UI   │────▶│   Flask API   │────▶│   BigQuery    │
│  (TypeScript) │     │   (Python)    │     │    + GCS      │
└───────────────┘     └───────────────┘     └───────────────┘
```

| Component | Tech Stack | Purpose |
|-----------|------------|---------|
| Frontend | Svelte 5, TypeScript, Vite | Leaderboard tables, charts, comparison views |
| Backend | Flask, Python 3.13+ | API endpoints, data aggregation |
| Data | BigQuery, GCS | Agent run artifacts, metrics, reports |
| Auth | IAP | Google Workspace SSO |
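The backend's "data aggregation" role can be sketched in plain Python. The record shape below (`lab` and `success` keys) is a hypothetical simplification, not the real `agent-summary.json` schema:

```python
from collections import defaultdict

def lab_success_rates(runs):
    """Aggregate per-lab success rates from run summaries.

    `runs` is a list of dicts with hypothetical keys `lab` and
    `success`; the actual summary schema may differ.
    """
    totals = defaultdict(lambda: [0, 0])  # lab -> [successes, attempts]
    for run in runs:
        stats = totals[run["lab"]]
        stats[1] += 1
        if run["success"]:
            stats[0] += 1
    return {lab: successes / attempts for lab, (successes, attempts) in totals.items()}

runs = [
    {"lab": "juice-shop", "success": True},
    {"lab": "juice-shop", "success": False},
    {"lab": "dvwp", "success": True},
]
print(lab_success_rates(runs))  # {'juice-shop': 0.5, 'dvwp': 1.0}
```

In production this aggregation happens in BigQuery; the sketch only illustrates the shape of the computation.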
## Key Concepts

### Benchmarks

| Benchmark | Description |
|-----------|-------------|
| JINGLE | Jingle benchmark: specific lab set |
| OSS | Open-source vulnerable apps (Juice Shop, DVWP, etc.) |
| TBENCH | Tenzai-built complex applications |
| Custom | User-defined lab selections |
### Data Pipeline

Agent runs are stored in GCS with Hive-style partitioning:

```
gs://tenzai-agent-run-artifacts/
└── year=*/month=*/day=*/hour=*/run_id=*/agent_id=*/
    ├── run_config_fixed.json   # Agent config
    ├── report.json             # Detailed findings
    └── agent-summary.json      # Run summary
```

BigQuery external tables mount this data, and scheduled queries refresh native tables hourly.
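The partition layout above can be expressed as a small path-building helper. Zero-padding of the month/day/hour segments is an assumption about the real layout:

```python
from datetime import datetime, timezone

BUCKET = "gs://tenzai-agent-run-artifacts"

def artifact_prefix(ts, run_id, agent_id):
    """Build the Hive-style partition prefix for one run's artifacts.

    Zero-padded month/day/hour is assumed; verify against actual
    object paths with `gsutil ls` before relying on it.
    """
    return (
        f"{BUCKET}/year={ts.year}/month={ts.month:02d}/"
        f"day={ts.day:02d}/hour={ts.hour:02d}/"
        f"run_id={run_id}/agent_id={agent_id}/"
    )

ts = datetime(2025, 6, 1, 14, tzinfo=timezone.utc)
print(artifact_prefix(ts, "run-123", "agent-7"))
# gs://tenzai-agent-run-artifacts/year=2025/month=06/day=01/hour=14/run_id=run-123/agent_id=agent-7/
```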
## Local Development

### Prerequisites

- Python 3.13+ (via `uv`)
- Node.js 22+ (via `nvm`)
- Docker, `gcloud` CLI, `just` task runner
### Quick Start

```shell
# Clone the repo
git clone git@github.com:TenzaiLtd/leaderboard.git
cd leaderboard

# Set up the UI (installs Node.js 22 and dependencies)
just setup-ui

# Run backend (Terminal 1)
just dev

# Run UI with hot reload (Terminal 2)
just ui
```
### Common Commands

| Command | Description |
|---------|-------------|
| `just dev` | Run the Flask backend |
| `just ui` | Run the Svelte UI dev server |
| `just build-ui` | Build the UI for production |
| `just docker-run` | Build and run in Docker |
| `just deploy` | Deploy to Cloud Run |
## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `/api/leaderboard` | Main leaderboard data (agent rankings) |
| `/api/leaderboard/benchmarks` | Available benchmark configurations |
| `/api/leaderboard/run/<run_id>/details` | Detailed metrics for a run |
| `/api/leaderboard/compare` | Compare two runs |
| `/api/labs` | Lab-level statistics |
| `/api/labs/<lab_name>/timeseries` | Historical success rate |
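A quick sketch of constructing request URLs for these endpoints. The base host is a placeholder, and the query parameter names for the compare endpoint (`run_a`, `run_b`) are assumptions, not the documented contract:

```python
from urllib.parse import urlencode

# Hypothetical base URL; in practice you reach the Cloud Run service through IAP.
BASE = "https://leaderboard.example.internal"

def compare_url(run_a, run_b):
    """URL for the run-comparison endpoint.

    Parameter names `run_a`/`run_b` are assumed, not confirmed by the API.
    """
    return f"{BASE}/api/leaderboard/compare?{urlencode({'run_a': run_a, 'run_b': run_b})}"

def timeseries_url(lab_name):
    """URL for a lab's historical success-rate series."""
    return f"{BASE}/api/labs/{lab_name}/timeseries"

print(compare_url("run-123", "run-456"))
print(timeseries_url("juice-shop"))
```

Check `backend/` for the authoritative route and parameter definitions.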
## GCP Resources

| Resource | Value |
|----------|-------|
| Project | `annular-fold-460418-r3` |
| Dataset | `evaluation_reports` |
| GCS Bucket | `gs://tenzai-agent-run-artifacts` |
| Cloud Run Region | `europe-west1` |
## Repository

URL: https://github.com/TenzaiLtd/leaderboard

| Path | Purpose |
|------|---------|
| `backend/` | Flask API, BigQuery queries, data logic |
| `ui/` | Svelte frontend |
| `static/` | Built UI assets |
| `justfile` | Task runner commands |
## Troubleshooting

### GCP Auth Errors

```shell
gcloud auth login
gcloud auth application-default login
```
### Data Not Showing

- New runs can take up to 1 hour to appear (the scheduled query refresh interval)
- Check whether the run exists in GCS:

```shell
gsutil ls gs://tenzai-agent-run-artifacts/year=.../
```
### Node Version Issues

```shell
cd ui
nvm install
nvm use
```