# Bonzai Leaderboard
Internal tool for tracking agent performance across benchmarks, comparing runs, and analyzing lab statistics.
## What It Does

- **Agent Rankings**: View how agents perform across benchmarks (JINGLE, OSS, TBENCH)
- **Lab Statistics**: See success rates per lab and identify flaky targets
- **Run Comparison**: Compare two runs side-by-side
- **Historical Trends**: Track lab success rates over time
## Architecture

```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│   Svelte UI   │────▶│   Flask API   │────▶│   BigQuery    │
│  (TypeScript) │     │   (Python)    │     │    + GCS      │
└───────────────┘     └───────────────┘     └───────────────┘
```

| Component | Tech Stack | Purpose |
|-----------|------------|---------|
| Frontend | Svelte 5, TypeScript, Vite | Leaderboard tables, charts, comparison views |
| Backend | Flask, Python 3.13+ | API endpoints, data aggregation |
| Data | BigQuery, GCS | Agent run artifacts, metrics, reports |
| Auth | IAP | Google Workspace SSO |
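The backend's "data aggregation" role can be sketched in plain Python. The record shape below (`lab` and `success` keys) is a hypothetical simplification, not the real `agent-summary.json` schema:

```python
from collections import defaultdict

def lab_success_rates(runs):
    """Aggregate per-lab success rates from run summaries.

    `runs` is a list of dicts with hypothetical keys `lab` and
    `success`; the actual summary schema may differ.
    """
    totals = defaultdict(lambda: [0, 0])  # lab -> [successes, attempts]
    for run in runs:
        stats = totals[run["lab"]]
        stats[1] += 1
        if run["success"]:
            stats[0] += 1
    return {lab: successes / attempts for lab, (successes, attempts) in totals.items()}

runs = [
    {"lab": "juice-shop", "success": True},
    {"lab": "juice-shop", "success": False},
    {"lab": "dvwp", "success": True},
]
print(lab_success_rates(runs))  # {'juice-shop': 0.5, 'dvwp': 1.0}
```

In production this aggregation happens in BigQuery; the sketch only illustrates the shape of the computation.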
## Key Concepts

### Benchmarks

| Benchmark | Description |
|-----------|-------------|
| JINGLE | Jingle benchmark: specific lab set |
| OSS | Open-source vulnerable apps (Juice Shop, DVWP, etc.) |
| TBENCH | Tenzai-built complex applications |
| Custom | User-defined lab selections |
### Data Pipeline

Agent runs are stored in GCS with Hive-style partitioning:

```
gs://tenzai-agent-run-artifacts/
└── year=*/month=*/day=*/hour=*/run_id=*/agent_id=*/
    ├── run_config_fixed.json   # Agent config
    ├── report.json             # Detailed findings
    └── agent-summary.json      # Run summary
```

BigQuery external tables mount this data, and scheduled queries refresh native tables hourly.
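The partition layout above can be expressed as a small path-building helper. Zero-padding of the month/day/hour segments is an assumption about the real layout:

```python
from datetime import datetime, timezone

BUCKET = "gs://tenzai-agent-run-artifacts"

def artifact_prefix(ts, run_id, agent_id):
    """Build the Hive-style partition prefix for one run's artifacts.

    Zero-padded month/day/hour is assumed; verify against actual
    object paths with `gsutil ls` before relying on it.
    """
    return (
        f"{BUCKET}/year={ts.year}/month={ts.month:02d}/"
        f"day={ts.day:02d}/hour={ts.hour:02d}/"
        f"run_id={run_id}/agent_id={agent_id}/"
    )

ts = datetime(2025, 6, 1, 14, tzinfo=timezone.utc)
print(artifact_prefix(ts, "run-123", "agent-7"))
# gs://tenzai-agent-run-artifacts/year=2025/month=06/day=01/hour=14/run_id=run-123/agent_id=agent-7/
```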
## Local Development

### Prerequisites

- Python 3.13+ (via `uv`)
- Node.js 22+ (via `nvm`)
- Docker, `gcloud` CLI, `just` task runner
### Quick Start

```shell
# Clone the repo
git clone git@github.com:TenzaiLtd/leaderboard.git
cd leaderboard

# Set up the UI (installs Node.js 22 and dependencies)
just setup-ui

# Run backend (Terminal 1)
just dev

# Run UI with hot reload (Terminal 2)
just ui
```
### Common Commands

| Command | Description |
|---------|-------------|
| `just dev` | Run the Flask backend |
| `just ui` | Run the Svelte UI dev server |
| `just build-ui` | Build the UI for production |
| `just docker-run` | Build and run in Docker |
| `just deploy` | Deploy to Cloud Run |
## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `/api/leaderboard` | Main leaderboard data (agent rankings) |
| `/api/leaderboard/benchmarks` | Available benchmark configurations |
| `/api/leaderboard/run/<run_id>/details` | Detailed metrics for a run |
| `/api/leaderboard/compare` | Compare two runs |
| `/api/labs` | Lab-level statistics |
| `/api/labs/<lab_name>/timeseries` | Historical success rate |
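A quick sketch of constructing request URLs for these endpoints. The base host is a placeholder, and the query parameter names for the compare endpoint (`run_a`, `run_b`) are assumptions, not the documented contract:

```python
from urllib.parse import urlencode

# Hypothetical base URL; in practice you reach the Cloud Run service through IAP.
BASE = "https://leaderboard.example.internal"

def compare_url(run_a, run_b):
    """URL for the run-comparison endpoint.

    Parameter names `run_a`/`run_b` are assumed, not confirmed by the API.
    """
    return f"{BASE}/api/leaderboard/compare?{urlencode({'run_a': run_a, 'run_b': run_b})}"

def timeseries_url(lab_name):
    """URL for a lab's historical success-rate series."""
    return f"{BASE}/api/labs/{lab_name}/timeseries"

print(compare_url("run-123", "run-456"))
print(timeseries_url("juice-shop"))
```

Check `backend/` for the authoritative route and parameter definitions.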
## GCP Resources

| Resource | Value |
|----------|-------|
| Project | `annular-fold-460418-r3` |
| Dataset | `evaluation_reports` |
| GCS Bucket | `gs://tenzai-agent-run-artifacts` |
| Cloud Run Region | `europe-west1` |
## Repository

URL: https://github.com/TenzaiLtd/leaderboard

| Path | Purpose |
|------|---------|
| `backend/` | Flask API, BigQuery queries, data logic |
| `ui/` | Svelte frontend |
| `static/` | Built UI assets |
| `justfile` | Task runner commands |
## Troubleshooting

### GCP Auth Errors

```shell
gcloud auth login
gcloud auth application-default login
```
### Data Not Showing

- New runs can take up to 1 hour to appear (the scheduled query refresh interval)
- Check whether the run exists in GCS:

```shell
gsutil ls gs://tenzai-agent-run-artifacts/year=.../
```
### Node Version Issues

```shell
cd ui
nvm install
nvm use
```