Benchmarks | Tenzai Onboarding

T-Bench provides a standardized, repeatable way to evaluate the effectiveness of AI security scanners and penetration testing agents. Use it to:

Benchmark AI security scanners against realistic applications with known vulnerabilities
Test penetration testing AI agents on diverse tech stacks, auth mechanisms, and business logic
Measure detection rates across different application sizes and complexity levels

Benchmark Categories

The labs repository contains three tiers of security benchmarks:

Category	Count	Description	Complexity
T-Bench (Medium Apps)	26 apps	Custom-built realistic business applications	Medium-High
OSS (Large Apps)	~8 apps	Real-world open-source software with known CVEs	High
CTF Challenges (Small)	108+ labs	Focused single-vulnerability challenges	Low-Medium

1. T-Bench Applications (Medium Complexity)

Custom-built, medium-complexity business applications simulating real enterprise software.

Metric	Value
Total Apps	26 (app-001 to app-023b)
Avg Endpoints/App	23.7
Tech Diversity	8 backend languages, 4 databases

Example Apps

app-022 (Modern Banking): Rails + React + GraphQL + SAML SSO. Accounts, transfers, loans, check scanning.
app-010 (Performance Review): Go + Angular + MongoDB microservices. 360 feedback, calibration.
app-005 (Corporate Comms): PHP/Laravel. Announcements, newsletters, approval workflows.

2. OSS Applications (Large/Complex)

Real-world open-source applications with documented CVEs and vulnerabilities.

App	Description	Findings
Zabbix 6.4/7.0	Enterprise monitoring platform (PHP + PostgreSQL)	4 (SQLi, privilege escalation)
OpenFire 3.6.0	XMPP instant messaging server (Java)	12 (auth bypass, XSS, SQLi)
OpenCart	E-commerce platform (PHP)	7 (path traversal, code injection)
DVWP	Damn Vulnerable WordPress	4 (code injection, path traversal)
OrangeHRM	HR management system	Known CVEs

Additional benchmarks include:

tenzai-bench-login-{1,2,3}: SSRF and command injection challenges

3. CTF Challenges (Small/Focused)

Single-vulnerability challenges for focused testing.

XBEN (104 challenges)

External CTF platform challenges (xben-001-24 through xben-104-24)
82 working labs (the remaining 22 fail to deploy and are excluded)
Categories: IDOR, default credentials, injection, auth bypass
Example: xben-001-24 (trading platform with IDOR and default credentials)
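
The XBEN lab names follow a regular pattern, so the full ID range can be generated rather than typed out. The sketch below assumes the zero-padded three-digit number and the `-24` suffix hold for every lab, as the range above suggests; the helper name `xben_ids` is hypothetical. The list of 82 working labs cannot be derived from the names and would have to come from a separate source.

```python
def xben_ids(first: int = 1, last: int = 104, suffix: str = "24") -> list[str]:
    """Build lab IDs such as xben-001-24 for an inclusive number range."""
    return [f"xben-{n:03d}-{suffix}" for n in range(first, last + 1)]


ids = xben_ids()
print(ids[0], ids[-1], len(ids))
```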

WSCOIL (4 challenges)

PHP-focused web exploitation
wscoil-try-me: LFI via PHP filter chain gadgets
wscoil-lager-lounge, wscoil-just-want-tahini, wscoil-operations-console

Additional small challenges include:

CVE-Bench: 40+ real CVE reproductions (CVE-2024-* series) with automated grading

Grounded Reports

Each lab with known vulnerabilities includes grounded-report.md files that document the exact vulnerabilities and step-by-step exploitation procedures.

Purpose

Grounded reports serve as:

Ground truth for evaluating AI security scanner accuracy
Exploitation playbooks with reproducible attack steps
Training data for security testing agents
Benchmark answers to verify if a scanner correctly identifies vulnerabilities
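
To illustrate how a grounded report can act as ground truth, the following minimal sketch compares a scanner's reported findings against the known vulnerabilities of one lab. It assumes a finding is matched on an exact (CWE, endpoint) pair; the real evaluation framework may match differently, and the `Finding` type, the `score` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One vulnerability, keyed by CWE and endpoint (assumed matching key)."""
    cwe: str       # e.g. "CWE-639"
    endpoint: str  # e.g. "GET /api/orders/{id}"


def score(ground_truth: set[Finding], reported: set[Finding]) -> dict[str, float]:
    """Return the detected count, recall (detection rate), and precision."""
    true_positives = ground_truth & reported
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    precision = len(true_positives) / len(reported) if reported else 0.0
    return {"detected": len(true_positives), "recall": recall, "precision": precision}


# Hypothetical data: two known vulnerabilities, one found, one false positive.
truth = {
    Finding("CWE-639", "GET /api/orders/{id}"),
    Finding("CWE-89", "POST /api/search"),
}
found = {
    Finding("CWE-639", "GET /api/orders/{id}"),
    Finding("CWE-79", "GET /profile"),
}
result = score(truth, found)
print(result)
```

Exact matching is the simplest choice. A grader that accepts alternative CWE classifications would need a looser comparison than set intersection.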

Markdown Format (`grounded-report.md`)

Vulnerability N

Title: Descriptive name of the vulnerability

CWE: CWE-XXX (or CWE-XXX OR CWE-YYY for alternative classifications)

Name: Official CWE name

Severity:

Attack Vector: Network | Adjacent | Local | Physical
Attack Complexity: Low | High
Attack Requirements: None | Present
Privileges Required: None | Low | High
User Interaction: None | Passive | Active
Confidentiality: None | Low | High
Integrity: None | Low | High
Availability: None | Low | High

Endpoint: METHOD /path/to/endpoint

Assets: Prerequisites needed (credentials, session cookies, IDs)

Description: Technical explanation of the vulnerability root cause.

Impact: Business and security consequences of exploitation.

Reproduction:

Step:
- Action: Exact HTTP request or command with full payload
- Entity: What component is being attacked
- Required Assets: What you need for this step
- Derived Assets: What you gain from this step
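
As a rough illustration of how the header fields of this template could be read programmatically, the sketch below extracts `Title`, `CWE`, and `Endpoint` from a single vulnerability entry. It assumes each field sits on its own `Field: value` line (optionally wrapped in `**` bold markers) and that alternative CWEs are written as `CWE-XXX OR CWE-YYY`; the function name `parse_fields` and the sample entry are hypothetical, and the remaining fields are not handled.

```python
import re

# Matches "Title: x", "**Title:** x", and "**Title**: x" for the three fields handled here.
FIELD_RE = re.compile(r"^\**(Title|CWE|Endpoint)\**:\**\s*(.+)$")


def parse_fields(text: str) -> dict[str, object]:
    """Extract title, CWE list, HTTP method, and path from one vulnerability entry."""
    fields: dict[str, object] = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if not match:
            continue
        key, value = match.group(1), match.group(2)
        if key == "CWE":
            # Keeps every alternative classification, in the order written.
            fields["cwe"] = re.findall(r"CWE-\d+", value)
        elif key == "Endpoint":
            method, _, path = value.partition(" ")
            fields["method"] = method
            fields["path"] = path.strip()
        else:
            fields["title"] = value
    return fields


# Hypothetical entry, not taken from a real lab.
sample = """\
Title: Order lookup exposes other users' orders
CWE: CWE-639 OR CWE-284
Endpoint: GET /api/orders/{id}
"""
parsed = parse_fields(sample)
print(parsed)
```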

JSON Format (`grounded-report.json`)

Machine-readable version containing:

Structured endpoint definitions with URL patterns and parameter schemas
Parameter locations (BODY, COOKIE, QUERY, PATH)
Auth requirements and parameter types
Used by the evaluation framework for automated grading
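
The actual JSON schema is not reproduced here, so the following is only a sketch of what a structured endpoint definition might look like. The four parameter locations come from the list above; every key name (`method`, `url_pattern`, `requires_auth`, `parameters`, `type`) and the sample endpoint are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class ParamLocation(str, Enum):
    """Where a parameter is carried in the HTTP request."""
    BODY = "BODY"
    COOKIE = "COOKIE"
    QUERY = "QUERY"
    PATH = "PATH"


@dataclass
class Parameter:
    name: str
    location: ParamLocation
    type: str  # assumed free-form type label, e.g. "integer"


@dataclass
class Endpoint:
    method: str
    url_pattern: str
    requires_auth: bool
    parameters: list[Parameter] = field(default_factory=list)


# Hypothetical endpoint definition.
endpoint = Endpoint(
    method="GET",
    url_pattern="/api/orders/{id}",
    requires_auth=True,
    parameters=[
        Parameter("id", ParamLocation.PATH, "integer"),
        Parameter("session", ParamLocation.COOKIE, "string"),
    ],
)

encoded = json.dumps(asdict(endpoint), indent=2)
decoded = json.loads(encoded)
print(encoded)
```

Deriving `ParamLocation` from `str` lets `json.dumps` write each location as a plain string such as `"PATH"` without a custom encoder.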

Infrastructure

Feature	Implementation
Deployment	Kubernetes (local Docker Desktop/minikube + remote GKE)
Management	Dojo CLI in `TenzaiLtd/evaluation` repo
Access	ZeroTier VPN for remote GKE endpoints
Secrets	SOPS + GCP KMS encryption

Deploying a lab? See Lab Bringup for step-by-step instructions on running labs locally and on GKE.
