Methodology

How Public Bench designs, runs, and publishes benchmarks.

Every benchmark Public Bench publishes follows the same pipeline — from practitioner roundtable to published report card.

Core commitments

Principles

Every Public Bench benchmark is built and published under three commitments that define how we work and distinguish us from internal vendor evaluation or consultant-led assessments.

01

Independent

No vendor relationships. No self-grading. Judge models are always different from the system under test. We have no financial stake in any benchmark outcome.

02

Peer-validated

Rubrics are built with the practitioners who do the work: 311 operators, procurement officers, attorneys, and police officers. They are not built with vendors or in isolation.

03

Open by default

Methodology, rubrics, and judge prompts are public. Cities can rerun every benchmark themselves, for free, using our open-source code.

How benchmarks are built

The pipeline

Every benchmark Public Bench publishes follows the same five-stage pipeline. The pipeline is designed to be reproducible, phase-separated, and provider-agnostic.

Stage 0 — Roundtable

Before any automated testing begins, public servants with direct experience of the use case convene to define what good looks like, validate test scenarios, and confirm the benchmark discriminates meaningfully between good and poor AI behavior.

Stage A — Test Suite

The test suite consists of domain-specific YAML items, each containing a scenario, ground truth, a scoring rubric, and a judge prompt template. All items are published openly on GitHub.
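As an illustration only, one test item might look like this once loaded from YAML into Python. The class name `BenchmarkItem`, the field names, the criterion names, and the sample content are all assumptions made for this sketch, not the published schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical in-memory shape of one test item (all names assumed)."""
    item_id: str
    scenario: str
    ground_truth: str
    rubric: dict[str, int]  # criterion name -> weight in points
    judge_prompt_template: str

    def validate(self) -> None:
        # Rubric criteria are described as summing to 100 points.
        total = sum(self.rubric.values())
        if total != 100:
            raise ValueError(
                f"{self.item_id}: rubric weights sum to {total}, expected 100"
            )


# Placeholder content, invented for illustration.
example_item = BenchmarkItem(
    item_id="example-001",
    scenario="A resident asks how to report a missed trash pickup.",
    ground_truth="Placeholder: the verified correct answer for the city.",
    rubric={"accuracy": 50, "safety": 30, "accessibility": 20},
    judge_prompt_template="Scenario: {scenario}\nReference: {ground_truth}",
)
example_item.validate()
```

Validating the weights at load time catches a malformed rubric before any system is queried.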

Stage B — Proctor

An automated component submits inputs to the AI system under evaluation and collects verbatim outputs for every test item.

Stage C — Judge

A separate LLM, always from a different provider than the system under test, scores each output against the rubric and reports a confidence level alongside the score.

Stage D — Reporter

Scores are aggregated by dimension and risk level. A PDF report card is generated with full methodology disclosure, flagged items, and confidence intervals.
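A minimal sketch of aggregating by dimension and risk level follows. It assumes each scored item carries hypothetical `dimension`, `risk_level`, and `score` keys and that groups are summarized with a simple mean. The actual aggregation method is not specified here.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(results: list[dict]) -> dict[tuple[str, str], float]:
    """Group item scores by (dimension, risk_level) and average each group.

    The key names and the use of a plain mean are assumptions.
    """
    groups: dict[tuple[str, str], list[float]] = defaultdict(list)
    for result in results:
        key = (result["dimension"], result["risk_level"])
        groups[key].append(result["score"])
    return {key: mean(scores) for key, scores in groups.items()}


# Invented sample data for illustration.
sample_results = [
    {"dimension": "safety", "risk_level": "high", "score": 80},
    {"dimension": "safety", "risk_level": "high", "score": 60},
    {"dimension": "accessibility", "risk_level": "low", "score": 90},
]
summary = aggregate_scores(sample_results)
```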

How scores are calculated

Scoring framework

Each test item has a weighted rubric with criteria that sum to 100 points. The judge LLM scores each criterion independently and reports a 0–100 overall score alongside a confidence level.
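One plausible way to combine per-criterion judgments into a 0–100 score is a weighted sum, sketched below. The function name, the criterion names, and the idea that the judge awards each criterion a fraction between 0 and 1 are assumptions. How criterion scores are actually combined is not stated here.

```python
def overall_score(weights: dict[str, float], fractions: dict[str, float]) -> float:
    """Weighted sum of criterion scores.

    weights:   criterion -> points available (must sum to 100)
    fractions: criterion -> share of those points earned, from 0.0 to 1.0
    """
    if abs(sum(weights.values()) - 100) > 1e-9:
        raise ValueError("rubric weights must sum to 100")
    if set(weights) != set(fractions):
        raise ValueError("every criterion needs exactly one judged fraction")
    for name, fraction in fractions.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction for {name!r} is outside 0.0-1.0")
    return sum(weights[name] * fractions[name] for name in weights)


# Hypothetical rubric: 50 + 15 + 10 points earned.
example_score = overall_score(
    {"accuracy": 50, "safety": 30, "accessibility": 20},
    {"accuracy": 1.0, "safety": 0.5, "accessibility": 0.5},
)
```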

Confidence gating

Items where confidence falls below 0.70, or where the score is below 50, are automatically flagged for human review before appearing in reports. Flagged items are disclosed in the report card.
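The gating rule can be expressed as a small predicate. The thresholds come from the text above. The function and constant names are illustrative.

```python
CONFIDENCE_THRESHOLD = 0.70  # confidence below this triggers review
SCORE_THRESHOLD = 50         # score below this triggers review


def needs_human_review(score: float, confidence: float) -> bool:
    """Return True when an item should be flagged for human review."""
    return confidence < CONFIDENCE_THRESHOLD or score < SCORE_THRESHOLD
```

Either condition alone is enough to flag an item, so a high score with low confidence is still reviewed.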

The judge model operates at temperature 0 for reproducibility. It is instructed to score against the rubric only, and it is given no information about the system's vendor or the city's identity during scoring.

Grading scale — used across all benchmarks:

Grade    Score
A        ≥ 90
B        80–89
C        70–79
D        60–69
F        < 60
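The grading scale maps to a simple threshold lookup, sketched here. The function name is illustrative. The scale does not say how fractional scores such as 89.5 are handled, so this sketch assumes they fall into the lower band without rounding.

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the published band floors."""
    # Assumption: no rounding, so 89.5 is a B rather than an A.
    for floor, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= floor:
            return grade
    return "F"
```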

Auditability

Reproducibility

Every benchmark run stores a complete provenance record:

  • The exact test items and rubric version used
  • The judge model, provider, and temperature settings
  • All raw system outputs (verbatim)
  • All per-item scores and confidence levels
  • Any ground truth answers and verification findings
  • Timestamp of the run

All of this is included in the PDF report and accessible via the results page. Public Bench does not edit or redact run records.
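As a sketch of what such a provenance record could contain, the dataclass below mirrors the listed fields and serializes them to JSON. The class name, field names, and sample values are hypothetical. The actual storage format is not described here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one benchmark run (field names assumed)."""
    rubric_version: str
    item_ids: list[str]
    judge_model: str
    judge_provider: str
    judge_temperature: float
    raw_outputs: dict[str, str]          # item id -> verbatim system output
    item_scores: dict[str, float]        # item id -> score
    item_confidences: dict[str, float]   # item id -> judge confidence
    verification_findings: dict[str, str]
    run_timestamp: str                   # ISO 8601, e.g. UTC


# Placeholder values, invented for illustration.
record = ProvenanceRecord(
    rubric_version="example-v1",
    item_ids=["example-001"],
    judge_model="example-judge-model",
    judge_provider="example-provider",
    judge_temperature=0.0,
    raw_outputs={"example-001": "verbatim system response"},
    item_scores={"example-001": 75.0},
    item_confidences={"example-001": 0.9},
    verification_findings={"example-001": "matches ground truth"},
    run_timestamp="2000-01-01T00:00:00+00:00",
)
record_json = json.dumps(asdict(record), sort_keys=True)
```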

Conflicts of interest

Independence

Public Bench does not accept payment from AI vendors. We do not offer paid certification or premium placement on leaderboards. We do not allow vendors to review benchmark results before they are published.

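Judge models are always selected from a different provider than the system under test. If a system is built on Anthropic Claude, for example, the judge will be a Google or OpenAI model. The same rule applies in reverse: a system built on a Google or OpenAI model is judged by a model from another provider.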
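The provider-separation rule can be sketched as a small selection function. The function name, the candidate list, and the first-match order are assumptions. How a judge is chosen among the eligible providers is not specified here.

```python
def pick_judge_provider(system_provider: str, candidates: list[str]) -> str:
    """Return the first candidate provider that differs from the system's."""
    for provider in candidates:
        if provider.lower() != system_provider.lower():
            return provider
    raise ValueError("no judge provider differs from the system under test")


# Hypothetical candidate pool, drawn from the providers named above.
judge_for_claude_system = pick_judge_provider(
    "Anthropic", ["Anthropic", "Google", "OpenAI"]
)
```

Raising an error when no eligible provider exists keeps a run from silently falling back to self-grading.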

Our commitment

Our methodology and rubrics are public. If you believe a benchmark is flawed, unfair, or incomplete, file an issue on GitHub or contact us directly. We will investigate and respond publicly.

Active Benchmarks

Benchmarks currently available

Each benchmark follows the pipeline described above with a use-case-specific test suite, rubric, and ground truth framework developed through a practitioner roundtable.

Live

311 Chatbots

20-item evaluation suite covering task performance, safety, and accessibility for AI-powered 311 service chatbots.

View benchmark →

Roundtable pending

Generative AI for Police Reports

Benchmark in development. Practitioner roundtable not yet scheduled.

Volunteer for the roundtable →