Methodology
How Public Bench designs, runs, and publishes benchmarks.
Every benchmark Public Bench publishes follows the same pipeline — from practitioner roundtable to published report card.
Core commitments
Principles
Every Public Bench benchmark is built and published under three commitments that define how we work and distinguish us from internal vendor evaluation or consultant-led assessments.
01
Independent
No vendor relationships. No self-grading. Judge models are always different from the system under test. We have no financial stake in any benchmark outcome.
02
Peer-validated
Rubrics are built with the practitioners who do the work — 311 operators, procurement officers, attorneys, police. Not with vendors, not in isolation.
03
Open by default
Methodology, rubrics, and judge prompts are public. Cities can rerun every benchmark themselves, for free, using our open-source code.
How benchmarks are built
The pipeline
Every benchmark Public Bench publishes follows the same five-stage pipeline. The pipeline is designed to be reproducible, phase-separated, and provider-agnostic.
Stage 0 — Roundtable
Before any automated testing begins, public servants with direct experience of the use case convene to define what good looks like, validate test scenarios, and confirm the benchmark discriminates meaningfully between good and poor AI behavior.
Stage A — Test Suite
Domain-specific YAML items, each containing: a scenario, ground truth, scoring rubric, and a judge prompt template. All items are published openly on GitHub.
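As an illustration only, a single item could be modeled in Python as follows. The class names `BenchmarkItem` and `Criterion`, the field names, the `$placeholder` template syntax, and the sample scenario are all assumptions made for this sketch; the published YAML schema on GitHub is the authoritative reference.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class Criterion:
    name: str    # hypothetical field name
    points: int  # maximum points this criterion can contribute


@dataclass
class BenchmarkItem:
    item_id: str
    scenario: str
    ground_truth: str
    rubric: list[Criterion]
    judge_prompt_template: str  # assumed to use $placeholders

    def render_judge_prompt(self, system_output: str) -> str:
        """Fill the template with the scenario, ground truth, rubric, and output."""
        rubric_text = "\n".join(f"- {c.name} ({c.points} pts)" for c in self.rubric)
        return Template(self.judge_prompt_template).substitute(
            scenario=self.scenario,
            ground_truth=self.ground_truth,
            rubric=rubric_text,
            output=system_output,
        )


# Invented example item, for illustration only.
item = BenchmarkItem(
    item_id="example-001",
    scenario="A resident asks how to report a missed trash pickup.",
    ground_truth="Direct the resident to the city's missed-pickup request form.",
    rubric=[Criterion("accuracy", 60), Criterion("clarity", 40)],
    judge_prompt_template=(
        "Scenario: $scenario\nGround truth: $ground_truth\n"
        "Rubric:\n$rubric\nSystem output: $output\n"
        "Score against the rubric only."
    ),
)
prompt = item.render_judge_prompt("You can file a missed-pickup request online.")
```

Note that the rendered prompt in this sketch carries only the scenario, ground truth, rubric, and system output, so nothing about the vendor or city reaches the judge.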
Stage B — Proctor
An automated component submits inputs to the AI system under evaluation and collects verbatim outputs for every test item.
Stage C — Judge
A separate LLM — always different from the system under test — scores each output against the rubric and reports a confidence level alongside the score.
Stage D — Reporter
Scores are aggregated by dimension and risk level. A PDF report card is generated with full methodology disclosure, flagged items, and confidence intervals.
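A minimal sketch of the grouping step, assuming each scored item carries a `dimension` and a `risk_level` label and that groups are summarized with a simple mean. The key names, the risk-level labels, and the choice of mean are assumptions for illustration; the Reporter's actual aggregation may differ.

```python
from collections import defaultdict
from statistics import mean


def aggregate(results: list[dict]) -> dict[tuple[str, str], float]:
    """Average item scores within each (dimension, risk_level) group."""
    groups: dict[tuple[str, str], list[float]] = defaultdict(list)
    for r in results:
        groups[(r["dimension"], r["risk_level"])].append(r["score"])
    return {key: mean(scores) for key, scores in groups.items()}


# Invented scores, for illustration only.
results = [
    {"dimension": "safety", "risk_level": "high", "score": 90},
    {"dimension": "safety", "risk_level": "high", "score": 70},
    {"dimension": "accessibility", "risk_level": "low", "score": 85},
]
summary = aggregate(results)
```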
How scores are calculated
Scoring framework
Each test item has a weighted rubric with criteria that sum to 100 points. The judge LLM scores each criterion independently and reports a 0–100 overall score alongside a confidence level.
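One way to read this, sketched below, is that each criterion has a maximum point value, the judge awards between 0 and that maximum, and the overall score is the sum of awarded points. That combination rule, the function name `overall_score`, and the criterion names are assumptions for illustration.

```python
def overall_score(max_points: dict[str, float], awarded: dict[str, float]) -> float:
    """Sum awarded points after checking the rubric totals 100."""
    if abs(sum(max_points.values()) - 100) > 1e-9:
        raise ValueError("rubric criteria must sum to 100 points")
    total = 0.0
    for name, cap in max_points.items():
        points = awarded[name]
        if not 0 <= points <= cap:
            raise ValueError(f"{name}: {points} is outside 0..{cap}")
        total += points
    return total


# Invented rubric and judge awards, for illustration only.
rubric = {"accuracy": 50, "safety": 30, "accessibility": 20}
judged = {"accuracy": 40, "safety": 30, "accessibility": 10}
score = overall_score(rubric, judged)
```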
Confidence gating
Items where confidence falls below 0.70, or where the score is below 50, are automatically flagged for human review before appearing in reports. Flagged items are disclosed in the report card.
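The gating rule can be sketched as a single predicate. The thresholds come from the paragraph above; the function and constant names are hypothetical, and confidence is assumed to be expressed on a 0 to 1 scale.

```python
CONFIDENCE_FLOOR = 0.70  # items below this confidence are flagged
SCORE_FLOOR = 50         # items below this score are flagged


def needs_human_review(score: float, confidence: float) -> bool:
    """Return True when either gating condition is met."""
    return confidence < CONFIDENCE_FLOOR or score < SCORE_FLOOR
```

Both comparisons are strict, so an item scoring exactly 50 with confidence exactly 0.70 is not flagged under this reading.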
The judge model operates at temperature 0 for reproducibility. It is instructed to score against the rubric only — it has no knowledge of the chatbot vendor or city identity during scoring.
Grading scale — used across all benchmarks:
| Grade | Score |
|---|---|
| A | ≥ 90 |
| B | 80–89 |
| C | 70–79 |
| D | 60–69 |
| F | < 60 |
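The scale translates directly into a lookup by lower bound. The sketch below assumes fractional scores fall into the band whose lower bound they meet (so 89.5 is a B); the table does not state how fractions are handled, and the function name is hypothetical.

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the band lower bounds."""
    if not 0 <= score <= 100:
        raise ValueError(f"score {score} is outside 0..100")
    for floor, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= floor:
            return grade
    return "F"
```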
Auditability
Reproducibility
Every benchmark run stores a complete provenance record:
- The exact test items and rubric version used
- The judge model, provider, and temperature settings
- All raw system outputs (verbatim)
- All per-item scores and confidence levels
- Any ground truth answers and verification findings
- Timestamp of the run
All of this is included in the PDF report and accessible via the results page. Public Bench does not edit or redact run records.
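For readers who want a concrete picture, the record could be represented roughly as below. The `ProvenanceRecord` class, its field names, and the sample values are assumptions for this sketch, not the stored format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    test_item_ids: list[str]
    rubric_version: str
    judge_model: str
    judge_provider: str
    judge_temperature: float
    raw_outputs: dict[str, str]          # item id -> verbatim system output
    item_scores: dict[str, float]        # item id -> 0-100 score
    item_confidences: dict[str, float]   # item id -> judge confidence
    ground_truth: dict[str, str]         # item id -> ground truth answer
    verification_findings: list[str]
    run_timestamp: str                   # ISO 8601, UTC


# Placeholder values, for illustration only.
record = ProvenanceRecord(
    test_item_ids=["example-001"],
    rubric_version="v0-example",
    judge_model="example-judge-model",
    judge_provider="example-provider",
    judge_temperature=0.0,
    raw_outputs={"example-001": "You can file a missed-pickup request online."},
    item_scores={"example-001": 80.0},
    item_confidences={"example-001": 0.9},
    ground_truth={"example-001": "Direct the resident to the request form."},
    verification_findings=[],
    run_timestamp=datetime.now(timezone.utc).isoformat(),
)
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
```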
Conflicts of interest
Independence
Public Bench does not accept payment from AI vendors. We do not offer paid certification or premium placement on leaderboards. We do not allow vendors to review benchmark results before they are published.
Judge models are always selected from a different provider than the system under test. For example, if a system is built on Anthropic Claude, the judge will be a Google or OpenAI model; if it is built on a Google or OpenAI model, the judge will come from one of the other providers.
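The provider-separation rule amounts to a filter over the available judges. In the sketch below, the function name, the judge model identifiers, and the case-insensitive provider match are assumptions for illustration.

```python
def eligible_judges(system_provider: str, judges: dict[str, str]) -> list[str]:
    """Return judge models whose provider differs from the system's provider."""
    system = system_provider.strip().lower()
    return sorted(
        model
        for model, provider in judges.items()
        if provider.strip().lower() != system
    )


# Hypothetical judge model identifiers mapped to their providers.
JUDGES = {
    "judge-model-a": "Anthropic",
    "judge-model-g": "Google",
    "judge-model-o": "OpenAI",
}
candidates = eligible_judges("Anthropic", JUDGES)
```

With a system built on Anthropic Claude, only the Google and OpenAI entries remain as candidates.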
Our commitment
If you believe a benchmark is flawed, unfair, or incomplete, our methodology and rubrics are public. File an issue on GitHub or contact us directly. We will investigate and respond publicly.
Active benchmarks
Benchmarks currently available
Each benchmark follows the pipeline described above with a use-case-specific test suite, rubric, and ground truth framework developed through a practitioner roundtable.
Live
311 Chatbots
20-item evaluation suite covering task performance, safety, and accessibility for AI-powered 311 service chatbots.
View benchmark →
Roundtable pending
Generative AI for Police Reports
Benchmark in development. Practitioner roundtable not yet scheduled.
Volunteer for the roundtable →