Independent AI Benchmarking · Municipal Government

The AI cities buy in the next five years will define public services for the next twenty.

Public Bench produces transparent, peer-validated report cards that public servants can use to procure, manage, and trust the AI their cities depend on.

Free · Takes about 10–15 minutes · PDF report delivered by email

Theory of Change

A three-horizon path from one benchmark to a national standard.

H1

Now → 18 mo

Prove the model

One benchmark, one open-source release, end-to-end credibility on 311.

H2

2–4 yrs

Build the library

A growing catalog of expert-validated benchmarks embedded in procurement workflows.

H3

5+ yrs

Shift the market

Performance standards, cooperative purchasing, and shareable contracts that make trustworthy AI the default in government.

How it works

A reproducible pipeline. The same code runs every benchmark we ship.

Stage 0

Roundtable

Public servants convene to define the use case, validate test scenarios, and confirm that the benchmark meaningfully distinguishes stronger systems from weaker ones.

Volunteer →

Stage A

Test Suite

Domain-specific YAML items: scenario, ground truth, rubric, judge prompt.
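As a rough illustration of what one item might carry once its YAML is loaded, the sketch below models the four parts named above. The field names, the `BenchmarkItem` class, the `load_item` helper, and the sample content are all assumptions made for illustration; the actual schema is not shown here.

```python
from dataclasses import dataclass

# Hypothetical field names mirroring the four parts of an item.
REQUIRED_FIELDS = ("scenario", "ground_truth", "rubric", "judge_prompt")


@dataclass(frozen=True)
class BenchmarkItem:
    scenario: str       # what the resident asks or does
    ground_truth: str   # the correct answer under the city's own policy
    rubric: list        # criteria the judge checks the output against
    judge_prompt: str   # instructions given to the judging model


def load_item(raw: dict) -> BenchmarkItem:
    """Build an item from a parsed mapping, rejecting incomplete ones."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"item is missing fields: {', '.join(missing)}")
    return BenchmarkItem(**{name: raw[name] for name in REQUIRED_FIELDS})


# Invented sample content, not taken from a real test suite.
example = load_item({
    "scenario": "A resident asks which day bulk trash is collected.",
    "ground_truth": "Bulk trash is collected on the first Monday of each month.",
    "rubric": ["States the correct collection day", "Does not invent fees"],
    "judge_prompt": "Score the answer against the rubric and the ground truth.",
})
```

Rejecting incomplete items at load time is one way to keep a malformed scenario from reaching the later stages.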

Stage B

Proctor

Submits inputs to the AI under evaluation; collects verbatim outputs.

Stage C

Judge

A separate LLM scores outputs against the rubric and reports confidence.

Stage D

Reporter

Aggregates by dimension and risk; produces an A–F report card.
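The aggregation step can be pictured with a short sketch. It assumes each judged item carries a score between 0 and 1, a dimension, and a risk level, and it uses invented letter-grade cut-offs. None of these details, nor the function names, come from the published methodology.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cut-offs; the real grading scale is not stated here.
GRADE_CUTOFFS = [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")]


def letter_grade(score: float) -> str:
    """Map a 0-1 score to a letter using the assumed cut-offs."""
    for cutoff, letter in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


def aggregate(results: list, key: str) -> dict:
    """Average the judge scores within each group named by `key`."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result[key]].append(result["score"])
    return {name: mean(scores) for name, scores in buckets.items()}


# Invented judge results for illustration only.
results = [
    {"dimension": "task performance", "risk": "low", "score": 1.0},
    {"dimension": "task performance", "risk": "high", "score": 0.5},
    {"dimension": "safety", "risk": "high", "score": 1.0},
    {"dimension": "accessibility", "risk": "low", "score": 0.75},
]

by_dimension = aggregate(results, "dimension")
by_risk = aggregate(results, "risk")
overall = mean(r["score"] for r in results)
overall_grade = letter_grade(overall)
```

With these sample numbers the overall mean is 0.8125, which the assumed scale maps to a B, while the task performance dimension averages 0.75 and would map to a C.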

Provider and platform agnostic: Anthropic, Google, OpenAI, and custom / proprietary AI.
Phase-separated: Re-run any phase without the others.
Fully reproducible: Every score is auditable end-to-end.
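One way phase separation could work is for each phase to write its results to a file that the next phase reads. The sketch below assumes that design; the function names, file names, and JSON layout are hypothetical, and plain functions stand in for both the system under test and the judging LLM.

```python
import json
import tempfile
from pathlib import Path


def run_proctor(items, ask, out_path):
    """Stage B: send each scenario to the system under test, save verbatim outputs."""
    records = [{"id": it["id"], "output": ask(it["scenario"])} for it in items]
    out_path.write_text(json.dumps(records, indent=2))


def run_judge(items, outputs_path, score, out_path):
    """Stage C: score the saved outputs without re-querying the system."""
    outputs = {r["id"]: r["output"] for r in json.loads(outputs_path.read_text())}
    scores = [
        {"id": it["id"], "score": score(outputs[it["id"]], it["ground_truth"])}
        for it in items
    ]
    out_path.write_text(json.dumps(scores, indent=2))
    return scores


# Invented item plus stand-ins for the system under test and the judge.
items = [{"id": "trash-01", "scenario": "When is trash pickup?", "ground_truth": "Tuesday"}]
calls = []


def fake_system(prompt):
    calls.append(prompt)  # record every query made to the system
    return "Trash pickup is on Tuesday."


def lenient_judge(output, truth):
    return 1.0 if truth.lower() in output.lower() else 0.0


def strict_judge(output, truth):
    return 1.0 if output.strip() == truth else 0.0


workdir = Path(tempfile.mkdtemp())
outputs_file = workdir / "outputs.json"

run_proctor(items, fake_system, outputs_file)
first = run_judge(items, outputs_file, lenient_judge, workdir / "scores_v1.json")
# Re-run only the judge phase with a different scorer; the proctor is not called again.
second = run_judge(items, outputs_file, strict_judge, workdir / "scores_v2.json")
```

Because the saved outputs file is the only link between the two phases, the same verbatim outputs can be re-scored later, which also gives an auditor a fixed record to check each score against.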

Get started

From setup to report card in minutes.

Select your use case to begin. More benchmarks are added as roundtables complete and testing protocols are validated.

311 Chatbots

01 — Setup

Tell us about your city

Provide your contact information and answer a short set of policy questions specific to your use case. Your answers become the ground truth for scoring.

02 — Benchmark

Testing scenarios run automatically

Our pipeline tests your system across task performance, safety, and accessibility scenarios. The run is automated, and items that need human review are flagged.

03 — Report

A–F grade with full breakdown

A PDF report card with an overall grade, dimension scores, flagged items, and methodology disclosure. Shareable with leadership.