Benchmark

311 Chatbots

An independent benchmark for AI-powered 311 service chatbots used by municipal governments. Transparent scores. Reproducible methodology.

Context

How AI is used in 311 services.

311 is the non-emergency municipal services line — the system residents contact to report potholes, request trash pickup, ask about permit requirements, or get information about city services. AI-powered chatbots are now handling a growing share of these interactions, either as the first point of contact or as a supplement to human operators.

When a 311 chatbot works well, it gives residents fast, accurate answers and routes complex or urgent requests to the right people. When it fails, it fabricates permit rules, mishandles sensitive requests, or gives different answers to English and Spanish speakers. These failures are not hypothetical — they are documented patterns across real deployments.

Public Bench's 311 benchmark is the first independent, use-case-specific evaluation suite for this class of AI system. It tests the failure modes that matter, using scenarios developed with practitioners who operate 311 systems every day.

What we test

Three evaluation dimensions.

01

Task Performance

Does your chatbot give accurate, complete answers to the questions residents actually ask? We test information accuracy, procedural guidance, routing, escalation, and more.

02

Safety & Harm

Does it fail in dangerous ways — fabricating permit rules, exposing private data, or mishandling ambiguous emergency requests? We stress-test the failure modes that matter.

03

Accessibility & Inclusion

Does it serve all residents equally — in English and Spanish, across literacy levels and technology access? Equity isn't optional in public services.

Overview

What we test

Every benchmark run submits 20 test scenarios to the chatbot under evaluation and scores each response with a separate judge LLM (Claude) against a verified rubric. Scores range from 0 to 100 and map to an A–F letter grade for each of the three dimensions and for the run overall.

Test items are designed in partnership with government technology experts and reviewed for real-world relevance to 311 service delivery. All test items, rubrics, and judge prompts are published openly on GitHub.

Structure

Three evaluation dimensions

Items are grouped into three dimensions that together cover the performance characteristics that matter most for public-facing AI in government services.

10 items

Task Performance

Accuracy, completeness, routing, escalation, and consistency. The core question: does the chatbot answer correctly and completely?

6 items

Safety & Harm

Fabrication, PII exposure, ambiguity handling, outdated information. The failure modes that damage resident trust.

4 items

Accessibility & Inclusion

Language parity (English/Spanish), literacy level, technology access. Does it serve all residents equally?

| Dimension | Sub-dimensions tested | Items |
|---|---|---|
| Task Performance | Information accuracy, procedural guidance, task routing, completeness, escalation, graceful degradation, consistency, cost efficiency | 10 |
| Safety & Harm | Confident fabrication, PII exposure, ambiguity mishandling, outdated information | 6 |
| Accessibility & Inclusion | Language parity, literacy level, technology access | 4 |

How scoring works

Scoring criteria

Each test item has a weighted rubric with criteria that sum to 100 points. The judge LLM scores each criterion independently and reports a 0–100 overall score alongside a confidence level.
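As a rough illustration of how a weighted rubric could roll up into an item score, the sketch below sums the points awarded per criterion and checks that the weights total 100. The criterion names, the `Criterion` class, and the `item_score` function are hypothetical and are not taken from the published rubrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str        # hypothetical criterion label
    max_points: int  # weight of this criterion; weights must sum to 100


def item_score(rubric: list[Criterion], awarded: dict[str, float]) -> float:
    """Sum the points awarded per criterion into a 0-100 item score."""
    total_weight = sum(c.max_points for c in rubric)
    if total_weight != 100:
        raise ValueError(f"rubric weights sum to {total_weight}, expected 100")
    score = 0.0
    for c in rubric:
        points = awarded.get(c.name, 0.0)  # an unscored criterion earns nothing
        if not 0 <= points <= c.max_points:
            raise ValueError(f"{c.name}: {points} outside 0..{c.max_points}")
        score += points
    return score


# Illustrative rubric and judge output (made-up values).
rubric = [
    Criterion("states_correct_fee", 40),
    Criterion("cites_official_source", 30),
    Criterion("offers_escalation_path", 30),
]
awarded = {
    "states_correct_fee": 40,
    "cites_official_source": 15,
    "offers_escalation_path": 30,
}
print(item_score(rubric, awarded))  # 85.0
```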

Confidence gating

Items where the judge's reported confidence falls below 0.70, or where the score is below 50, are automatically flagged for human review before they appear in reports. Flagged items are disclosed in the report card.
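The gating rule can be written as a single predicate. This is a minimal sketch: the two thresholds come from the text above, while the function and constant names are assumptions.

```python
CONFIDENCE_FLOOR = 0.70  # threshold stated in the methodology
SCORE_FLOOR = 50         # threshold stated in the methodology


def needs_human_review(score: float, confidence: float) -> bool:
    """Flag an item if the judge is unsure OR the score is low."""
    return confidence < CONFIDENCE_FLOOR or score < SCORE_FLOOR


print(needs_human_review(score=85, confidence=0.90))  # False
print(needs_human_review(score=85, confidence=0.65))  # True: low confidence
print(needs_human_review(score=45, confidence=0.95))  # True: low score
```

Either condition alone is enough to flag an item, so a confidently scored failure still goes to human review.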

The judge model is Anthropic Claude (claude-opus-4-5), operating at temperature 0 for reproducibility. It is instructed to score against the rubric only — it has no knowledge of the chatbot vendor or city identity during scoring.

Letter grades

Grading scale

Dimension scores and the overall score both map to an A–F letter grade using the same scale. The overall score is an unweighted average across all scored items.

| Grade | Score range | Interpretation |
|---|---|---|
| A | ≥ 90 | Excellent: meets or exceeds expectations across all criteria |
| B | 80–89 | Good: strong performance with minor gaps |
| C | 70–79 | Acceptable: meets baseline, noticeable weaknesses |
| D | 60–69 | Below expectations: significant improvement needed |
| F | < 60 | Needs improvement: fails to meet minimum standard |
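The grading scale and the unweighted average can be sketched as follows. Two points are assumptions rather than documented behavior: fractional scores are graded by their lower bound (so 89.5 is a B), and a dimension score is the plain mean of that dimension's items. The item scores are made up for illustration.

```python
from statistics import mean


def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter using the published cutoffs."""
    for floor, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= floor:
            return grade
    return "F"


# Illustrative item scores grouped by dimension (not real results).
scores_by_dimension = {
    "Task Performance": [90, 80, 70],
    "Safety & Harm": [60, 50],
    "Accessibility & Inclusion": [100],
}

for dimension, scores in scores_by_dimension.items():
    print(dimension, mean(scores), letter_grade(mean(scores)))

# Overall score: unweighted average across all scored items.
all_scores = [s for scores in scores_by_dimension.values() for s in scores]
overall = mean(all_scores)
print("Overall", overall, letter_grade(overall))
```

Because the overall score averages items rather than dimensions, Task Performance (10 of the 20 items in the current suite) contributes half of it.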

City-specific facts

Ground truth

Municipal chatbots answer questions whose correct answers vary by city, such as permit fees, repair timelines, and collection schedules. Scoring therefore requires verified, city-specific ground truth rather than a generic rubric.

City-specific facts are provided by city staff at intake via an 11-question policy form and cross-checked by our Researcher (an LLM with web search access) against official city sources. Where the Researcher finds a discrepancy, the staffer is shown both answers and chooses which to use.

CityProfile — frozen ground truth

Once submitted and verified, a city's answers are stored as a versioned CityProfile — a frozen snapshot used as the sole ground truth source for all benchmarks of that city. Profiles expire after 90 days and must be refreshed. This ensures benchmarks are reproducible and comparable over time.
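One way to model a frozen, versioned snapshot with a 90-day expiry is an immutable dataclass wrapping a read-only mapping. This is only a sketch: the field names, the example city, and the choice to treat a profile as expired strictly after 90 days are assumptions, not the actual CityProfile schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from types import MappingProxyType
from typing import Mapping

PROFILE_TTL = timedelta(days=90)  # expiry window stated in the methodology


@dataclass(frozen=True)  # attributes cannot be reassigned after creation
class CityProfile:
    city: str
    version: int
    verified_at: datetime
    facts: Mapping[str, str]  # pass a MappingProxyType to keep it read-only

    def is_expired(self, now: datetime) -> bool:
        """True once more than 90 days have passed since verification."""
        return now > self.verified_at + PROFILE_TTL


# Hypothetical profile with placeholder facts.
verified = datetime(2025, 1, 1, tzinfo=timezone.utc)
profile = CityProfile(
    city="Exampleville",
    version=1,
    verified_at=verified,
    facts=MappingProxyType({"bulk_pickup_fee": "placeholder value"}),
)
print(profile.is_expired(verified + timedelta(days=89)))  # False
print(profile.is_expired(verified + timedelta(days=91)))  # True
```

Refreshing a city's answers would then create a new object with a higher `version` instead of mutating the old one, which keeps earlier runs tied to the snapshot they were scored against.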

Address-specific facts (garbage and recycling collection schedules) are either provided by the staffer directly or looked up by the Researcher from official city lookup tools using a real residential address.

Auditability

Reproducibility

Every benchmark run stores a complete provenance record:

  • The exact test items and rubric version used
  • The judge model, provider, and temperature settings
  • All raw chatbot responses (verbatim)
  • All per-item scores and confidence levels
  • The city's ground truth answers and Researcher verification findings
  • The CityProfile version used for scoring
  • Timestamp of the run

All of this is included in the PDF report and accessible via the results page. Public Bench does not edit or redact run records.
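The provenance fields listed above could be captured in a single serializable record. The sketch below mirrors that list one field at a time. All field names, the item ID, and the sample values are hypothetical, and the real storage format is not described in this document.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunRecord:
    test_item_ids: list[str]             # exact test items used
    rubric_version: str                  # rubric version used
    judge_model: str
    judge_provider: str
    judge_temperature: float
    raw_responses: dict[str, str]        # verbatim chatbot output per item
    item_scores: dict[str, float]
    item_confidences: dict[str, float]
    ground_truth: dict[str, str]         # the city's answers
    researcher_findings: dict[str, str]  # verification notes per fact
    city_profile_version: int
    run_timestamp: str                   # ISO 8601, UTC


# Placeholder values for illustration only.
record = RunRecord(
    test_item_ids=["TP-01"],
    rubric_version="hypothetical-v1",
    judge_model="claude-opus-4-5",
    judge_provider="Anthropic",
    judge_temperature=0.0,
    raw_responses={"TP-01": "verbatim chatbot reply goes here"},
    item_scores={"TP-01": 85.0},
    item_confidences={"TP-01": 0.9},
    ground_truth={"bulk_pickup_fee": "placeholder value"},
    researcher_findings={"bulk_pickup_fee": "matches official source"},
    city_profile_version=1,
    run_timestamp=datetime.now(timezone.utc).isoformat(),
)

# Serialize with sorted keys so two records can be diffed line by line.
print(json.dumps(asdict(record), indent=2, sort_keys=True))
```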

Current limitations

Beta disclaimer

The current suite contains 20 test items across three dimensions. Statistical confidence intervals reflect this suite size and will narrow as the suite grows toward the 50–100+ items per dimension needed for production-grade precision.
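To see why a larger suite narrows the interval, consider the half-width of a confidence interval on a mean score, which shrinks with the square root of the item count. The sketch below uses a normal approximation and an assumed standard deviation of 15 points. Both are illustrative assumptions, since the interval method used in reports is not specified here.

```python
from math import sqrt


def ci_half_width(std_dev: float, n_items: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a mean of n_items scores."""
    if n_items < 2:
        raise ValueError("need at least two items")
    return z * std_dev / sqrt(n_items)


ASSUMED_STD_DEV = 15.0  # hypothetical spread of item scores, in points

for n in (20, 50, 100):
    print(n, round(ci_half_width(ASSUMED_STD_DEV, n), 2))
```

Under these assumptions, going from 20 to 100 items cuts the half-width by a factor of about 2.2 (the square root of 5). For a suite as small as 20 items, a t-based critical value would give a slightly wider interval than the 1.96 used here.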

All reports clearly disclose the current suite size and confidence intervals. The rubric, judge prompts, and test items are actively evolving based on expert feedback and real-world benchmarking findings.

What "BETA" means

BETA indicates that the test suite and methodology are actively evolving. Results are valid and useful now, but scores may shift as the rubric is refined. We recommend re-running benchmarks quarterly and tracking trends rather than relying on any single run.

Results

311 Chatbot scores published to date.

No results published yet for this benchmark.
