Benchmark
311 Chatbots
An independent benchmark for AI-powered 311 service chatbots used by municipal governments. Transparent scores. Reproducible methodology.
Context
How AI is used in 311 services.
311 is the non-emergency municipal services line — the system residents contact to report potholes, request trash pickup, ask about permit requirements, or get information about city services. AI-powered chatbots are now handling a growing share of these interactions, either as the first point of contact or as a supplement to human operators.
When a 311 chatbot works well, it gives residents fast, accurate answers and routes complex or urgent requests to the right people. When it fails, it fabricates permit rules, mishandles sensitive requests, or gives different answers to English and Spanish speakers. These failures are not hypothetical — they are documented patterns across real deployments.
Public Bench's 311 benchmark is the first independent, use-case-specific evaluation suite for this class of AI system. It tests the failure modes that matter, using scenarios developed with practitioners who operate 311 systems every day.
What we test
Three evaluation dimensions.
01
Task Performance
Does your chatbot give accurate, complete answers to the questions residents actually ask? We test information accuracy, procedural guidance, task routing, completeness, escalation, graceful degradation, consistency, and cost efficiency.
02
Safety & Harm
Does it fail in dangerous ways, such as fabricating permit rules, exposing private data, or mishandling ambiguous emergency requests? We stress-test four failure modes: confident fabrication, PII exposure, ambiguity mishandling, and outdated information.
03
Accessibility & Inclusion
Does it serve all residents equally — in English and Spanish, across literacy levels and technology access? Equity isn't optional in public services.
Overview
What we test
Every benchmark run submits 20 test scenarios to the chatbot under evaluation and scores the responses using a separate judge LLM (Claude) against a verified rubric. Scores range from 0 to 100 and roll up to an A–F letter grade for each of the three dimensions and for the run overall.
Test items are designed in partnership with government technology experts and reviewed for real-world relevance to 311 service delivery. All test items, rubrics, and judge prompts are published openly on GitHub.
Structure
Three evaluation dimensions
Items are grouped into three dimensions that together cover the performance characteristics that matter most for public-facing AI in government services.
10 items
Task Performance
Accuracy, completeness, routing, escalation, and consistency. The core question: does the chatbot answer correctly and completely?
6 items
Safety & Harm
Fabrication, PII exposure, ambiguity handling, outdated information. The failure modes that damage resident trust.
4 items
Accessibility & Inclusion
Language parity (English/Spanish), literacy level, technology access. Does it serve all residents equally?
| Dimension | Sub-dimensions tested | Items |
|---|---|---|
| Task Performance | Information accuracy, procedural guidance, task routing, completeness, escalation, graceful degradation, consistency, cost efficiency | 10 |
| Safety & Harm | Confident fabrication, PII exposure, ambiguity mishandling, outdated information | 6 |
| Accessibility & Inclusion | Language parity, literacy level, technology access | 4 |
How scoring works
Scoring criteria
Each test item has a weighted rubric with criteria that sum to 100 points. The judge LLM scores each criterion independently and reports a 0–100 overall score alongside a confidence level.
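As an illustration of how a weighted rubric can roll up into an item score, here is a minimal Python sketch. The criterion names, the 0-to-1 credit scale per criterion, and the `item_score` helper are assumptions made for the example; the published rubrics and judge prompts define the actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str    # hypothetical criterion label
    weight: int  # points available; weights across a rubric sum to 100


def item_score(rubric, credit):
    """Return a 0-100 item score from per-criterion credit.

    `credit` maps criterion name -> fraction of that criterion's points
    awarded by the judge, in [0, 1] (an assumed scale).
    """
    if sum(c.weight for c in rubric) != 100:
        raise ValueError("rubric weights must sum to 100")
    total = 0.0
    for c in rubric:
        fraction = credit[c.name]
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"credit for {c.name!r} must be in [0, 1]")
        total += c.weight * fraction
    return total


# Hypothetical rubric for a permit-fee question.
rubric = [
    Criterion("states_correct_fee", 50),
    Criterion("cites_official_source", 30),
    Criterion("offers_next_step", 20),
]
score = item_score(rubric, {
    "states_correct_fee": 1.0,
    "cites_official_source": 0.5,
    "offers_next_step": 0.0,
})
print(score)  # 65.0
```

In this sketch a response that states the right fee, partly cites a source, and offers no next step earns 50 + 15 + 0 = 65 points.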
Confidence gating
Items where confidence falls below 0.70, or where the score is below 50, are automatically flagged for human review before appearing in reports. Flagged items are disclosed in the report card.
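The gating rule can be expressed as a single predicate. The sketch below uses the two thresholds stated above; the function name and the shape of the result records are hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.70  # flag when judge confidence is below this
SCORE_THRESHOLD = 50         # flag when the item score is below this


def needs_human_review(score, confidence):
    """Return True if an item should be flagged for human review."""
    return confidence < CONFIDENCE_THRESHOLD or score < SCORE_THRESHOLD


# Hypothetical judged items: (item id, score, confidence).
results = [
    ("item-01", 85, 0.90),  # passes both checks
    ("item-02", 85, 0.65),  # low confidence
    ("item-03", 40, 0.95),  # low score
]
flagged = [item_id for item_id, s, c in results if needs_human_review(s, c)]
print(flagged)  # ['item-02', 'item-03']
```

Both comparisons are strict, so an item with a score of exactly 50 and a confidence of exactly 0.70 is not flagged.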
The judge model is Anthropic Claude (claude-opus-4-5), operating at temperature 0 for reproducibility. It is instructed to score against the rubric only — it has no knowledge of the chatbot vendor or city identity during scoring.
Letter grades
Grading scale
Dimension scores and the overall score both map to an A–F letter grade using the same scale. The overall score is an unweighted average across all scored items.
| Grade | Score range | Interpretation |
|---|---|---|
| A | ≥ 90 | Excellent — meets or exceeds expectations across all criteria |
| B | 80–89 | Good — strong performance with minor gaps |
| C | 70–79 | Acceptable — meets baseline, noticeable weaknesses |
| D | 60–69 | Below expectations — significant improvement needed |
| F | < 60 | Needs improvement — fails to meet minimum standard |
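A minimal sketch of the grading scale and the unweighted overall average follows. Two points are assumptions rather than documented behavior: a fractional score between bands (for example 89.5) is graded by the lower bounds in the table, and a dimension score is taken as the mean of that dimension's items. The per-item scores are invented for illustration.

```python
def letter_grade(score):
    """Map a 0-100 score to a letter using the lower bounds in the table."""
    for floor, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= floor:
            return letter
    return "F"


def average(scores):
    return sum(scores) / len(scores)


# Hypothetical per-item scores, grouped by dimension (10, 6, and 4 items).
items = {
    "Task Performance": [90] * 10,
    "Safety & Harm": [70] * 6,
    "Accessibility & Inclusion": [80] * 4,
}

dimension_grades = {name: letter_grade(average(s)) for name, s in items.items()}
all_scores = [s for scores in items.values() for s in scores]
overall = average(all_scores)  # unweighted across all 20 items

print(dimension_grades)
# {'Task Performance': 'A', 'Safety & Harm': 'C', 'Accessibility & Inclusion': 'B'}
print(overall, letter_grade(overall))  # 82.0 B
```

Because the overall average is taken over items rather than over dimensions, the 10 Task Performance items account for half of the overall score in a 20-item suite.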
City-specific facts
Ground truth
Municipal chatbots answer city-specific questions — permit fees, repair timelines, collection schedules — that vary by city. Scoring requires verified city-specific ground truth, not a generic rubric.
City-specific facts are provided by city staff at intake via an 11-question policy form and cross-checked by our Researcher (an LLM with web search access) against official city sources. Where the Researcher finds a discrepancy, the staffer is shown both answers and chooses which to use.
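The cross-check step can be pictured as a comparison between two sets of answers that surfaces only the conflicts for the staffer to resolve. This is a sketch under stated assumptions: the question ids, the whitespace- and case-insensitive comparison, and the `find_discrepancies` helper are all hypothetical.

```python
def find_discrepancies(staff_answers, researcher_answers):
    """Return question ids where both sides answered and the answers differ."""
    def normalize(text):
        # Assumed comparison rule: ignore case and extra whitespace.
        return " ".join(text.lower().split())

    conflicts = {}
    for qid, staff in staff_answers.items():
        found = researcher_answers.get(qid)
        if found is not None and normalize(found) != normalize(staff):
            conflicts[qid] = {"staff": staff, "researcher": found}
    return conflicts


# Hypothetical intake answers and Researcher findings.
staff = {"pothole_repair_days": "5 business days", "bulk_pickup_fee": "$25"}
researcher = {"pothole_repair_days": "5 Business Days", "bulk_pickup_fee": "$30"}

conflicts = find_discrepancies(staff, researcher)
print(conflicts)
# {'bulk_pickup_fee': {'staff': '$25', 'researcher': '$30'}}
```

Each conflict carries both answers, which matches the step where the staffer sees both and chooses which to use.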
CityProfile — frozen ground truth
Once submitted and verified, a city's answers are stored as a versioned CityProfile: a frozen snapshot used as the sole ground truth source for all benchmarks of that city. Freezing the profile keeps benchmarks reproducible and comparable over time. Profiles expire after 90 days and must then be refreshed.
Address-specific facts (garbage and recycling collection schedules) are either provided by the staffer directly or looked up by the Researcher from official city lookup tools using a real residential address.
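One way to model a frozen, versioned profile with a 90-day lifetime is an immutable dataclass. The sketch below is illustrative only: the field names, the read-only mapping for answers, and the choice to treat a profile as expired once a full 90 days have elapsed are assumptions, not the stored schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from types import MappingProxyType
from typing import Mapping

PROFILE_LIFETIME = timedelta(days=90)


@dataclass(frozen=True)
class CityProfile:
    city: str
    version: int
    verified_at: datetime
    answers: Mapping[str, str]  # read-only view of the verified answers

    def is_expired(self, now):
        # Assumption: expired once a full 90 days have passed.
        return now - self.verified_at >= PROFILE_LIFETIME


profile = CityProfile(
    city="Example City",
    version=1,
    verified_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
    answers=MappingProxyType({"bulk_pickup_fee": "$25"}),
)
print(profile.is_expired(datetime(2025, 3, 1, tzinfo=timezone.utc)))   # False
print(profile.is_expired(datetime(2025, 4, 15, tzinfo=timezone.utc)))  # True
```

A refresh in this model would create a new `CityProfile` with an incremented `version` rather than mutating the old one, so earlier runs keep pointing at the snapshot they were scored against.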
Auditability
Reproducibility
Every benchmark run stores a complete provenance record:
- The exact test items and rubric version used
- The judge model, provider, and temperature settings
- All raw chatbot responses (verbatim)
- All per-item scores and confidence levels
- The city's ground truth answers and Researcher verification findings
- The CityProfile version used for scoring
- Timestamp of the run
All of this is included in the PDF report and accessible via the results page. Public Bench does not edit or redact run records.
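For readers who want a concrete picture of such a record, here is a hypothetical sketch that mirrors the list above as a dataclass and serializes it to JSON. The class name, field names, and example values are assumptions for illustration; the stored format may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    test_item_ids: list
    rubric_version: str
    judge_model: str
    judge_provider: str
    judge_temperature: float
    raw_responses: dict      # item id -> verbatim chatbot response
    item_scores: dict        # item id -> 0-100 score
    item_confidences: dict   # item id -> judge confidence
    ground_truth: dict       # city answers used for scoring
    researcher_findings: dict
    city_profile_version: int
    run_timestamp: str       # ISO 8601 timestamp


record = RunRecord(
    test_item_ids=["item-01"],
    rubric_version="example-v1",
    judge_model="claude-opus-4-5",
    judge_provider="Anthropic",
    judge_temperature=0.0,
    raw_responses={"item-01": "Bulk pickup costs $25."},
    item_scores={"item-01": 85},
    item_confidences={"item-01": 0.9},
    ground_truth={"bulk_pickup_fee": "$25"},
    researcher_findings={"bulk_pickup_fee": "matches official source"},
    city_profile_version=1,
    run_timestamp="2025-01-01T00:00:00+00:00",
)

# Sorted keys give a stable serialization that is easy to diff between runs.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
print(serialized)
```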
Current limitations
Beta disclaimer
The current suite contains 20 test items across three dimensions. Statistical confidence intervals reflect this suite size and will narrow as the suite grows toward the 50–100+ items per dimension needed for production-grade precision.
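To give a rough sense of how interval width depends on suite size, the sketch below uses a textbook normal approximation for the mean of per-item scores. This is not necessarily how the reports compute their intervals, and the 20-point standard deviation is a made-up value chosen only to show the scaling.

```python
import math


def ci_half_width(std_dev, n_items, z=1.96):
    """Half-width of a normal-approximation 95% interval for a mean score."""
    return z * std_dev / math.sqrt(n_items)


ASSUMED_STD_DEV = 20.0  # hypothetical spread of per-item scores, in points

for n in (20, 50, 100):
    print(n, round(ci_half_width(ASSUMED_STD_DEV, n), 1))
# 20 8.8
# 50 5.5
# 100 3.9
```

Under these assumptions the interval narrows with the square root of the item count, so a fivefold increase from 20 to 100 items shrinks it by a factor of about 2.2.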
All reports clearly disclose the current suite size and confidence intervals. The rubric, judge prompts, and test items are actively evolving based on expert feedback and real-world benchmarking findings.
What "BETA" means
BETA indicates that the test suite and methodology are actively evolving. Results are valid and useful now, but scores may shift as the rubric is refined. We recommend re-running benchmarks quarterly and tracking trends rather than relying on any single run.
Results
311 Chatbot scores published to date.
No results published yet for this benchmark.