Document 1 of 7 - For Agencies

Evaluation Framework for AI-Powered Training

A vendor-neutral procurement framework for evaluating AI-assisted training, with a closing assessment of CodeBlu against the criteria.

On this page

1. Purpose and how to use this document
2. The structural risks specific to AI training
3. Evaluation criteria
4. Disqualifying findings
5. Questions to ask any vendor
6. Scoring approach
7. How CodeBlu measures up
Sources


Prepared for training coordinators, procurement officers, and chief executives evaluating AI-assisted de-escalation and crisis-intervention training products.

1. Purpose and how to use this document

A growing number of vendors now sell "AI-powered" training to law enforcement agencies. The label covers a wide range of products: branching video simulators, large-language-model chat tutors, voice-based role-play, and analytics layered on top of conventional courseware. Some are well built. Some are conventional e-learning with a marketing refresh. The buyer's problem is that the term "AI" does not, by itself, tell a procurement officer anything about instructional quality, data security, or whether the product will hold up under scrutiny from a city council or a risk pool.

This document is a framework for evaluating any AI-assisted training product, written so it can be used regardless of which vendor an agency is considering. Sections 2 through 6 are vendor-neutral: the risks specific to this product category, the evaluation criteria, the disqualifying findings, the questions to put to every vendor, and a scoring approach. Section 7 applies the framework to CodeBlu specifically, including where CodeBlu does not yet meet a criterion. An agency can use Sections 2 through 6 as the skeleton of an RFP or an internal evaluation memo and disregard Section 7 if CodeBlu is not under consideration.

Two principles run through the framework. First, the burden of proof is on the vendor. A claim that cannot be evidenced should be scored as if it were not made. Second, AI-assisted training is an addition to a training program, not a replacement for the judgment of a qualified instructor or for live, in-person practice. A product that positions itself as a full replacement for either should raise a flag, not earn a point.

2. The structural risks specific to AI training

Before the criteria, it is worth naming the failure modes that are specific to this product category. These are the risks that a conventional e-learning evaluation will miss.

Generated content that is wrong or inappropriate. Products that use generative AI to produce feedback, scenario dialogue, or after-action review can produce output that is plausible but incorrect, or that is tonally wrong for a crisis subject. The evaluation must ask how the vendor constrains, reviews, and corrects generated content.

The "automation of judgment" problem. A product that scores an officer's performance is, in effect, making an evaluative judgment. If that score is treated as authoritative, the agency has outsourced a piece of its training evaluation to a vendor's algorithm. The question is not whether the score is useful. It often is. The question is whether the product is honest that the score is a training aid and not a substitute for instructor evaluation.

Data sensitivity. Training scenarios about mental-health crisis, domestic disturbance, and suicide involve sensitive subject matter. Session transcripts may contain content an agency would not want exposed. Where that data goes, who can read it, and how long it is kept are first-order procurement questions, not IT afterthoughts.

Methodology provenance. Some vendors wrap a single existing curriculum and present it as proprietary. Others cite respected frameworks they have no relationship with. An agency should be able to see clearly what the methodology is, where it comes from, and what is and is not claimed.

Vendor durability. Early-stage software vendors fail, get acquired, or pivot. A training product holding an agency's compliance records is infrastructure. The evaluation must account for what happens to the data and the records if the vendor goes away.

3. Evaluation criteria

The criteria below are organized into eight categories. For each, the framework gives the standard to apply and the evidence to require. Section 6 explains how to score and weight the categories.

3.1 Pedagogical foundation

The product should rest on an identifiable, defensible instructional approach. De-escalation is a skill, and skills are built through structured practice with feedback, not through passive content consumption. The evaluation should confirm that the product provides realistic practice, timely feedback, and repetition, and that its instructional design is documented rather than implied.

Evidence to require: a written description of the instructional model; the credentials of whoever designed the curriculum; a sample scenario and a sample feedback output.

3.2 Methodology and evidence base

The product's training content should be traceable to recognized work in the field. Respected reference points for de-escalation and crisis-intervention training include the Police Executive Research Forum's ICAT program (Integrating Communications, Assessment, and Tactics), the Crisis Intervention Team model associated with CIT International, and human-factors research of the kind associated with the Force Science Institute. The standard is not that a vendor must license one of these. It is that the vendor can articulate what its content draws on and can distinguish between citing a body of work and being partnered with or endorsed by its authors.

Evidence to require: a methodology statement; clear language on what partnerships or endorsements exist, if any; any independent evaluation of training outcomes the vendor relies on, with the source.

A note on outcome claims. Research on de-escalation training is still maturing. One of the more rigorous evaluations to date studied the Louisville Metro Police Department's adoption of PERF's ICAT curriculum and reported statistically significant reductions in use-of-force incidents and in injuries to officers and community members (Engel, Corsaro, et al., Criminology and Public Policy, 2022). A vendor that claims its own product produces comparable outcomes should be asked for its own evidence. Borrowing the outcome data of an unrelated curriculum is not evidence about the product being sold.

3.3 Technology reliability and quality

The product must work reliably under the conditions the agency will use it in: the agency's bandwidth, its browsers, its hardware, and its number of concurrent users. For voice-based products, audio quality and latency materially affect the training value. For any product depending on third-party AI services, an outage of that service is an outage of the product.

Evidence to require: a live demonstration on agency equipment, not a polished sales video; stated uptime history or a service-level commitment; a description of what happens when an upstream AI service is unavailable.

3.4 Data security and privacy

The product will hold personally identifiable information about officers and potentially sensitive training content. The evaluation must establish where data is stored, how it is encrypted, who can access it, how access is segregated between agencies, how long data is retained, and what subprocessors handle the data. This category is detailed further in the companion document, CodeBlu Security and Privacy: A Reference for Agency IT Decision-Makers, and the questions in Section 5.4 below apply to any vendor.

Evidence to require: a written security overview; a subprocessor list; the data-retention policy; the breach-notification commitment; any third-party audit or attestation, or an honest statement that none exists yet.

3.5 Compliance and records

If the agency intends to use the product toward a state in-service or continuing-education requirement, the product must produce records the agency can actually use. The critical distinction: a training vendor cannot grant state training credit. Credit eligibility is determined by the state regulator and, in many states, by the agency's chief executive. A credible vendor produces defensible completion records and hour summaries and is explicit that the credit decision is the agency's. A vendor that claims its product "is approved for" or "grants" state credit should be asked to produce the regulator's approval in writing.

Evidence to require: a sample completion certificate and hour summary; the export formats available; the vendor's written language on credit, checked for overclaiming.

3.6 Administration and integration

The product must be administrable by the people who will actually run it: a training coordinator, not a software engineer. The evaluation should confirm how users are provisioned and deprovisioned, how a supervisor sees their roster's progress, how data is exported, and what integration, if any, exists with the agency's learning management system or records.

Evidence to require: a walkthrough of the administrator interface; the user-import and export process; a list of any integrations.

3.7 Vendor viability and support

The agency is entering a relationship, not making a one-time purchase. The evaluation should consider how long the vendor has operated, who stands behind it, what support is included, and what the exit looks like: how the agency gets its records out and what happens to its data if the contract ends or the vendor fails.

Evidence to require: the legal entity and its history; references from comparable agencies, if any exist; the support model; the data-export-on-exit and data-deletion commitments in writing.

A new vendor is not disqualified by being new. Much useful public-safety software comes from small companies. But newness should be priced into the decision: a shorter initial term, a documented exit path, and retained copies of compliance records are reasonable protections.

3.8 Total cost and contract terms

The evaluation should capture the full cost over a realistic horizon, not the headline per-seat price: implementation effort, administrator time, the cost of any hardware, and the staff time officers spend training. Contract terms matter as much as price: the term length, renewal and price-escalation terms, the termination provisions, and whether the vendor's standard terms are workable for a government buyer. Many agencies cannot agree to indemnify a vendor or to binding arbitration; a vendor whose contract assumes a commercial customer will need to negotiate.

Evidence to require: a written quote with all fees itemized; the standard contract or terms of service; a written commitment that the per-seat price is held for the contract term.
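The full-cost calculation described in 3.8 can be sketched in a few lines. This is a minimal illustration, not a prescribed costing model: the function name, its parameters, and every figure in the example are hypothetical, and an agency would substitute the numbers from its own written quote and its own staff rates.

```python
def total_cost_of_ownership(
    seats,
    per_seat_price,
    years,
    annual_escalation=0.0,       # e.g. 0.05 for a 5% yearly price increase
    implementation=0.0,          # one-time setup effort, in dollars
    hardware=0.0,                # one-time hardware cost, in dollars
    admin_hours_per_year=0.0,    # training coordinator time
    admin_hourly_rate=0.0,
    officer_hours_per_year=0.0,  # training hours per officer
    officer_hourly_rate=0.0,
):
    """Return a cost breakdown over the contract horizon (all inputs hypothetical)."""
    # License cost, compounding any price escalation year over year.
    licenses = sum(
        seats * per_seat_price * (1 + annual_escalation) ** year
        for year in range(years)
    )
    # Staff time is a real cost even though it never appears on the quote.
    admin_time = admin_hours_per_year * admin_hourly_rate * years
    officer_time = seats * officer_hours_per_year * officer_hourly_rate * years
    total = licenses + implementation + hardware + admin_time + officer_time
    return {
        "licenses": licenses,
        "implementation": implementation,
        "hardware": hardware,
        "admin_time": admin_time,
        "officer_time": officer_time,
        "total": total,
    }


# Illustrative figures only: 50 seats at $200 per seat for a 3-year term.
breakdown = total_cost_of_ownership(
    seats=50,
    per_seat_price=200.0,
    years=3,
    implementation=2000.0,
    hardware=1500.0,
    admin_hours_per_year=40,
    admin_hourly_rate=50.0,
    officer_hours_per_year=4,
    officer_hourly_rate=45.0,
)
print(breakdown["total"])  # 66500.0
```

In this made-up example the headline price is $10,000 per year, but the three-year total is $66,500, and officer training time is nearly as large as the license fees. That gap is the reason the criterion asks for the full cost rather than the per-seat price.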

4. Disqualifying findings

Some findings should end an evaluation regardless of how the product scores elsewhere. An agency should treat the following as disqualifying unless the vendor resolves them:

The vendor claims a partnership, certification, accreditation, or endorsement it cannot evidence in writing.
The vendor claims its product grants or guarantees state training credit, and cannot produce the regulator's written approval.
The vendor cannot say where agency data is stored or which subprocessors handle it.
The vendor positions the product as a full replacement for live, in-person training or for instructor evaluation.
The vendor will not provide a written security overview or a data-export-on-exit commitment.
The product's generated content cannot be reviewed or corrected by the agency or the vendor.

5. Questions to ask any vendor

The following questions are written to be dropped directly into an RFP or a vendor interview. They are grouped to follow the criteria above, with related categories combined.

5.1 Pedagogy and methodology

Describe your instructional model. Who designed the curriculum, and what are their qualifications?
What recognized frameworks or research does your content draw on? For each, state plainly whether you are partnered with, certified by, or endorsed by the source organization, or whether you are citing public work.
What evidence do you have that your product improves officer performance? Distinguish evidence about your product from evidence about a curriculum you reference.
How is generated content, including feedback and scenario dialogue, reviewed and corrected?

5.2 Technology and reliability

What third-party services does the product depend on? What happens to the product when one of them is unavailable?
Provide uptime history or a service-level commitment.
We will require a live demonstration on our own equipment and network. Confirm you can provide one.

5.3 Compliance and records

What completion records and training-hour summaries does the product produce? Provide samples and the export formats.
State your position on state training credit in writing. Do you claim your product grants credit, or do you provide records the agency uses to seek credit?

5.4 Data security and privacy

Where is agency data stored, and in what country?
List every subprocessor that handles agency data and what each one does.
Are session transcripts or recordings sent to any third-party AI service? Can that service retain or train on the data?
How is data encrypted in transit and at rest? How is one agency's data segregated from another's?
What is the default data-retention period, and is it configurable?
Do you hold any third-party security audit or attestation, such as SOC 2? If not, what is your timeline?
Describe your breach-notification commitment.

5.5 Vendor, support, and contract

What is the legal entity behind the product, and how long has it operated?
Provide references from agencies comparable to ours, if any.
What support is included, and what are the response times?
If we terminate or you cease operations, how do we retrieve our records, and what happens to our data?
Provide your standard contract. Identify any terms, such as indemnification or arbitration, that a government customer typically cannot accept.

6. Scoring approach

Score each of the eight categories in Section 3 on a 0 to 5 scale, where 0 means no acceptable evidence was provided, 3 means the criterion is met, and 5 means the criterion is met with strong, independently verifiable evidence. Weight the categories to the agency's priorities; a suggested default weighting follows.

Category	Suggested weight
Pedagogical foundation	15%
Methodology and evidence base	15%
Technology reliability and quality	15%
Data security and privacy	20%
Compliance and records	10%
Administration and integration	10%
Vendor viability and support	10%
Total cost and contract terms	5%

A category scored 0 or 1 should trigger a written explanation in the evaluation memo regardless of the weighted total, because a single weak category can be the one that matters. A disqualifying finding from Section 4 ends the evaluation regardless of score. The output of the process should be a short memo a chief executive can take to a training committee or governing body: the weighted score, the categories that scored low and why, the disqualifying findings checked and cleared, and a recommendation.
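The scoring rules above can be sketched as a short calculation. This is an illustration under stated assumptions, not a required tool: the names evaluate and EvaluationResult are hypothetical, the weights are the suggested defaults from the table, and the example scores are invented. The weighted total stays on the same 0 to 5 scale as the category scores.

```python
from dataclasses import dataclass
from typing import Dict, List

# Suggested default weights from the table above, in percent (they sum to 100).
DEFAULT_WEIGHTS: Dict[str, int] = {
    "Pedagogical foundation": 15,
    "Methodology and evidence base": 15,
    "Technology reliability and quality": 15,
    "Data security and privacy": 20,
    "Compliance and records": 10,
    "Administration and integration": 10,
    "Vendor viability and support": 10,
    "Total cost and contract terms": 5,
}


@dataclass
class EvaluationResult:
    weighted_score: float          # 0.0 to 5.0
    needs_explanation: List[str]   # categories scored 0 or 1
    disqualified: bool             # any unresolved Section 4 finding


def evaluate(scores, disqualifying_findings=(), weights=None):
    """Apply the Section 6 rules to one vendor's category scores."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100 percent")
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted categories")
    for category, score in scores.items():
        if not 0 <= score <= 5:
            raise ValueError(f"{category}: score must be between 0 and 5")

    weighted = sum(scores[c] * w for c, w in weights.items()) / 100
    # A 0 or 1 requires a written explanation regardless of the total.
    low = [c for c in weights if scores[c] <= 1]
    # A disqualifying finding ends the evaluation regardless of score.
    return EvaluationResult(weighted, low, bool(disqualifying_findings))


# Invented scores for a hypothetical vendor.
example_scores = {
    "Pedagogical foundation": 4,
    "Methodology and evidence base": 3,
    "Technology reliability and quality": 3,
    "Data security and privacy": 2,
    "Compliance and records": 4,
    "Administration and integration": 3,
    "Vendor viability and support": 1,
    "Total cost and contract terms": 3,
}
result = evaluate(example_scores)
print(result.weighted_score)     # 2.85
print(result.needs_explanation)  # ['Vendor viability and support']
print(result.disqualified)       # False
```

The sketch keeps the three outputs separate on purpose: a vendor can have an acceptable weighted total and still carry a low category that must be explained in the memo, or a disqualifying finding that overrides the total.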

7. How CodeBlu measures up

This section applies the framework to CodeBlu. It is written to the same standard as the rest of the document, including where CodeBlu does not meet a criterion. CodeBlu is operated by CodeBlu LLC, based in Colorado.

Pedagogical foundation. CodeBlu is a voice-based practice product. An officer holds a spoken conversation with an AI agent playing a person in crisis, then receives an after-action review. This is practice-with-feedback, which is the correct instructional shape for a skill. The scenarios are calibrated using CodeBlu's Thought, Emotion, Behavior framework, which sets the simulated subject's emotional state and level of cooperation. Assessment: meets the criterion. An agency should still ask for the curriculum designer's qualifications and review a sample scenario directly.

Methodology and evidence base. CodeBlu describes its methodology as a synthesis of several public bodies of work: the Force Science Institute, CIT International and the Memphis Model, PERF's ICAT, Colorado's Crisis Response and Intervention Training, and the Crisis Intervention Team Association of Colorado. CodeBlu states plainly that it claims no partnership, certification, or endorsement from any of them. That is the correct posture. Caveat: CodeBlu's own source attributions are, by the company's internal account, not yet verified against the primary materials, and CodeBlu is a new product without independent outcome evidence of its own. Assessment: partially meets the criterion. Treat the methodology as credible in shape but ask for the verified attribution detail and do not expect product-specific outcome data yet.

Technology reliability and quality. CodeBlu is a web application hosted on managed infrastructure, using a third-party voice AI service for the conversations and a third-party AI model for after-action review. The dependency on those services is real: an outage of the voice provider is an outage of the training. CodeBlu is early-stage and does not yet publish an uptime history. Assessment: require a live demonstration on agency equipment and ask for a service-level commitment before contracting.

Data security and privacy. CodeBlu stores data in United States infrastructure, uses a managed Postgres database with row-level security to segregate each agency's data, encrypts data in transit, and applies a configurable retention period defaulting to 90 days. Session transcripts are sent to a third-party AI provider to generate the after-action review. CodeBlu does not currently hold a SOC 2 attestation; SOC 2 is on its stated roadmap. CJIS alignment is described by CodeBlu as planned, not implemented. Assessment: the architecture is reasonable and the company is candid about its limits, but the absence of a current SOC 2 attestation and the transcript flow to an AI subprocessor are real findings. See the companion security document.

Compliance and records. CodeBlu produces completion certificates with verification codes, milestone certificates, and per-year training-hour summaries with an export suitable for an agency's own reporting. CodeBlu states explicitly that it does not and cannot grant state training credit, and that credit eligibility rests with the agency's chief executive. That is the correct and defensible position. Assessment: meets the criterion.

Administration and integration. CodeBlu provides an administrator and trainer dashboard with a roster view, officer drill-down, and a roster import. It does not currently offer a published integration with external learning management systems. Assessment: meets the criterion for standalone use; confirm the export formats fit the agency's records workflow.

Vendor viability and support. CodeBlu is a new product from a small company. This is not disqualifying, but it should be priced into the decision with a shorter initial term and a written data-export-on-exit commitment. Assessment: require the exit and deletion commitments in writing.

Total cost and contract terms. CodeBlu's published pricing is a free 30-day department pilot, flat annual department licenses by seat band for standard agency accounts, and custom pricing for larger agencies. CodeBlu's public terms of service are, by the company's own account, still in legal review, with several provisions a government buyer would scrutinize left deliberately unfinished. Assessment: obtain a written quote and a contract a government buyer can accept before proceeding; expect to negotiate the standard terms.

Summary. CodeBlu fits the framework well on pedagogy, compliance posture, and honesty about its own limits, and is still early on security maturity, outcome evidence, and contract readiness. It is a reasonable candidate for a pilot, evaluated with the same rigor this framework applies to any vendor, and not a finished enterprise platform. The companion documents in this set address security, RFP responses, return on investment, comparison to traditional training, and implementation in detail.

Sources

Police Executive Research Forum, ICAT: Integrating Communications, Assessment, and Tactics: https://www.policeforum.org
CIT International (Crisis Intervention Team): https://www.citinternational.org
Force Science Institute: https://www.forcescience.com
Engel, R., Corsaro, N., et al., evaluation of de-escalation training, Criminology and Public Policy (2022), Wiley Online Library: https://onlinelibrary.wiley.com/journal/17459133
AICPA, SOC 2 reporting overview: https://www.aicpa-cima.com/topic/audit-assurance/audit-and-assurance-greater-than-soc-2
FBI Criminal Justice Information Services (CJIS) Security Policy Resource Center: https://le.fbi.gov/informational-tools/cjis/cjis-security-policy-resource-center

This document is a procurement aid. It is not legal advice. An agency should route contract and data-handling questions to its own counsel.

Talk to us

For pricing, a structured pilot, or any question this document does not answer, email sales@codeblu.co. For security, privacy, or named-subprocessor questions, email privacy@codeblu.co.
