Scoring Methodology

CostGuard Safety Score

The CostGuard Safety Score (CSS) is a deterministic, heuristic-based measure of how resistant a prompt is to adversarial exploitation and operational failure. This page documents exactly how the score is defined, computed, benchmarked, and versioned.

Spec v1.1.0

1. What CostGuard Safety Score Measures

Range: 0–100

Direction: Higher = safer

Formula: CSS = 100 − risk_score

CSS measures structural resistance to five classes of adversarial and operational risk: prompt injection, system override, jailbreak behavior, token cost explosion, and tool abuse. A higher score indicates a prompt is structurally isolated, explicitly constrained, and less likely to produce runaway costs or exploitable behavior in production.
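As a minimal sketch of the formula above, the conversion from risk_score to CSS might look like the following. The function name `css_from_risk` is hypothetical. The upper cap at 100 follows Section 5; the lower clamp at 0 is an assumption added so the result stays inside the documented 0–100 range.

```python
def css_from_risk(risk_score: float) -> float:
    """Convert a risk score into a CostGuard Safety Score (CSS = 100 - risk_score)."""
    # Cap at 100 as described in Section 5; the floor at 0 is an assumption.
    clamped = max(0.0, min(100.0, risk_score))
    return 100.0 - clamped


print(css_from_risk(27))   # 73.0
print(css_from_risk(130))  # 0.0
```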

2. The Five Scoring Components

Prompt Injection

internal component: structural

Structural susceptibility to authority confusion or role override via untrusted input paths. Evaluated by detecting the absence of explicit separators between system instructions and user content.

Structural indicators

· No separator between system instructions and user input
· Absence of explicit role or boundary markers
· Open-ended instruction blocks accepting raw user content

System Override

internal component: structural

Susceptibility to instruction hijacking: the ability of embedded content to override system-level directives. Detected by the absence of output format constraints and section delimiters.

Structural indicators

· No explicit output format instruction
· No output constraints (max, limit, exactly)
· No section headers isolating system context
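Two of the indicators above lend themselves to a simple pattern check. The sketch below is illustrative only: the constraint keywords (max, limit, exactly) come from the list above, but the matching rules, the definition of a "section header", and the name `override_indicators` are assumptions, not the engine's actual logic.

```python
import re

# Keywords taken from the indicator list; whole-word matching is an assumption.
CONSTRAINT_PATTERN = re.compile(r"\b(max|limit|exactly)\b", re.IGNORECASE)

# Hypothetical notion of a section header: a line starting with '#'
# or an ALL-CAPS label followed by a colon (e.g. "SYSTEM:").
HEADER_PATTERN = re.compile(r"^(#+\s+\S|[A-Z][A-Z _]+:)", re.MULTILINE)


def override_indicators(prompt: str) -> dict[str, bool]:
    """Return True for each structural indicator that is present (i.e. risky)."""
    return {
        "no_output_constraints": CONSTRAINT_PATTERN.search(prompt) is None,
        "no_section_headers": HEADER_PATTERN.search(prompt) is None,
    }


print(override_indicators("Summarize the text."))
print(override_indicators("SYSTEM:\nReply in exactly 3 bullets."))
```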

Jailbreak Behavior

internal component: ambiguity

Open-ended output directives and underspecified constraint language that enable constraint bypass. Measured by the density of ambiguous qualitative terms.

Structural indicators

· Qualitative modifiers without concrete definitions (improve, optimize, better, high quality)
· Missing refusal boundaries
· Absence of explicit format requirements
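To illustrate what a term-density measure could look like, here is a minimal sketch. The four terms come from the first indicator above; defining density as matches per word, and the name `ambiguity_density`, are assumptions made for the example.

```python
import re

# Terms from the indicator list above; the real catalog is presumably larger.
AMBIGUOUS_TERMS = ("improve", "optimize", "better", "high quality")


def ambiguity_density(prompt: str) -> float:
    """Ambiguous-term matches per word; 0.0 for a prompt with no words."""
    words = re.findall(r"[A-Za-z']+", prompt.lower())
    if not words:
        return 0.0
    # Rejoin with single spaces so multi-word terms match across punctuation.
    text = " ".join(words)
    hits = sum(
        len(re.findall(rf"\b{re.escape(term)}\b", text))
        for term in AMBIGUOUS_TERMS
    )
    return hits / len(words)


print(ambiguity_density("Write high-quality code"))  # 0.25
```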

Token Cost Explosion

internal component: length + context + volatility

Risk that a prompt triggers unbounded or disproportionate token generation. Driven by prompt length relative to context window, context saturation, and open-ended output directives.

Structural indicators

· Phrases: write a detailed, comprehensive, in depth, as much as possible
· Expected output tokens exceeding 2× input tokens
· Prompt length over 50% of the model's context window
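The three indicators above can be sketched as simple boolean checks. This is an illustration under stated assumptions: token counts are supplied by the caller (the document does not say which tokenizer is used), phrase matching is a plain case-insensitive substring test, and `cost_explosion_flags` is a hypothetical name.

```python
# Phrases taken from the indicator list above.
VOLATILITY_PHRASES = (
    "write a detailed",
    "comprehensive",
    "in depth",
    "as much as possible",
)


def cost_explosion_flags(
    prompt: str,
    input_tokens: int,
    expected_output_tokens: int,
    context_window: int,
) -> dict[str, bool]:
    """Evaluate the three token-cost indicators for a single prompt."""
    lowered = prompt.lower()
    return {
        "open_ended_phrase": any(p in lowered for p in VOLATILITY_PHRASES),
        "output_exceeds_2x_input": expected_output_tokens > 2 * input_tokens,
        "prompt_over_half_context": input_tokens > 0.5 * context_window,
    }


print(cost_explosion_flags("Write a detailed report", 100, 250, 8000))
```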

Tool Abuse

internal component: ambiguity + structural

Structural ambiguity that leads to unpredictable tool invocations in agentic systems. High ambiguity density combined with absent output constraints can trigger unintended tool calls.

Structural indicators

· High instruction ambiguity density
· No explicit output format instruction
· Absence of scope constraints on tool use

3. Score Bands

Safe: 85–100

Prompt is structurally sound and resistant to exploitation.

Low: 70–84

Meets baseline requirements with minor structural gaps.

Warning: 40–69

Structural weaknesses that should be addressed before deployment.

High: 0–39

High exploitation risk. Do not deploy without remediation.

Band boundaries are fixed per specification version. A band boundary change requires a major version increment, documented rationale, and full benchmark review.
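The band table can be expressed as a small lookup. The sketch below uses the boundaries listed in this section; the function name `score_band` is hypothetical, and treating each lower bound as inclusive for fractional scores (for example, 84.5 falls in Low) is an assumption, since the bands are given as whole numbers.

```python
def score_band(css: float) -> str:
    """Map a CSS value (0-100) to its band name."""
    if not 0 <= css <= 100:
        raise ValueError("CSS must be between 0 and 100")
    if css >= 85:
        return "Safe"
    if css >= 70:
        return "Low"
    if css >= 40:
        return "Warning"
    return "High"


print(score_band(73))  # Low
```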

4. How Threat Intelligence Affects Scores

CostGuardAI aggregates anonymized structural incident patterns into a global threat intelligence database. When a prompt's structural signature matches a known high-risk pattern, a bounded additive adjustment is applied to the base risk_score.

Threat intelligence influence is additive but bounded: no single pattern match can reduce CSS by more than 10 points. This prevents disproportionate score swings from any single signal.

Pattern matching is performed on structural hashes — no raw prompt text is used in matching. See Section 8 for privacy details.
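A bounded additive adjustment of this kind might be sketched as follows. The 10-point bound comes from the text above and the cap at 100 from Section 5; the function name and the idea that each pattern carries its own raw adjustment value are assumptions for illustration.

```python
MAX_PATTERN_ADJUSTMENT = 10.0  # per-match bound stated in this section


def apply_threat_adjustment(base_risk: float, pattern_adjustment: float) -> float:
    """Add one pattern's adjustment to the base risk, bounded to [0, 10] points."""
    bounded = max(0.0, min(MAX_PATTERN_ADJUSTMENT, pattern_adjustment))
    # risk_score is capped at 100 before CSS is computed (Section 5).
    return min(100.0, base_risk + bounded)


print(apply_threat_adjustment(30, 25))  # 40.0
```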

5. How Prompt CVEs Influence Scores

Prompt CVEs (format: PCVE-YYYY-XXXX) are generated when a structural pattern accumulates 25 or more observed incidents. A CVE match applies a bounded risk adjustment:

· Critical: +10 pts to risk_score (CSS decreases by up to 10)
· High: +7 pts to risk_score (CSS decreases by up to 7)
· Medium: +3 pts to risk_score (CSS decreases by up to 3)

The final risk_score is capped at 100 before computing CSS. No CVE match alone can push a prompt below the Unsafe band boundary.
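The severity adjustments and the cap can be combined in a few lines. This is a sketch, not the engine's code: the point values and the cap are taken from this section, while `apply_cve_match` and the lowercase severity keys are hypothetical. The second call shows why the CSS decrease is "up to" the listed value: the cap absorbs part of the adjustment.

```python
# Risk points per CVE severity, as listed in this section.
CVE_RISK_POINTS = {"critical": 10, "high": 7, "medium": 3}


def apply_cve_match(risk_score: int, severity: str) -> tuple[int, int]:
    """Return (adjusted risk_score, resulting CSS) after one CVE match."""
    adjusted = min(100, risk_score + CVE_RISK_POINTS[severity.lower()])
    return adjusted, 100 - adjusted


print(apply_cve_match(40, "Critical"))  # (50, 50)
print(apply_cve_match(96, "High"))      # (100, 0): CSS drops by 4, not 7
```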


6. Benchmark Calibration

CostGuardAI maintains a canonical benchmark suite of structural fixtures spanning all five risk categories. The benchmark suite is run against every scoring engine change to verify that:

· Each fixture's risk_score falls within its expected range
· No fixture crosses a score band boundary unintentionally
· The overall pass rate remains 100% before release

Benchmark summaries are persisted as versioned JSON artifacts in artifacts/benchmarks/ for longitudinal calibration tracking.
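The release gate described above could be sketched like this. Everything here is illustrative: the `Fixture` fields, the fixture names, and the helper functions are assumptions, and the real suite's artifact format is not shown in this document. Band boundaries are those from Section 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixture:
    """Hypothetical benchmark fixture with an expected risk range and band."""
    name: str
    min_risk: float
    max_risk: float
    expected_band: str


def band_for(css: float) -> str:
    # Band boundaries from Section 3.
    if css >= 85:
        return "Safe"
    if css >= 70:
        return "Low"
    if css >= 40:
        return "Warning"
    return "High"


def fixture_passes(fixture: Fixture, risk_score: float) -> bool:
    """A fixture passes if its score is in range and its band did not change."""
    in_range = fixture.min_risk <= risk_score <= fixture.max_risk
    same_band = band_for(100 - risk_score) == fixture.expected_band
    return in_range and same_band


def release_allowed(fixtures: list[Fixture], scores: dict[str, float]) -> bool:
    """Require a 100% pass rate across all fixtures."""
    return all(fixture_passes(f, scores[f.name]) for f in fixtures)


fixtures = [
    Fixture("injection_basic", 50, 70, "Warning"),
    Fixture("clean_prompt", 0, 15, "Safe"),
]
print(release_allowed(fixtures, {"injection_basic": 60, "clean_prompt": 10}))
```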


7. Versioning and Score Stability

· Patch bump: bug fix with no scoring behavior change
· Minor bump: new ambiguity term or volatility phrase added to catalogs
· Major bump: weight or bucket threshold change; full benchmark review required

Every API response and shareable report includes analysis_version, score_version, and ruleset_hash for independent verification.
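As an illustration of the bump rules in this section, the sketch below maps a change category to the next version string. The category names and `next_version` are hypothetical, and resetting lower components to zero follows the usual semantic-versioning convention, which this document does not state explicitly. The band-boundary rule comes from Section 3.

```python
# Hypothetical change categories mapped to the bump levels in this section.
BUMP_LEVEL = {
    "bug_fix": "patch",
    "catalog_addition": "minor",
    "weight_change": "major",
    "threshold_change": "major",
    "band_boundary_change": "major",  # per Section 3
}


def next_version(current: str, change: str) -> str:
    """Return the version that a given kind of change would produce."""
    major, minor, patch = (int(part) for part in current.split("."))
    level = BUMP_LEVEL[change]
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(next_version("1.1.0", "catalog_addition"))  # 1.2.0
```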

8. Why the Score is Privacy-Safe

CostGuard Safety Score is computed from structural characteristics of prompts — not from their semantic content. Threat intelligence pattern matching uses structural hashes, not raw text.

· No raw prompt text is stored in threat intelligence records
· Pattern hashes use token bands (xs/sm/md/lg/xl/xxl) — not exact token counts
· Prompt CVEs expose only structural category, severity, and incident counts
· Pattern hashes are one-way — they cannot be reversed to recover prompt content
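To make the banding idea concrete, here is a minimal sketch of hashing a structural signature. The band labels come from the list above, but the numeric thresholds, the feature layout, the use of SHA-256, and both function names are assumptions; the document does not specify any of them. The point of the example is only that prompts with different exact token counts in the same band produce the same hash, and that no prompt text enters the hash input.

```python
import hashlib
import json

# Hypothetical upper bounds (in tokens) for each band; real thresholds are not documented here.
TOKEN_BANDS = [(128, "xs"), (512, "sm"), (2048, "md"), (8192, "lg"), (32768, "xl")]


def token_band(token_count: int) -> str:
    """Bucket an exact token count into a coarse band label."""
    for upper, label in TOKEN_BANDS:
        if token_count <= upper:
            return label
    return "xxl"


def pattern_hash(token_count: int, indicators: list[str]) -> str:
    """Hash structural features only; raw prompt text is never part of the input."""
    features = {
        "token_band": token_band(token_count),
        "indicators": sorted(indicators),  # order-independent signature
    }
    payload = json.dumps(features, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same band and same indicators yield the same hash.
print(pattern_hash(300, ["no_separator"]) == pattern_hash(450, ["no_separator"]))
```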
