Scoring Methodology
CostGuard Safety Score
The CostGuard Safety Score (CSS) is a deterministic, heuristic-based measure of how resistant a prompt is to adversarial exploitation and operational failure. This page documents exactly how the score is defined, computed, benchmarked, and versioned.
1. What CostGuard Safety Score Measures
CSS measures structural resistance to five classes of adversarial and operational risk: prompt injection, system override, jailbreak behavior, token cost explosion, and tool abuse. A higher score indicates a prompt is structurally isolated, explicitly constrained, and less likely to produce runaway costs or exploitable behavior in production.
2. The Five Scoring Components
Prompt injection (internal component: structural)
Structural susceptibility to authority confusion or role override via untrusted input paths. Evaluated by detecting the absence of explicit separators between system instructions and user content.
Structural indicators
- No separator between system instructions and user input
- Absence of explicit role or boundary markers
- Open-ended instruction blocks accepting raw user content
System override (internal component: structural)
Susceptibility to instruction hijacking, in which embedded content overrides system-level directives. Detected by the absence of output format constraints and section delimiters.
Structural indicators
- No explicit output format instruction
- No output constraints (max, limit, exactly)
- No section headers isolating system context
Jailbreak behavior (internal component: ambiguity)
Open-ended output directives and underspecified constraint language that enable constraint bypass. Measured by the density of ambiguous qualitative terms.
Structural indicators
- Qualitative modifiers without concrete definitions (improve, optimize, better, high quality)
- Missing refusal boundaries
- Absence of explicit format requirements
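One plausible reading of "ambiguous qualitative term density" is the number of ambiguous-term occurrences divided by the number of words. The sketch below assumes that definition; the function name `ambiguity_density` and the normalization are illustrative, and the term list is limited to the four examples given above.

```python
import re

# Terms taken from the indicator list; the production list is not documented here.
AMBIGUOUS_TERMS = ("improve", "optimize", "better", "high quality")


def ambiguity_density(prompt: str) -> float:
    """Ambiguous-term occurrences per word; 0.0 for a prompt with no words."""
    text = prompt.lower()
    words = re.findall(r"[a-z0-9']+", text)
    if not words:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        for term in AMBIGUOUS_TERMS
    )
    return hits / len(words)
```

For "Improve the text and make it better" this yields 2 hits over 7 words. How such a density maps to a risk contribution is not specified on this page.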
Token cost explosion (internal component: length + context + volatility)
Risk that a prompt triggers unbounded or disproportionate token generation. Driven by prompt length relative to context window, context saturation, and open-ended output directives.
Structural indicators
- Phrases such as "write a detailed", "comprehensive", "in depth", "as much as possible"
- Expected output tokens exceeding 2× input tokens
- Prompt length over 50% of the model's context window
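The three indicators above translate directly into boolean checks. The following is a sketch under assumptions: the function name `cost_flags` and its parameters are hypothetical, token counts are supplied by the caller, and how the flags combine into a score is not covered.

```python
# Phrases copied from the indicator list above.
EXPANSIVE_PHRASES = ("write a detailed", "comprehensive", "in depth", "as much as possible")


def cost_flags(prompt: str, input_tokens: int, expected_output_tokens: int,
               context_window: int) -> dict:
    """Evaluate the three token-cost indicators for one prompt."""
    text = prompt.lower()
    return {
        # Any open-ended phrase present in the prompt text.
        "expansive_phrase": any(phrase in text for phrase in EXPANSIVE_PHRASES),
        # Expected output more than twice the input size.
        "output_exceeds_2x_input": expected_output_tokens > 2 * input_tokens,
        # Prompt occupies more than half of the model's context window.
        "prompt_over_half_context": input_tokens > 0.5 * context_window,
    }
```

For example, `cost_flags("Write a detailed report", 100, 250, 8000)` sets the first two flags and leaves the third unset.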
Tool abuse (internal component: ambiguity + structural)
Structural ambiguity that leads to unpredictable tool invocations in agentic systems. High ambiguity density combined with absent output constraints can trigger unintended tool calls.
Structural indicators
- High instruction ambiguity density
- No explicit output format instruction
- Absence of scope constraints on tool use
3. Score Bands
From the highest band to the lowest:
- Prompt is structurally sound and resistant to exploitation.
- Meets baseline requirements with minor structural gaps.
- Structural weaknesses that should be addressed before deployment.
- High exploitation risk. Do not deploy without remediation.
Band boundaries are fixed per specification version. A band boundary change requires a major version increment, documented rationale, and full benchmark review.
4. How Threat Intelligence Affects Scores
CostGuardAI aggregates anonymized structural incident patterns into a global threat intelligence database. When a prompt's structural signature matches a known high-risk pattern, a bounded additive adjustment is applied to the base risk_score.
Threat intelligence influence is additive but band-limited. No single pattern match can reduce CSS by more than 10 points. This prevents disproportionate score swings from any single signal.
Pattern matching is performed on structural hashes — no raw prompt text is used in matching. See Section 8 for privacy details.
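The bounded adjustment described in this section can be sketched as a clamp. Two things here are assumptions rather than documented behavior: that CSS is derived as 100 minus risk_score (which would make a 10-point risk increase equal a 10-point CSS drop), and the function names `apply_threat_adjustment` and `css_from_risk`.

```python
MAX_PATTERN_ADJUSTMENT = 10.0  # per-match bound stated in this section


def apply_threat_adjustment(base_risk: float, pattern_adjustment: float) -> float:
    """Add one pattern's adjustment to the base risk, bounded to at most 10 points."""
    bounded = max(0.0, min(pattern_adjustment, MAX_PATTERN_ADJUSTMENT))
    # Risk is capped at 100 before CSS is computed.
    return min(100.0, base_risk + bounded)


def css_from_risk(risk_score: float) -> float:
    """ASSUMPTION: CSS = 100 - risk_score. The exact formula is not given here."""
    return 100.0 - risk_score
```

With a base risk of 40 and a raw pattern adjustment of 25, the adjusted risk is 50, so under the assumed formula CSS falls from 60 to 50 and no further.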
5. How Prompt CVEs Influence Scores
Prompt CVEs (format: PCVE-YYYY-XXXX) are generated when a structural pattern accumulates 25 or more observed incidents. A CVE match applies a bounded risk adjustment.
The final risk_score is capped at 100 before computing CSS. No CVE match alone can push a prompt below the Unsafe band boundary.
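As an illustration of the issuance rule and the identifier format, here is a small sketch. It assumes that XXXX is a zero-padded four-digit sequence number, which the page does not confirm, and the helper names are hypothetical. The size of the per-CVE risk adjustment is not stated here, so it is not modeled.

```python
import re

CVE_INCIDENT_THRESHOLD = 25  # incidents required before a Prompt CVE is generated

# ASSUMPTION: YYYY is a four-digit year and XXXX a four-digit sequence number.
PCVE_ID = re.compile(r"^PCVE-\d{4}-\d{4}$")


def qualifies_for_cve(incident_count: int) -> bool:
    """True once a structural pattern has 25 or more observed incidents."""
    return incident_count >= CVE_INCIDENT_THRESHOLD


def format_pcve_id(year: int, sequence: int) -> str:
    """Build an identifier in the assumed PCVE-YYYY-XXXX layout."""
    return f"PCVE-{year:04d}-{sequence:04d}"
```

Under these assumptions, `format_pcve_id(2025, 7)` produces `PCVE-2025-0007`, and a pattern with 24 incidents does not yet qualify.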
6. Benchmark Calibration
CostGuardAI maintains a canonical benchmark suite of structural fixtures spanning all five risk categories. The benchmark suite is run against every scoring engine change to verify that:
- Each fixture's risk_score falls within its expected range
- No fixture crosses a score band boundary unintentionally
- The overall pass rate remains 100% before release
Benchmark summaries are persisted as versioned JSON artifacts in artifacts/benchmarks/ for longitudinal calibration tracking.
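The range check and the pass-rate gate can be sketched as below. The fixture fields (`name`, `prompt`, `min_risk`, `max_risk`) and the `run_benchmark` function are hypothetical, and the band-crossing check is omitted because band boundaries are not listed on this page.

```python
def run_benchmark(fixtures, score_fn):
    """Score every fixture and report the pass rate plus any out-of-range results.

    fixtures: list of dicts with hypothetical keys name, prompt, min_risk, max_risk.
    score_fn: callable mapping a prompt string to a risk_score.
    """
    failures = []
    for fixture in fixtures:
        risk = score_fn(fixture["prompt"])
        if not (fixture["min_risk"] <= risk <= fixture["max_risk"]):
            failures.append((fixture["name"], risk))
    total = len(fixtures)
    pass_rate = 1.0 if total == 0 else (total - len(failures)) / total
    return pass_rate, failures
```

A release gate in this style would refuse to ship unless `pass_rate == 1.0`, which matches the 100% requirement listed above.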
7. Versioning and Score Stability
Every API response and shareable report includes analysis_version, score_version, and ruleset_hash for independent verification.
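A client that wants to verify a response can start by confirming the three fields are present. This is a sketch only: the response is treated as a plain dictionary, the helper name `missing_version_fields` is hypothetical, and the surrounding response schema is not documented here.

```python
# Field names as listed in the sentence above.
REQUIRED_VERSION_FIELDS = ("analysis_version", "score_version", "ruleset_hash")


def missing_version_fields(response: dict) -> list:
    """Return the version fields that are absent or empty in a response."""
    return [key for key in REQUIRED_VERSION_FIELDS if not response.get(key)]
```

An empty result means all three fields are available to compare against a previously recorded report.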
8. Why the Score is Privacy-Safe
CostGuard Safety Score is computed from structural characteristics of prompts — not from their semantic content. Threat intelligence pattern matching uses structural hashes, not raw text.
- No raw prompt text is stored in threat intelligence records
- Pattern hashes use token bands (xs/sm/md/lg/xl/xxl), not exact token counts
- Prompt CVEs expose only structural category, severity, and incident counts
- Pattern hashes are one-way; they cannot be reversed to recover prompt content
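To show how banding and one-way hashing keep raw content out of a pattern record, here is a minimal sketch. The band thresholds, the signature layout, and the choice of SHA-256 are all assumptions made for illustration; only the band names come from the list above.

```python
import hashlib

# HYPOTHETICAL upper bounds per band; the real thresholds are not published here.
TOKEN_BANDS = (("xs", 64), ("sm", 256), ("md", 1024), ("lg", 4096), ("xl", 16384))


def token_band(token_count: int) -> str:
    """Map an exact token count to a coarse band name."""
    for name, upper in TOKEN_BANDS:
        if token_count <= upper:
            return name
    return "xxl"


def pattern_hash(category: str, token_count: int, flags: list) -> str:
    """Hash a structural signature; no prompt text or exact count is included."""
    signature = "|".join([category, token_band(token_count), *sorted(flags)])
    return hashlib.sha256(signature.encode("utf-8")).hexdigest()
```

Because only the band enters the signature, two prompts of 100 and 200 tokens with the same category and flags produce the same hash under these thresholds.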