5.5 — Testing AI-Generated Code

Testing & Verification 90 min QA & Security

Learning Objectives

  • Articulate why AI-generated code requires fundamentally different testing approaches, supported by quantitative evidence.
  • Identify the key failure modes of AI-generated code and design tests that specifically target each failure mode.
  • Implement mutation testing as the critical quality gate for AI-generated code.
  • Execute a 7-step verification workflow for all AI-generated code before merge.
  • Define and track AI-specific testing metrics to measure organizational risk from AI code generation.
  • Implement provenance tracking, AI BOM, and SBOM practices for AI-generated components.

1. Why AI-Generated Code Needs Different Testing

This is not a theoretical concern. The data is clear.

1.1 The Vulnerability Multiplier

Veracode's 2025 State of Software Security report analyzed millions of scans across thousands of applications and found that AI-generated code contains 2.74 times more vulnerabilities than human-written code. This is not a marginal increase: it is nearly triple the defect rate.

The finding was consistent across languages, frameworks, and vulnerability types. AI-generated code was more likely to contain:

  • Injection vulnerabilities (SQL, command, template)
  • Insecure authentication patterns
  • Missing input validation
  • Improper error handling
  • Hardcoded credentials and secrets

1.2 The Quality Gap

CodeRabbit's analysis of pull requests (2025) compared AI-generated PRs against human-written PRs using automated code review metrics:

Metric                   | Human PRs | AI PRs | Difference
Average issues per PR    | 6.45      | 10.83  | +68%
Logic/correctness errors | Baseline  | +75%   | Significantly higher
Security findings        | Baseline  | +57%   | More prevalent
Code style violations    | Baseline  | +23%   | Moderately higher

The +75% increase in logic and correctness errors is particularly concerning because these are the hardest defects to catch with automated tools. A function that compiles, passes linting, and achieves line coverage can still have fundamentally wrong logic.

1.3 The Confidence Paradox

A Stanford University study (2023, replicated in 2025) demonstrated a deeply counterintuitive finding: developers using AI coding assistants produced less secure code but reported higher confidence in the security of their code compared to developers writing code without AI assistance.

The mechanism: AI-generated code looks professional. It follows conventions. It compiles. It has comments. It handles the happy path well. This surface-level quality creates confidence. But surface quality is not the same as correctness, and especially not the same as security.

This confidence paradox means that teams using AI coding tools may actually need MORE rigorous testing than teams writing code manually, precisely because the code looks good enough to slip past casual review.

1.4 The Secret Leakage Problem

An analysis of GitHub Copilot repositories found a 6.4% rate of committed secrets: API keys, tokens, credentials, and connection strings. AI models trained on public code repositories have internalized the widespread bad practice of embedding secrets in code and reproduce it in suggestions.

This is not a "bad developer" problem; it is a systemic problem. The AI suggests embedding an API key because that is what the training data contains. The developer accepts the suggestion because it works. The secret goes to version control because the developer assumed the AI would follow best practices.


2. Key Failure Modes of AI-Generated Code

Understanding HOW AI code fails is essential to designing tests that catch those failures. AI code does not fail like human code. It has distinctive failure patterns.

2.1 Control-Flow Omissions

The most dangerous failure mode. AI-generated code often looks structurally correct (it has the right functions, the right return types, the right general flow) but skips critical control-flow elements.

What gets omitted:

  • Null checks: Function proceeds with the assumption that input is non-null, crashes or produces wrong results when null is passed.
  • Early returns: Missing guard clauses that should exit early on invalid state.
  • Exception handling: Try blocks without appropriate catch blocks, or catch blocks that swallow exceptions silently.
  • Boundary validation: No check for array bounds, no check for maximum values, no check for negative numbers where only positive are valid.
  • Resource cleanup: Files opened but not closed in error paths, database connections not returned to pool on exception, locks acquired but not released.

Example: AI-generated code with control-flow omission:

# AI-generated function
def process_transaction(account_id, amount):
    account = db.get_account(account_id)
    new_balance = account.balance - amount  # AttributeError if account is None
    account.balance = new_balance           # No check for negative balance
    db.save(account)
    return {"status": "success", "balance": new_balance}

# What was needed
def process_transaction(account_id, amount):
    if amount <= 0:
        raise ValueError("Amount must be positive")

    account = db.get_account(account_id)
    if account is None:
        raise AccountNotFoundError(f"Account {account_id} not found")

    if account.balance < amount:
        raise InsufficientFundsError(
            f"Balance {account.balance} insufficient for {amount}"
        )

    new_balance = account.balance - amount
    account.balance = new_balance

    try:
        db.save(account)
    except DatabaseError as e:
        logger.error(f"Failed to save transaction: {e}")
        raise TransactionFailedError("Transaction could not be completed") from e

    return {"status": "success", "balance": new_balance}

The AI-generated version is six lines. The correct version is roughly three times longer. The AI version achieves 100% line coverage with a single happy-path test, yet it has at least four exploitable bugs: null dereference on a missing account, no amount validation, no negative-balance check, and unhandled database failure.

2.2 Subtle Logic Errors That Pass Static Analysis

AI-generated code frequently contains logic errors that are syntactically valid, pass all linting rules, and even pass basic tests, but produce wrong results under specific conditions.

Common patterns:

  • Off-by-one errors in loops and boundary checks
  • Incorrect comparison operators (>= instead of >, && instead of ||)
  • Wrong variable used in conditional (checking user_a.id when user_b.id was intended)
  • Inverted boolean logic (function returns true when it should return false)
  • Incorrect order of operations in complex expressions

These errors are invisible to SAST because the code is syntactically correct. They are invisible to standard tests because developers test the same happy paths the AI optimized for. They are only caught by:

  • Mutation testing (which flips these exact conditions)
  • Thorough boundary testing (which exercises the edges where off-by-one errors manifest)
  • Manual code review by developers who understand the business logic
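A minimal sketch of the pattern (function and business rule are hypothetical): a single comparison-operator slip that lints clean and passes the happy-path tests, and is only exposed at the exact boundary that mutation or boundary testing exercises.

```python
# Hypothetical business rule: orders STRICTLY over $100 qualify for a discount.

def qualifies_for_discount(total):
    """AI-generated version: uses >= where the rule requires >."""
    return total >= 100              # subtle bug: syntactically valid, lints clean

def qualifies_for_discount_correct(total):
    return total > 100

# Happy-path tests pass against BOTH versions, so coverage alone misses the bug.
assert qualifies_for_discount(150) is True
assert qualifies_for_discount(50) is False

# Only a boundary test at exactly 100 exposes the defect:
assert qualifies_for_discount_correct(100) is False
assert qualifies_for_discount(100) is True   # violates the stated rule
```

A mutation tool would generate exactly this `>=`/`>` flip; a test suite that never probes the value 100 lets the mutant survive.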

2.3 Security Guardrails Assumed but Not Implemented

AI models generate code based on patterns in their training data. If the training data frequently imports a security library and calls a sanitization function, the AI may assume those are available, but never actually import or call them.

Example patterns:

  • Code references a CSRF token but never validates it
  • Code mentions input sanitization in comments but uses raw input
  • Code imports an authentication decorator but does not apply it to all endpoints
  • Code references environment variables for secrets but also has a hardcoded fallback
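The last pattern above can be made concrete with a small illustrative sketch (the variable name and fallback value are ours): the secret is nominally read from the environment, but a hardcoded fallback quietly defeats the guardrail.

```python
import os

# AI-suggested pattern: "reads from the environment" -- but the hardcoded
# fallback ships a secret to version control anyway.
API_KEY = os.environ.get("PAYMENT_API_KEY", "sk_live_fallback_123")  # BAD

def get_api_key():
    """Safer version: fail closed when the environment variable is missing."""
    key = os.environ.get("PAYMENT_API_KEY")
    if not key:
        raise RuntimeError("PAYMENT_API_KEY is not set")
    return key
```

The fix is behavioral, not cosmetic: a missing secret should be a loud configuration error, never a silent fallback.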

2.4 Training Data Pattern Mismatch

AI models suggest solutions based on patterns from millions of code repositories. But those patterns may not fit the current context:

  • Framework version differences (AI suggests deprecated API calls)
  • Organizational coding standards (AI follows internet conventions, not your standards)
  • Architecture mismatches (AI generates monolithic patterns in a microservice codebase)
  • Security policy conflicts (AI uses JWT when the organization uses session-based authentication)

2.5 Dependency on Outdated or Vulnerable Libraries

AI models have a training data cutoff. They suggest libraries based on what was popular during training, which may be:

  • Deprecated or unmaintained
  • Known vulnerable (CVEs published after training cutoff)
  • License-incompatible with the project
  • Replaced by better alternatives

3. Mutation Testing as the Critical Quality Gate

3.1 Why Coverage Metrics Are Misleading for AI Code

When a developer writes code and then writes tests, the tests tend to cover the edge cases the developer was thinking about while writing the code. There is a mental model shared between the code and the tests.

When AI generates code and a developer (or AI) writes tests, this shared mental model does not exist. Tests cover the obvious paths, the ones that look like they need testing. The subtle paths (null handling, error recovery, boundary conditions) are untested because neither the AI nor the developer focused on them.

The result: AI-generated code can achieve 85%+ line coverage with tests that would miss 60%+ of injected mutations. The coverage number looks good. The test suite is ineffective.
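The gap is easy to demonstrate in miniature (function names are ours): a test with full line coverage but no assertion executes a mutant just as happily as the original code, so the mutant survives.

```python
def apply_fee(balance, fee):
    return balance - fee

def apply_fee_mutant(balance, fee):
    """What a mutation tool generates: '-' flipped to '+'."""
    return balance + fee

def weak_test(fn):
    """100% line coverage, zero assertions: cannot kill any mutant."""
    fn(100, 5)
    return True

def strong_test(fn):
    """Asserts on behavior: kills the operator-flip mutant."""
    return fn(100, 5) == 95

assert weak_test(apply_fee) and weak_test(apply_fee_mutant)          # mutant survives
assert strong_test(apply_fee) and not strong_test(apply_fee_mutant)  # mutant killed
```

Coverage reports would score `weak_test` and `strong_test` identically; only the mutation score tells them apart.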

3.2 Meta’s Research: LLMs Make Mutation Testing Practical

Historically, mutation testing was too computationally expensive for widespread adoption. Meta's research demonstrated that LLMs fundamentally change this equation:

  1. LLMs predict surviving mutants: Instead of running every mutant against the test suite, an LLM can analyze the code and tests to predict which mutants will survive. This reduces computation by 60-80%.
  2. LLMs generate targeted tests: For surviving mutants, the LLM generates tests specifically designed to kill them. This closes the feedback loop without requiring manual test writing.
  3. LLMs identify equivalent mutants: Some mutations produce semantically identical code (e.g., reordering independent statements). LLMs can identify these, eliminating false negatives from the mutation score.

This makes mutation testing practical at Meta's scale: millions of lines of code, continuous integration, thousands of changes per day. If it works at Meta's scale, it works at yours.

3.3 MutGen: Mutation Feedback Loops

MutGen formalizes the LLM + mutation testing feedback loop:

+--------------------------------------------------+
|                                                  |
|   1. Code unit under test                        |
|          |                                       |
|          v                                       |
|   2. LLM generates initial test suite            |
|          |                                       |
|          v                                       |
|   3. Mutation testing evaluates tests            |
|          |                                       |
|          +-- All mutants killed -> DONE          |
|          |                                       |
|          v  (surviving mutants)                  |
|   4. Feed survivors back to LLM:                 |
|      "These mutations survived. Generate         |
|       tests that detect them."                   |
|          |                                       |
|          v                                       |
|   5. LLM generates targeted tests                |
|          |                                       |
|          +-- Go to step 3                        |
|                                                  |
+--------------------------------------------------+

Research shows MutGen achieves 15-30% higher mutation scores compared to either standalone LLM test generation or standalone mutation testing. The combination is more than the sum of its parts.

3.4 Target Mutation Scores for AI-Generated Code

Code Category                    | Minimum Mutation Score | Rationale
AI-generated, security-critical  | ≥70% | Authentication, authorization, data handling, crypto
AI-generated, business logic     | ≥60% | Financial calculations, workflow logic, data processing
AI-generated, general utility    | ≥50% | Helper functions, formatters, UI logic
Human-written, security-critical | ≥70% | Same standard regardless of author
Human-written, general           | ≥40% | Baseline for mature testing programs

The higher thresholds for AI-generated code reflect the empirically measured 2.74x vulnerability rate. The code is statistically less reliable, so the verification bar must be proportionally higher.
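The thresholds above can be encoded as a simple CI gate. A minimal sketch (the category keys and function name are our own, not a standard tool):

```python
# Minimum mutation scores from the table above, keyed by (author, category).
MIN_MUTATION_SCORE = {
    ("ai", "security-critical"): 0.70,
    ("ai", "business-logic"): 0.60,
    ("ai", "general-utility"): 0.50,
    ("human", "security-critical"): 0.70,
    ("human", "general"): 0.40,
}

def mutation_gate(author, category, score):
    """Return True if a module's mutation score meets its minimum threshold."""
    return score >= MIN_MUTATION_SCORE[(author, category)]

# A 65% score passes for AI business logic but fails AI security-critical code.
assert mutation_gate("ai", "business-logic", 0.65)
assert not mutation_gate("ai", "security-critical", 0.65)
```

A real pipeline would read the score from the mutation tool's report and fail the build when the gate returns False.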


4. The 7-Step Verification Workflow

Every AI-generated code contribution must pass through this workflow before merge. No shortcuts. No exceptions for "simple" changes (simple AI-generated changes are where the most subtle bugs hide).

Step 1: AI Generates Code

The developer uses an AI coding assistant (Copilot, Claude, Cursor, etc.) to generate code. The code is generated in a feature branch.

At this step: The developer captures provenance metadata (which AI model, which prompt, which version of the tool). This metadata is stored for traceability (Step 6).

Step 2: Developer Reviews for Functional Correctness

The developer who requested the AI-generated code reviews it for:

  • Does it solve the stated problem?
  • Does it handle edge cases?
  • Does it follow organizational coding standards?
  • Does the logic make sense for the business context?
  • Are there control-flow omissions (null checks, error handling, resource cleanup)?

This is not optional. The developer is accountable for all code they submit, regardless of who or what generated it. "The AI wrote it" is not an acceptable response when a bug reaches production.

Step 3: Automated SAST/DAST Scan

The standard security scanning pipeline (Module 5.2) runs against the AI-generated code:

  • SAST: Semgrep, Checkmarx, or equivalent scans for known vulnerability patterns.
  • SCA: Dependency check on any new libraries the AI introduced.
  • Secrets detection: scan for hardcoded API keys, tokens, credentials.
  • DAST (if applicable): if the change is deployed to a test environment, dynamic scanning.

AI-specific SAST rules:

# Semgrep rule: flag AI code that handles user input without validation
rules:
  - id: ai-code-unvalidated-input
    patterns:
      - pattern: |
          def $FUNC(..., $INPUT, ...):
              ...
              db.query($INPUT)
    message: "User input passed directly to database query without validation"
    severity: ERROR
    metadata:
      category: security
      ai-relevance: "Common AI-generated pattern: missing input validation"

Step 4: Mutation Testing

This is the critical gate. Run mutation testing on the AI-generated code:

  1. Generate mutants for all AI-generated functions.
  2. Run existing tests against each mutant.
  3. Calculate mutation score.
  4. If score < threshold: FAIL. Generate additional tests (manually or via MutGen) until threshold is met.

# Example with mutmut for Python
mutmut run --paths-to-mutate=src/feature/ai_generated_module.py

# Example with Stryker for TypeScript
npx stryker run --mutate="src/feature/aiGenerated*.ts"

# Check results
# Target: ≥60% mutation score for AI-generated code

Step 5: Manual Security Review

A developer with security training (or an AppSec engineer for high-risk code) reviews specifically:

  • Control flow: Are all paths through the code handled? Null cases, error cases, timeout cases, empty input cases?
  • Authentication paths: Does every endpoint that requires authentication actually enforce it?
  • Authorization paths: Are access control checks applied consistently? Can they be bypassed by manipulating parameters?
  • Error handling: Do error paths leak information? Do they fail closed (deny access on error) or fail open (grant access on error)?
  • Input validation: Is every input from an untrusted source validated before use?
  • Output encoding: Is output properly encoded for the context (HTML, SQL, URL, JavaScript)?

This manual review explicitly targets the failure modes described in Section 2.

Step 6: License and Provenance Check

Before merge, verify:

License compliance:

  • Are all dependencies introduced by the AI compatible with your project's license?
  • Does the AI-generated code resemble any specific open-source code closely enough to raise license concerns?
  • Has the AI introduced any copyleft-licensed dependencies in a proprietary project?

Provenance tracking:

  • Record which AI model generated the code (model name, version).
  • Record the prompt or context that produced the generation.
  • Record the date of generation.
  • Store this metadata in the version control system (git trailers, commit metadata, or a dedicated provenance file).

# Git trailer approach for provenance
git commit -m "Add payment validation logic

AI-Model: Claude Opus 4.6
AI-Tool: Claude Code v1.x
AI-Date: 2026-03-19
Reviewed-By: developer@example.com
Mutation-Score: 67%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>"
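Once provenance lives in git trailers, it can be harvested for the metrics in Section 8. A minimal parser sketch (the key names follow the example trailer above; a real pipeline might use `git interpret-trailers` instead):

```python
def parse_ai_trailers(commit_message):
    """Extract AI provenance trailers (AI-*, Mutation-Score, Reviewed-By)."""
    trailers = {}
    for line in commit_message.splitlines():
        if ": " not in line:
            continue
        key, value = line.split(": ", 1)
        if key.startswith("AI-") or key in ("Mutation-Score", "Reviewed-By"):
            trailers[key] = value.strip()
    return trailers

message = """Add payment validation logic

AI-Model: Claude Opus 4.6
AI-Tool: Claude Code v1.x
Mutation-Score: 67%
Reviewed-By: developer@example.com
"""
provenance = parse_ai_trailers(message)
assert provenance["AI-Model"] == "Claude Opus 4.6"
assert provenance["Mutation-Score"] == "67%"
```

Aggregating these records per repository yields the AI-tagged commit set needed for the vulnerability-rate and defect-share metrics.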

Step 7: Peer Review with AI-Generated Tag

The pull request goes through standard peer review with explicit labeling:

  • PR description includes [AI-Generated] or [AI-Assisted] tag.
  • Reviewers are informed that the code was AI-generated and should apply the appropriate level of scrutiny (higher than for human-written code, per the data in Section 1).
  • The mutation score is included in the PR description.
  • SAST/SCA results are linked.
  • Provenance metadata is visible.

## Pull Request: Add payment validation logic [AI-Generated]

**AI Model**: Claude Opus 4.6
**Mutation Score**: 67% (target: ≥60%)
**SAST**: 0 critical, 0 high, 2 medium (both false positives, documented)
**SCA**: No new dependencies
**Manual Security Review**: Completed by @security-engineer, no findings

### Changes
- Added `validate_payment()` function with input validation, amount bounds checking, and currency validation
- Added 24 unit tests including 8 boundary tests and 6 negative tests

5. Provenance Tracking and Attribution

5.1 AI BOM β€” Extending SBOM for AI

Traditional Software Bill of Materials (SBOM) formats, SPDX and CycloneDX, were designed to track software components: libraries, packages, modules. They do not account for AI-specific components.

An AI Bill of Materials (AI BOM) extends the traditional SBOM with:

Component                | What to Track
AI models used           | Model name, version, provider, parameter count
Training data provenance | Data sources, licenses, known biases, collection methods
Embeddings               | Embedding models, vector databases, index versions
Orchestration layers     | LangChain/LlamaIndex versions, chain configurations
Retraining processes     | Fine-tuning datasets, training parameters, evaluation metrics
AI-generated code        | Functions/modules generated by AI, model and prompt used

5.2 CISA AI SBOM Use Case Guide

The Cybersecurity and Infrastructure Security Agency (CISA) published guidance on AI-specific SBOM use cases, recognizing that traditional SBOMs are insufficient for AI systems. Key recommendations:

  • Track all AI models as components in the SBOM.
  • Include model cards (standardized documentation of model capabilities, limitations, and intended use) as SBOM metadata.
  • Document the provenance of training data, including data licensing and consent.
  • Track model versions as distinct components (a retrained model is a new component, not an update).

5.3 SBOM Mandates in 2026

SBOM requirements are tightening globally:

  • US Executive Order 14028 (2021) mandated SBOMs for software sold to the federal government. Enforcement has tightened steadily since.
  • EU Cyber Resilience Act (effective 2025-2026) requires SBOMs for all products with digital elements sold in the EU.
  • EU AI Act GPAI (General Purpose AI) obligations require provenance audits for AI models and training data.
  • Industry standards (PCI-DSS v4.0, NIST CSF 2.0) increasingly reference SBOMs as expected practice.

5.4 The 25% Visibility Gap

Research indicates that 25% of organizations are unsure which AI services and datasets are active in their environment. AI tools are adopted bottom-up (individual developers sign up for AI coding assistants, data scientists deploy models from Hugging Face, operations teams use AI-powered monitoring tools) without centralized inventory.

This is a governance failure. You cannot secure what you cannot see. AI BOM practices address this by requiring every AI component to be inventoried, documented, and tracked through its lifecycle.


6. SBOM for AI Components

6.1 Limitations of Traditional SBOM

SPDX and CycloneDX can express software package dependencies with version numbers, licenses, and known vulnerabilities. They cannot natively express:

  • Model weights (binary blobs with no version semantics)
  • Training data provenance (what data was the model trained on? Under what license? With what consent?)
  • Embedding dimensions and similarity metrics (critical for understanding AI system behavior)
  • Inference configuration (temperature, top-p, system prompts, all of which affect AI behavior)
  • Model-to-model dependencies (a RAG system depends on both an LLM and an embedding model)

6.2 AI BOM Additional Fields

Proposed extensions to standard SBOM formats for AI components:

{
  "ai_components": [
    {
      "name": "Claude Opus 4.6",
      "type": "llm",
      "provider": "Anthropic",
      "version": "claude-opus-4-6",
      "purpose": "code generation",
      "training_data": {
        "cutoff_date": "2025-05",
        "known_biases": "See model card",
        "license": "Anthropic Terms of Service"
      },
      "usage": {
        "functions_generated": [
          "src/payment/validate.py:validate_payment",
          "src/auth/mfa.py:verify_totp"
        ],
        "generation_date": "2026-03-19",
        "prompt_hash": "sha256:abc123...",
        "review_status": "peer_reviewed"
      }
    },
    {
      "name": "nomic-embed-text",
      "type": "embedding_model",
      "provider": "Nomic AI",
      "version": "1.5",
      "purpose": "semantic search",
      "dimensions": 768,
      "license": "Apache 2.0"
    }
  ]
}

6.3 EU AI Act GPAI Obligations

The EU AI Act imposes specific obligations on providers of General Purpose AI models:

  • Transparency: Document the model's capabilities, limitations, and intended use.
  • Training data: Provide sufficiently detailed summaries of training data, including copyrighted content.
  • Technical documentation: Maintain documentation enabling downstream providers to comply with their obligations.
  • Provenance audits: Enable tracing of AI-generated content and code back to the model that produced it.

For organizations using AI to generate code, this means: if your AI-generated code is part of a product sold in the EU, you must be able to trace which code was AI-generated, which model generated it, and what the model's training data included.


7. Layered Testing Strategy for AI-Generated Code

A single testing technique is insufficient. AI-generated code requires a layered defense where each layer catches what the previous layers miss.

Layer 1: Traditional Unit and Integration Tests (Baseline)

Standard testing as described in Module 5.1. This is the foundation: necessary but insufficient for AI-generated code.

  • Unit tests for individual functions with assertions (not just invocations).
  • Integration tests for component interactions.
  • Coverage gates: ≥80% line coverage, ≥75% branch coverage.

Layer 2: Mutation Testing (Real Effectiveness Measurement)

As described in Section 3. This is the layer that distinguishes "code was executed" from "tests detect defects."

  • Run mutation testing on all AI-generated modules.
  • Target: ≥60% mutation score (≥70% for security-critical code).
  • Surviving mutants are either killed with new tests or documented as accepted risk.

Layer 3: Security-Specific Test Suites (OWASP Patterns)

Targeted security tests based on OWASP vulnerability patterns:

# Example: security test suite for an AI-generated API endpoint
class TestPaymentEndpointSecurity:

    def test_sql_injection_in_amount(self):
        """AI code often passes user input directly to queries"""
        response = client.post("/api/pay", json={"amount": "100; DROP TABLE users"})
        assert response.status_code == 400

    def test_negative_amount(self):
        """AI code often skips business rule validation"""
        response = client.post("/api/pay", json={"amount": -100})
        assert response.status_code == 400

    def test_zero_amount(self):
        """AI code often skips boundary checks"""
        response = client.post("/api/pay", json={"amount": 0})
        assert response.status_code == 400

    def test_exceeds_maximum(self):
        """AI code often omits upper bounds"""
        response = client.post("/api/pay", json={"amount": 999999999999})
        assert response.status_code == 400

    def test_unauthenticated_access(self):
        """AI code sometimes forgets auth middleware"""
        response = client.post("/api/pay", json={"amount": 100})
        # No auth header
        assert response.status_code == 401

    def test_other_users_account(self):
        """AI code often skips authorization checks"""
        response = client.post("/api/pay",
                               json={"amount": 100, "account_id": OTHER_USER_ACCOUNT},
                               headers=auth_header(USER_A))
        assert response.status_code == 403

Layer 4: Fuzz Testing (AI Input Handling)

AI-generated input handling code is particularly vulnerable to unexpected inputs. Fuzz testing generates thousands of random, malformed, and adversarial inputs.

  • Tools: Atheris (Python), go-fuzz (Go), AFL (C/C++), jazzer (Java)
  • Focus areas: All AI-generated functions that accept external input (HTTP parameters, file uploads, API payloads, CLI arguments)
  • Duration: Run for at least 1 hour per target function (time-boxed)
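The idea can be sketched without a dedicated tool. Real fuzzers such as Atheris are coverage-guided; this illustrative loop (target function and corpus are ours) just hammers a hypothetical AI-generated parser with random and adversarial inputs and records anything other than the documented failure mode:

```python
import random
import string

def parse_amount(raw):
    """Hypothetical AI-generated target: parse a payment amount string."""
    value = float(raw)
    if value <= 0:
        raise ValueError("amount must be positive")
    return round(value, 2)

def fuzz(target, iterations=2000, seed=0):
    """Return inputs that raised anything OTHER than the documented ValueError."""
    rng = random.Random(seed)
    corpus = ["", "NaN", "inf", "-1", "0", "1e308", " 10 ", "10,00", None]
    findings = []
    for _ in range(iterations):
        if rng.random() < 0.5:
            raw = rng.choice(corpus)
        else:
            raw = "".join(rng.choice(string.printable)
                          for _ in range(rng.randint(0, 12)))
        try:
            target(raw)
        except ValueError:
            pass                      # documented, acceptable failure mode
        except Exception as exc:      # anything else is a crash: a finding
            findings.append((raw, type(exc).__name__))
    return findings
```

With `None` in the corpus, `float(None)` raises `TypeError`, a crash the happy-path tests never provoke; the fuzz run surfaces it in seconds.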

Layer 5: Manual Review of Control Flow and Error Handling

Human review targeting the specific failure modes in Section 2:

Checklist for reviewers:

  • All input parameters validated for type, range, and format
  • Null/None/undefined checked before dereferencing
  • All error paths return appropriate responses (not stack traces, not raw errors)
  • Resources (files, connections, locks) released in all paths including error paths
  • Authentication enforced on all endpoints that require it
  • Authorization checked for the specific resource being accessed
  • No hardcoded secrets, tokens, or credentials
  • No commented-out code left by the AI
  • No TODO/FIXME comments without linked tickets

Layer 6: Provenance Tagging in VCS

All AI-generated code is tagged in version control:

  • Git commit messages include AI model and tool information.
  • PR labels identify AI-generated content.
  • A provenance manifest file tracks which files/functions are AI-generated.

This enables future analysis: "What percentage of our production bugs originated in AI-generated code?"

Layer 7: License Scanning with AI-Specific Rules

Standard license scanning (FOSSA, Black Duck) plus AI-specific checks:

  • Does the AI-generated code closely match any specific open-source function? (use code similarity tools)
  • Are all AI-introduced dependencies compatible with the project license?
  • Does the AI model's license permit commercial use of its outputs?
  • Are there any patent or IP concerns with the specific AI tool used?

8. Metrics Specific to AI Code Testing

Organizations using AI code generation must track these metrics to understand and manage their AI-related risk.

8.1 AI Vulnerability Introduction Rate

What: The ratio of vulnerabilities in AI-generated code vs. human-written code.
Benchmark: 2.74x (Veracode 2025).
Target: Reduce over time through better prompting, better review, and better testing. A team consistently at 2.74x or higher has a process problem.
Measurement: Tag all AI-generated commits (Step 6). Compare SAST finding density in AI-tagged vs. non-tagged commits.
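A sketch of the computation (function name and sample numbers are illustrative): finding density per KLOC in AI-tagged commits divided by the density elsewhere.

```python
def ai_vuln_introduction_rate(ai_findings, ai_kloc, human_findings, human_kloc):
    """Ratio of SAST finding density (findings per KLOC), AI vs. human code."""
    return (ai_findings / ai_kloc) / (human_findings / human_kloc)

# e.g. 41 findings in 10 KLOC of AI-tagged code vs. 30 in 20 KLOC elsewhere:
rate = ai_vuln_introduction_rate(41, 10, 30, 20)
assert round(rate, 2) == 2.73        # near the 2.74x benchmark
```

Density, not raw counts, is what makes the comparison fair when the AI-generated share of the codebase differs from the human-written share.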

8.2 AI Code Mutation Score

What: Average mutation score of AI-generated code modules.
Target: ≥60% for business logic, ≥70% for security-critical code.
Measurement: Track mutation scores per module, filter by AI-generation tag.
Action: Modules consistently below target need either better tests or replacement of AI-generated code with human-written code.

8.3 Human Override Rate

What: Percentage of AI suggestions that are modified by the developer before commit.
Benchmark: A healthy rate is 30-50%. Below 20% suggests developers are accepting AI suggestions without review. Above 70% suggests the AI tool is not providing useful suggestions.
Measurement: Compare the initial AI suggestion (if captured) with the final committed code.
Action: An override rate below 20% triggers a training intervention; developers may not be reviewing AI output critically.
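If suggestion/commit pairs are captured, the rate can be approximated with a text-similarity heuristic. A sketch using the standard library (the 0.95 threshold is our own tuning assumption, not an established cutoff):

```python
import difflib

def override_rate(pairs, threshold=0.95):
    """Fraction of (suggested, committed) pairs the developer materially changed."""
    overridden = sum(
        1 for suggested, committed in pairs
        if difflib.SequenceMatcher(None, suggested, committed).ratio() < threshold
    )
    return overridden / len(pairs)

pairs = [
    ("return total >= 100", "return total >= 100"),              # accepted verbatim
    ("db.query(user_input)", "db.query(sanitize(user_input))"),  # modified
]
assert override_rate(pairs) == 0.5
```

Whitespace-only edits score near 1.0 and count as accepted, which is the desired behavior: only substantive changes should register as overrides.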

8.4 AI-Attributable Defect Rate

What: Percentage of production defects traced to AI-generated code.
Measurement: During the post-mortem/RCA for each production defect, determine whether the root cause is in AI-generated code (using provenance tags).
Target: Proportional to the percentage of the codebase that is AI-generated. If 30% of code is AI-generated but 60% of defects trace to AI code, the testing process is insufficient.

8.5 Time-to-Detection for AI Bugs vs. Human Bugs

What: Average time from code merge to defect detection, segmented by AI vs. human origin.
Purpose: Determine whether AI-generated bugs are harder to detect (they often are, because they look correct on the surface).
Action: If AI bugs have significantly longer time-to-detection, increase pre-merge testing intensity for AI code.

8.6 Secret Leakage Rate

What: Number of secrets detected in AI-generated commits vs. total AI commits.
Benchmark: 6.4% baseline (Copilot repository analysis).
Target: 0%. Every secret detected pre-commit is a success. Every secret reaching the repository is a failure.
Measurement: Pre-commit hook detection rate + repository scanning detection rate.
Action: Secrets detected in any commit (AI or human) trigger immediate rotation and remediation.

8.7 Dashboard Design

+------------------------------------------------------------------+
|                    AI Code Quality Dashboard                     |
+----------------------+----------------------+--------------------+
| AI Vuln Rate         | Mutation Score       | Override Rate      |
| 2.1x (down from 2.7) | 64% (up from 52%)    | 37% (within range) |
+----------------------+----------------------+--------------------+
| AI Defect Share      | Detection Time       | Secret Leaks       |
| 28% (target: 25%)    | 4.2d AI / 3.8d human | 0 this month       |
+----------------------+----------------------+--------------------+
| Trend: AI code quality improving. Mutation score up 12 points    |
| since implementing MutGen feedback loops.                        |
| Action: Schedule review for modules with mutation score <50%     |
+------------------------------------------------------------------+

9. Organizational Readiness

9.1 Developer Training

Before developers use AI code generation tools, they must understand:

  • The statistical reality of AI code quality (Section 1).
  • The failure modes to watch for (Section 2).
  • The verification workflow they must follow (Section 4).
  • That they are accountable for all code they submit, regardless of origin.

This is not checkbox training — it is a calibration exercise. Developers must internalize that AI output requires MORE scrutiny than human-written code, not less.

9.2 Tool Chain Readiness

The verification workflow requires specific tool chain capabilities:

  • SAST tool with AI-specific rules deployed
  • Mutation testing tool integrated into CI/CD
  • Secrets detection in pre-commit hooks
  • SCA tool with license and reachability analysis
  • Provenance tracking mechanism in version control
  • AI BOM generation capability
  • Metrics dashboard tracking AI-specific KPIs

9.3 Policy Framework

Organizational policy should explicitly address:

  • Which AI tools are approved for code generation
  • What code categories permit AI generation (and which do not — e.g., cryptographic implementations)
  • What verification steps are mandatory before merge
  • How AI-generated code is labeled and tracked
  • How AI-related metrics are reported and acted upon
  • Consequences for bypassing the verification workflow
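The "labeled and tracked" policy bullet can be enforced mechanically at commit time. The sketch below checks commit messages for provenance trailers; the trailer names `AI-Model:` and `AI-Prompt-Id:` are hypothetical conventions, not a standard — adapt them to your own policy.

```python
# Sketch of enforcing AI-provenance labeling via commit-message
# trailers. The trailer names below are hypothetical conventions,
# not a standard; adapt them to your organizational policy.
REQUIRED_TRAILERS = ("AI-Model:", "AI-Prompt-Id:")

def check_provenance(commit_msg: str, ai_assisted: bool) -> list[str]:
    """Return the missing trailers for a commit flagged as AI-assisted."""
    if not ai_assisted:
        return []
    return [t for t in REQUIRED_TRAILERS if t not in commit_msg]

msg = "Add retry logic\n\nAI-Model: example-model-v1\nAI-Prompt-Id: 4711"
# check_provenance(msg, ai_assisted=True) -> [] (commit passes)
# check_provenance("Add retry logic", ai_assisted=True) -> both trailers missing
```

Run as a commit-msg hook or server-side check, this also produces the origin tags the Section 8 metrics depend on.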

10. Key Takeaways

  1. AI-generated code is measurably less secure. 2.74x more vulnerabilities, 75% more logic errors, 57% more security findings. This is not speculation — it is empirical measurement across millions of scans.
  2. Coverage metrics lie about AI code quality. High coverage with low mutation scores is the signature of AI-generated test suites. Mutation testing is the only reliable quality gate.
  3. The 7-step verification workflow is non-negotiable. AI generation, developer review, SAST/DAST, mutation testing, manual security review, provenance check, peer review with AI tag. Every step catches what previous steps miss.
  4. Provenance tracking enables accountability. If you cannot identify which code is AI-generated, you cannot measure AI risk, track AI defects, or comply with emerging regulations (EU AI Act, SBOM mandates).
  5. Track AI-specific metrics or fly blind. Vulnerability introduction rate, mutation score, override rate, attributable defect rate, time-to-detection, and secret leakage rate are the minimum viable set.
  6. Developers remain accountable for all code they submit. "The AI wrote it" is not a defense. AI is a tool. The developer is the professional.

Review Questions

  1. A developer argues that AI-generated code does not need extra testing because it passed SAST with zero findings and achieved 90% line coverage. Using data from this module, construct a counter-argument.

  2. Your team's AI code mutation score is 35%. Describe the specific steps you would take to raise it to 60%, including which tools you would use and what feedback loops you would implement.

  3. Design an AI BOM entry for a feature that uses an LLM to generate email summaries. Include all relevant fields from Section 5.

  4. A production outage is traced to a null pointer exception in AI-generated code that had 95% line coverage and passed all SAST scans. Which failure mode from Section 2 does this represent, and which step of the 7-step workflow should have caught it?

  5. Your organization has 40% of its codebase AI-generated but 65% of production defects trace to AI code. What metrics from Section 8 would you examine first, and what organizational changes would you recommend?


References

  • CIS Controls v8, Safeguard 16.12 — Implement Code-Level Security Checks
  • Veracode, "State of Software Security 2025" — AI-generated code vulnerability analysis
  • CodeRabbit, "AI vs Human Pull Request Quality Analysis" (2025)
  • Stanford University, "Do Users Write More Insecure Code with AI Assistants?" (2023)
  • Meta Engineering, "MutGen: Mutation-Guided LLM Test Generation" (2025)
  • Meta Engineering, "JiTTest: Just-in-Time Testing with LLMs" (2026)
  • CISA, "AI SBOM Use Case Guide" (2025)
  • EU AI Act — General Purpose AI Model Obligations
  • EU Cyber Resilience Act — SBOM Requirements
  • SPDX Specification — https://spdx.dev
  • CycloneDX Specification — https://cyclonedx.org
  • OWASP SAMM — Verification Domain
  • NIST SSDF — Secure Software Development Framework

Study Guide

Key Takeaways

  1. AI code has 2.74x more vulnerabilities — Veracode 2025 data; consistent across languages, frameworks, and vulnerability types.
  2. Confidence paradox — Stanford found developers using AI produce less secure code but report higher confidence in its security.
  3. Control-flow omissions are the most dangerous failure mode — AI code looks correct but skips null checks, error handling, and boundary validation.
  4. 7-step verification workflow is non-negotiable — AI generation, developer review, SAST/DAST, mutation testing, manual security review, provenance check, peer review with AI tag.
  5. Mutation testing is the critical quality gate — AI code can hit 85%+ coverage while missing 60%+ of injected mutations.
  6. 6.4% secret leakage rate in Copilot repos — 40% higher than baseline; AI reproduces training-data patterns, including hardcoded credentials.
  7. Provenance tracking enables accountability — record which model, prompt, and date produced every AI-generated function; required for EU AI Act compliance.
  8. Healthy human override rate is 30-50% — below 20% suggests blind acceptance; above 70% suggests the tool provides little value.
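The override-rate bands in takeaway 8 can be encoded as a simple classifier. This is an illustrative sketch; the intermediate "monitor" label for rates between the stated bands is an assumption, not from the source data:

```python
# Classify a team's human override rate against the healthy 30-50% band.
# The "monitor" label for in-between rates is an illustrative assumption.
def override_band(modified: int, accepted_total: int) -> str:
    rate = modified / accepted_total
    if rate < 0.20:
        return "blind acceptance risk"   # suggestions merged largely unreviewed
    if rate > 0.70:
        return "low tool value"          # most suggestions discarded
    return "healthy" if 0.30 <= rate <= 0.50 else "monitor"

# The dashboard's 37% override rate lands in the healthy band.
```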

Important Definitions

  • Confidence Paradox — developers using AI write less secure code but believe it is more secure.
  • Control-Flow Omission — AI code that skips null checks, error handling, and boundary validation while looking correct.
  • AI BOM — AI Bill of Materials; extends the SBOM to track AI models, training data, and AI-generated code.
  • Provenance Tracking — recording which AI model, prompt, and date produced each code generation.
  • Human Override Rate — percentage of AI suggestions modified before commit; healthy range is 30-50%.
  • AI-Attributable Defect Rate — production defects traced to AI-generated code, as a proportion of total defects.
  • Slopsquatting — attackers registering package names that AI models hallucinate (a 19.7% hallucination rate).
  • MutGen — an LLM-plus-mutation-testing feedback loop achieving 15-30% higher mutation scores.
  • CISA AI SBOM — guidance on extending SBOM practices to cover AI models and training data.
  • 25% Visibility Gap — the share of organizations unsure which AI services and datasets are active in their environment.

Quick Reference

  • Mutation Score Targets: AI security-critical >=70%, AI business logic >=60%, AI general utility >=50%
  • Key Metrics: Vulnerability introduction rate (2.74x baseline), mutation score, override rate (30-50%), secret leakage (0% target), AI defect rate
  • Coverage Thresholds for AI Code: Line >=80%, Branch >=75% (same as human but mutation testing is mandatory)
  • Regulatory: EU AI Act GPAI obligations, EU Cyber Resilience Act SBOMs, US EO 14028
  • Common Pitfalls: Trusting coverage metrics for AI code, accepting AI suggestions without review, not tagging AI-generated code, missing provenance tracking, skipping mutation testing step
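The mutation-score targets in this quick reference translate directly into a CI gate. A minimal sketch, assuming per-category thresholds and killed/total mutant counts reported by your mutation testing tool:

```python
# Sketch of a CI gate for the mutation-score targets above.
# Category names and the gate() helper are illustrative.
TARGETS = {
    "security-critical": 0.70,
    "business-logic": 0.60,
    "general-utility": 0.50,
}

def gate(category: str, killed: int, total: int) -> bool:
    """True if killed/total mutants meets the category's target score."""
    return (killed / total) >= TARGETS[category]

# Example: 36 of 50 mutants killed (72%) clears the 70% bar;
# 25 of 50 (50%) fails the 60% business-logic bar.
```

Failing the gate should block the merge for AI-tagged code, per step 4 of the verification workflow.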

Review Questions

  1. A developer argues AI code passed SAST with zero findings and 90% coverage — construct a counter-argument using data from this module.
  2. Your team's AI mutation score is 35% — describe specific steps to raise it to 60%, including tools and feedback loops.
  3. A production outage from a null pointer in AI code with 95% coverage — which failure mode does this represent, and which workflow step should have caught it?
  4. Design an AI BOM entry for a feature using an LLM to generate email summaries, including all relevant provenance fields.
  5. Your organization has 40% AI-generated code but 65% of defects trace to AI code — what metrics would you examine and what changes would you recommend?
Knowledge Check
Q1. According to Veracode's 2025 report, how many times more vulnerabilities does AI-generated code contain compared to human-written code?

Q2. What is the 'confidence paradox' identified by Stanford University research regarding AI coding assistants?

Q3. What is the most dangerous failure mode of AI-generated code identified in the module?

Q4. What rate of committed secrets was found in GitHub Copilot repositories?

Q5. How many steps are in the recommended verification workflow for AI-generated code?

Q6. What is the minimum mutation score target for AI-generated security-critical code?

Q7. What does a healthy human override rate for AI code suggestions look like?

Q8. According to CodeRabbit's analysis, how much higher is the rate of logic and correctness errors in AI-generated PRs compared to human-written PRs?

Q9. What does the 25% visibility gap refer to in the context of AI governance?

Q10. If 30% of an organization's codebase is AI-generated but 60% of production defects trace to AI code, what does this indicate?
