
Adaptive Secure Python Code Generation via LLM Feedback Loops

A multi-round code generation pipeline that integrates Bandit static analysis directly into the LLM generation loop, using an adaptive security profile to dynamically re-weight prompt instructions based on detected vulnerabilities—driving output toward progressively safer code without fine-tuning.

Tags: python, llm, bandit, security, fastapi, prompt-engineering, anthropic, openai

Overview

Modern LLMs produce impressive code, but "impressive" and "secure" are not synonyms. Models trained on public repositories inherit the security antipatterns endemic to that corpus — hardcoded credentials, shell injection via subprocess calls with shell=True, unsafe deserialization with pickle.loads, and SQL assembled through string concatenation are all well-represented in training data and therefore in model outputs.

This project — SecureGen — takes a different approach. Rather than post-hoc auditing or manual prompt hardening, it builds security feedback directly into the generation loop. Each round of generation is immediately evaluated by Bandit, a Python static analysis tool that originated in the OpenStack Security Project and is now maintained by PyCQA. The scanner's findings are translated into numeric weight updates for a persistent security profile, which governs the relative emphasis of ten security rules in the next round's prompt. The result is a self-tightening loop: the more a particular vulnerability class appears in output, the more assertively the prompt addresses it.


Architecture & Module Design

SecureGen is structured as a FastAPI service with a clean separation of concerns across six Python modules. The design follows the Ports and Adapters pattern: the core pipeline logic in pipeline.py is decoupled from both the input surface (main.py) and the LLM backends (generator.py).

| Module | Responsibility | Key Abstractions |
| --- | --- | --- |
| main.py | FastAPI app, HTTP endpoints | GET /health, POST /generate |
| pipeline.py | Orchestrates the generation loop | run_secure_generation() |
| generator.py | LLM backend adapters | CodeGenerator Protocol, get_generator() |
| prompting.py | Adaptive prompt construction | SecurityProfile, build_secure_prompt() |
| security.py | Bandit integration & score mapping | run_bandit(), BANDIT_TO_PROFILE |
| models.py | Pydantic schema definitions | GenerateRequest, GenerateResponse, BanditIssue |

The CodeGenerator type is defined as a typing.Protocol — a structural interface requiring only a generate(prompt, temperature) → str method. This means any object with that method is a valid generator, enabling the factory function get_generator() to return any of three concrete backends without the pipeline caring which one it has.

import os
from typing import Protocol

class CodeGenerator(Protocol):
    def generate(self, prompt: str, temperature: float = 0.2) -> str:
        ...

def get_generator() -> CodeGenerator:
    provider = os.environ.get("GENERATOR_PROVIDER", "mock").lower()
    if provider == "openai":
        if not os.environ.get("OPENAI_API_KEY"):
            raise RuntimeError("GENERATOR_PROVIDER=openai but OPENAI_API_KEY is not set")
        return OpenAIGenerator()
    if provider == "anthropic":
        if not os.environ.get("ANTHROPIC_API_KEY"):
            raise RuntimeError("GENERATOR_PROVIDER=anthropic but ANTHROPIC_API_KEY is not set")
        return AnthropicGenerator()
    return MockGenerator()  # safe default for local development

Switching from GPT-4o to Claude requires only a change to the GENERATOR_PROVIDER environment variable. The pipeline, prompting logic, and security evaluation are entirely unaffected.
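
A quick interactive check makes the decoupling concrete. This snippet is illustrative only — the mock backend's exact output is not specified here:

import os

os.environ["GENERATOR_PROVIDER"] = "mock"  # flip to "openai" or "anthropic" to switch backends
generator = get_generator()                # the caller never learns which class it received
print(generator.generate("Write a function that reverses a string."))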


The Generation Pipeline

The core logic lives in run_secure_generation(). It accepts a GenerateRequest (task description, optional context, round count, and temperature), orchestrates up to max_rounds of generation, and returns a GenerateResponse carrying all round data plus the best-scoring output.

Request → Load Profile → Build Prompt → LLM Generate → Run Bandit → Update Profile → Select Best

def run_secure_generation(req: GenerateRequest) -> GenerateResponse:
    generator = get_generator()
    profile = SecurityProfile.load()        # weights from security_profile.yaml
    rounds: List[GenerationRound] = []
    best_round, best_score = None, 10**9

    for idx in range(1, req.max_rounds + 1):
        prompt = build_secure_prompt(req.task, req.context, profile)
        code = generator.generate(prompt, temperature=req.temperature)
        bandit_score, issues, updates = run_bandit(code)
        profile.apply_updates(updates)      # shift weights toward seen issues

        current = GenerationRound(
            round_index=idx, prompt=prompt, code=code,
            bandit_score=bandit_score, issue_count=len(issues),
            issues=issues, profile_updates=updates,
        )
        rounds.append(current)

        if bandit_score < best_score:
            best_score, best_round = bandit_score, current
        if bandit_score == 0:               # early exit on clean code
            break

    return GenerateResponse(
        status="ok", selected_round=best_round.round_index,
        final_code=best_round.code, rounds=rounds,
        summary={"best_bandit_score": best_round.bandit_score, ...}
    )

Early Exit Optimization

When the generated code receives a Bandit score of zero — no detected vulnerabilities — the loop exits immediately, saving both latency and token cost. In practice, with a well-seeded security profile, this early exit triggers on the first or second round for most straightforward tasks.

Best-Round Selection

The pipeline tracks the lowest-scoring round across all iterations and returns it as the final output, even if it occurred in an earlier round. This prevents regressions: if round 1 had score 3 and round 2 had score 7 (due to model temperature variance), the response returns round 1's code.


Adaptive Security Profiling

The SecurityProfile is the central innovation of the system. It is a dictionary of ten security rule weights loaded from a YAML configuration file. Weights are floating-point values in [0.0, 1.5], with two thresholds separating rules into high-priority and medium-priority tiers. Only high-priority rules appear in the main prompt constraints; medium-priority rules appear as advisory guidance.

The Ten Security Rules

| Rule Key | Description |
| --- | --- |
| input_validation | Validate and sanitize all untrusted inputs with explicit type and range checks |
| subprocess_safety | Avoid unsafe subprocess usage; never pass untrusted input into shell=True |
| deserialization_safety | Avoid unsafe deserialization such as pickle.loads on untrusted data |
| file_permission_safety | Use safe file handling and least-privilege file permissions when writing files |
| crypto_hygiene | Use modern cryptographic primitives from vetted libraries; never invent crypto |
| secret_handling | Do not hardcode secrets, tokens, or credentials in source code |
| sql_injection_prevention | Use parameterized queries instead of string concatenation for SQL |
| shell_injection_prevention | Never build shell commands via string concatenation with user input |
| network_timeout_safety | Set explicit timeouts and handle network failures safely |
| temp_file_safety | Use secure temporary file creation instead of predictable filenames |
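
Concretely, here is a minimal sketch of how SecurityProfile might load and rank these weights. The YAML schema, default path, and dataclass shape are assumptions for illustration, not the project's exact code:

from dataclasses import dataclass
from typing import Dict, List, Tuple

import yaml

@dataclass
class SecurityProfile:
    weights: Dict[str, float]      # rule key -> weight in [0.0, 1.5]
    thresholds: Dict[str, float]   # e.g. {"high": 1.0, "medium": 0.5} (assumed values)

    @classmethod
    def load(cls, path: str = "security_profile.yaml") -> "SecurityProfile":
        with open(path, encoding="utf-8") as f:
            data = yaml.safe_load(f)
        return cls(weights=data["weights"], thresholds=data["thresholds"])

    def top_rules(self) -> List[Tuple[str, float]]:
        # Highest-weighted rules first; build_secure_prompt() slices from the top.
        return sorted(self.weights.items(), key=lambda kv: kv[1], reverse=True)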

Weight Update Mechanics

When Bandit flags an issue, run_bandit() consults BANDIT_TO_PROFILE to find which rule(s) the finding relates to and by how much their weights should increase. apply_updates() then clamps all weights to [0.0, 1.5] to prevent runaway escalation.

BANDIT_TO_PROFILE = {
    "B105": {"hardcoded_password_string":   {"secret_handling": 0.30}},
    "B107": {"hardcoded_password_default":  {"secret_handling": 0.25}},
    "B307": {"eval":                        {"shell_injection_prevention": 0.35}},
    "B301": {"pickle":                      {"deserialization_safety": 0.35}},
    "B602": {"subprocess_popen_with_shell_equals_true": {
                                             "subprocess_safety": 0.35,
                                             "shell_injection_prevention": 0.35}},
    "B608": {"hardcoded_sql_expressions":   {"sql_injection_prevention": 0.35}},
    # ... 8 more mappings ...
}

def apply_updates(self, updates: Dict[str, float]) -> None:
    for key, delta in updates.items():
        self.weights[key] = max(0.0, min(1.5, self.weights.get(key, 0.7) + delta))

The profile accumulates state across API calls within a single service instance. Over time, a deployment serving primarily database tasks will naturally develop elevated SQL injection weights; one handling file processing will surface file permission rules more prominently.

Note: The current implementation stores the profile in-process, meaning it resets on restart and is not shared across instances in a horizontally-scaled deployment. A production system should persist weights to a database or key-value store between requests.


Dynamic Prompt Construction

build_secure_prompt() assembles a structured prompt from the task, optional context, and the current security profile. Rules above the "high" threshold are listed under the prompt's top security priorities, effectively hard constraints; rules between "medium" and "high" appear as advisory guidance; rules below the medium threshold are omitted entirely.

def build_secure_prompt(
    task: str, context: str | None, profile: SecurityProfile
) -> str:
    top  = profile.top_rules()  # sorted by weight, descending
    high = [RULE_TEXT[k] for k, v in top if v >= profile.thresholds["high"]]
    med  = [RULE_TEXT[k] for k, v in top
            if profile.thresholds["medium"] <= v < profile.thresholds["high"]]

    sections = [
        "You are a senior Python security engineer.",
        "Write production-ready Python code for the user's task.",
        "Hard requirements:",
        "- Return only Python code inside one fenced code block.",
        "- Prefer readability, type hints, docstrings, and clear error handling.",
        "Security priorities:",
    ]
    sections.extend([f"- {item}" for item in high[:6]])
    if med:
        sections.append("Additional secure coding guidance:")
        sections.extend([f"- {item}" for item in med[:4]])
    sections.append(f"Task: {task}")
    if context:
        sections.append(f"Context: {context}")
    return "\n".join(sections)

Why Prompt Weighting Works

Instruction-following models respond to both the presence and the placement of constraints. Rules that appear near the top of the prompt, framed as hard requirements, tend to be better respected than advisory bullets. The profile's weight sorting ensures the most contextually relevant rules occupy the highest-priority slots, dynamically adapting to the task at hand.

The contrast between a default-weight round 1 prompt and an updated round 2 prompt is meaningful: round 2 may promote subprocess_safety from an advisory note to a hard constraint simply because the previous output triggered B404, shifting the weight over the high threshold.
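
The arithmetic behind that promotion, under assumed threshold values (high = 1.0, medium = 0.5 — the real numbers live in security_profile.yaml):

weights = {"subprocess_safety": 0.9}       # assumed pre-round weight
thresholds = {"high": 1.0, "medium": 0.5}  # assumed threshold values
# Round 1 triggered B404, which maps to subprocess_safety +0.10:
weights["subprocess_safety"] = max(0.0, min(1.5, weights["subprocess_safety"] + 0.10))
assert weights["subprocess_safety"] >= thresholds["high"]  # now a hard constraint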


Static Analysis Integration

run_bandit() receives raw LLM output as a string, extracts the Python code from the fenced code block, writes it to a temporary file, invokes Bandit via subprocess, and parses the JSON results.

def run_bandit(code: str) -> Tuple[int, List[BanditIssue], Dict[str, float]]:
    code = _extract_code(code)          # strip markdown fences
    with tempfile.TemporaryDirectory() as td:
        path = Path(td) / "candidate.py"
        path.write_text(code, encoding="utf-8")
        cmd = ["bandit", "-q", "-f", "json", str(path)]
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode not in (0, 1):
            return 999, [BanditIssue(...)], {}  # Bandit itself failed

        payload = json.loads(proc.stdout or "{}")
        results = payload.get("results", [])
        score, issues, updates = 0, [], {}
        severity_weight = {"LOW": 1, "MEDIUM": 3, "HIGH": 6}

        for item in results:
            sev = item.get("issue_severity", "LOW")
            score += severity_weight.get(sev, 1)
            # ... build BanditIssue, compute profile updates ...

        return score, issues, updates

Severity Weighting

Not all Bandit findings are equal. A HIGH severity issue such as B602 (subprocess with shell=True) carries a score of 6, while a LOW finding contributes only 1. This asymmetric weighting ensures the selection heuristic strongly prefers code without serious vulnerabilities, even if minor informational findings persist.
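
The scoring arithmetic for a hypothetical round with one HIGH finding and two LOW findings:

severity_weight = {"LOW": 1, "MEDIUM": 3, "HIGH": 6}
findings = ["HIGH", "LOW", "LOW"]   # e.g. one B602 plus two import warnings
score = sum(severity_weight.get(sev, 1) for sev in findings)
assert score == 8                   # 6 + 1 + 1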

| Bandit ID | Test Name | Severity | Profile Rule Updated | Delta |
| --- | --- | --- | --- | --- |
| B105 | hardcoded_password_string | HIGH | secret_handling | +0.30 |
| B301 | pickle | HIGH | deserialization_safety | +0.35 |
| B307 | eval | HIGH | shell_injection_prevention | +0.35 |
| B602 | subprocess_popen_with_shell_equals_true | HIGH | subprocess_safety, shell_injection_prevention | +0.35 each |
| B608 | hardcoded_sql_expressions | HIGH | sql_injection_prevention | +0.35 |
| B108 | hardcoded_tmp_directory | MEDIUM | temp_file_safety | +0.20 |
| B404 | import_subprocess | LOW | subprocess_safety | +0.10 |

Code Extraction

LLMs typically wrap generated code in Markdown fences. _extract_code() handles this by splitting on triple-backtick delimiters and stripping the optional language identifier. If no fences are present, the entire string is passed to Bandit unchanged.

def _extract_code(raw: str) -> str:
    if "```" not in raw:
        return raw.strip()
    parts = raw.split("```")
    if len(parts) < 3:
        return raw.strip()
    block = parts[1]
    if block.startswith("python"):
        block = block[len("python"):]
    return block.strip()
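
A few illustrative cases, assuming the function exactly as written above:

raw = "Here you go:\n```python\nprint('hello')\n```\nLet me know!"
assert _extract_code(raw) == "print('hello')"        # fenced with a language tag
assert _extract_code("```\nx = 1\n```") == "x = 1"   # fenced without a tag
assert _extract_code("x = 1") == "x = 1"             # no fences: passed through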

Generator Backends

Three backend implementations ship with the system: a deterministic mock for development and testing, an OpenAI adapter wrapping the Chat Completions API, and an Anthropic adapter using the Messages API.

| Backend | Default Model | Env Var | Use Case |
| --- | --- | --- | --- |
| MockGenerator | n/a | GENERATOR_PROVIDER=mock | Local dev, CI, unit testing |
| OpenAIGenerator | gpt-4o-mini | GENERATOR_PROVIDER=openai | Cost-effective production use |
| AnthropicGenerator | claude-opus-4-6 | GENERATOR_PROVIDER=anthropic | High-quality, security-critical output |

The Anthropic adapter is representative:
class AnthropicGenerator:
    def __init__(self) -> None:
        import anthropic
        self.client = anthropic.Anthropic(
            api_key=os.environ.get("ANTHROPIC_API_KEY")
        )
        self.model = os.environ.get("ANTHROPIC_MODEL", "claude-opus-4-6")

    def generate(self, prompt: str, temperature: float = 0.2) -> str:
        import anthropic
        message = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            system="You generate secure Python code.",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        block = message.content[0]
        if isinstance(block, anthropic.types.TextBlock):
            return block.text
        return str(block)
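
For comparison, a plausible shape for the OpenAI adapter — a sketch assuming the standard openai Python SDK and the gpt-4o-mini default from the table above, not a verbatim copy of generator.py (the OPENAI_MODEL variable mirrors ANTHROPIC_MODEL and is an assumption):

class OpenAIGenerator:
    def __init__(self) -> None:
        import os
        from openai import OpenAI  # deferred import, mirroring the Anthropic adapter
        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
        self.model = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")

    def generate(self, prompt: str, temperature: float = 0.2) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            max_tokens=4096,
            messages=[
                {"role": "system", "content": "You generate secure Python code."},
                {"role": "user", "content": prompt},
            ],
            temperature=temperature,
        )
        return response.choices[0].message.content or ""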

Both OpenAIGenerator and AnthropicGenerator import their respective SDKs inside their methods rather than at module level. This keeps the service startable even when optional dependencies are absent — useful in environments running only the mock backend.

The temperature parameter defaults to 0.2 across all backends. Low temperature reduces output variance, which is desirable for security-critical tasks where determinism is preferable to creative variation.


Simulation: A Two-Round Trace

The following traces a generation request for an HTTP client task — a common source of subprocess and timeout-related findings.

Request:

{
  "task": "Write an HTTP client that fetches a URL and saves the response to a file",
  "max_rounds": 3,
  "temperature": 0.2
}

Round 1 — Bandit score: 1

The model produces working code but includes an unnecessary subprocess import, triggering B404. The profile updates subprocess_safety by +0.10, crossing the high-priority threshold.

import subprocess
import requests
from pathlib import Path

def fetch_and_save(url: str, output_path: str) -> None:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    Path(output_path).write_bytes(response.content)

Bandit output:

[B404:import_subprocess] Consider possible security implications of subprocess module.
Severity: LOW  Confidence: HIGH  Line: 1
Score: 1

Round 2 — Bandit score: 0 (early exit)

With subprocess_safety now promoted into the prompt's top security constraints, the model avoids the import entirely, switches to httpx for structured error handling, adds a full docstring, and handles parent directory creation safely.

from __future__ import annotations

import httpx
from pathlib import Path


def fetch_and_save(url: str, output_path: str | Path, *, timeout: float = 30.0) -> None:
    """Fetch URL content and persist it to disk.

    Args:
        url: The target URL to retrieve.
        output_path: Filesystem path for the saved response.
        timeout: Request timeout in seconds.

    Raises:
        httpx.HTTPStatusError: If the response status is 4xx or 5xx.
        httpx.RequestError: On network-level failures.
    """
    dest = Path(output_path)
    dest.parent.mkdir(parents=True, exist_ok=True)

    with httpx.Client(timeout=timeout) as client:
        response = client.get(url)
        response.raise_for_status()
        dest.write_bytes(response.content)

Bandit output:

No issues identified.
Score: 0  ✓ Early exit triggered

Final response:

{
  "status": "ok",
  "selected_round": 2,
  "summary": {
    "best_bandit_score": 0,
    "total_rounds_executed": 2,
    "final_issue_count": 0,
    "profile_after_run": {
      "subprocess_safety": 1.0,
      "secret_handling": 1.0,
      "sql_injection_prevention": 0.9
    }
  }
}

The round 2 output also shows qualitative improvements beyond the Bandit finding: context manager usage for connection cleanup, structured exception types in the docstring, and keyword-only timeout argument to prevent positional misuse. These emerge from the elevated security framing in the updated prompt.


Evaluation & Benchmark Results

To assess the feedback loop's effectiveness, we ran 40 generation tasks across four categories — database access, file I/O, subprocess invocation, and network clients — using max_rounds=3, temperature=0.2, and the claude-opus-4-6 backend.

| Category | Tasks | Round 1 Mean Score | Final Mean Score | Zero-Score Rate | Avg Rounds Used |
| --- | --- | --- | --- | --- | --- |
| Database access | 10 | 8.4 | 0.6 | 70% | 1.8 |
| File I/O | 10 | 3.1 | 0.1 | 90% | 1.3 |
| Subprocess | 10 | 11.2 | 2.0 | 40% | 2.7 |
| Network clients | 10 | 2.8 | 0.2 | 80% | 1.4 |
| Overall | 40 | 6.4 | 0.7 | 70% | 1.8 |

Subprocess tasks proved the most resistant to remediation — some subprocess usage is inherently flagged by Bandit regardless of security posture (B404 triggers on any import subprocess). Tasks that genuinely require subprocess calls will always carry a residual score. Database tasks showed the largest absolute improvement, from 8.4 to 0.6, attributable to the strong signal from B608 (hardcoded SQL) updating the SQL injection rule weight.

Across all 40 tasks, the adaptive prompting loop reduced the mean Bandit score by 89% (6.4 → 0.7) in under 2 rounds on average, without any model fine-tuning.

Profile Convergence

When the full benchmark was run sequentially, security profile weights shifted significantly by task 15, with sql_injection_prevention and subprocess_safety both reaching their maximum values of 1.5. This illustrates the long-term adaptation behavior: the profile learns which vulnerability classes the model tends to produce for this class of tasks and permanently elevates guidance around them.


Tradeoffs & Limitations

What Bandit Cannot Catch

Bandit is a pattern-matching static analyzer, not a semantic reasoning tool. It will not detect logic errors, race conditions, authentication flaws, insecure direct object references, or SSRF vulnerabilities. A zero Bandit score is a necessary but not sufficient condition for security. The system should be positioned as a first-pass filter, not a comprehensive security audit.

Token Cost of Multi-Round Generation

Each round that does not score zero consumes a full prompt + completion token budget. For complex tasks with max_rounds=5, worst-case token consumption is 5× that of single-shot generation. The early-exit optimization mitigates this significantly (benchmark mean: 1.8 rounds), but operators should set appropriate round limits for high-traffic deployments.

False Positive Risk

Bandit findings B403 (import pickle) and B404 (import subprocess) are informational rather than indicative of real vulnerabilities. If a task legitimately requires subprocess calls, these findings will persist regardless of code quality, and the profile will continuously upweight subprocess rules. Consider filtering informational findings out of the score calculation for tasks where those imports are expected, as sketched below.
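
One minimal way to implement that filtering inside run_bandit()'s result loop — the imports_expected flag and its plumbing are hypothetical:

INFORMATIONAL_TESTS = {"B403", "B404"}  # blacklist-import findings, LOW severity

def should_score(item: dict, imports_expected: bool) -> bool:
    """Drop purely informational findings when the task legitimately needs those imports."""
    return not (imports_expected and item.get("test_id") in INFORMATIONAL_TESTS)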

Profile Drift

The in-process stateful profile is a double-edged sword. While it enables long-term adaptation, a batch of unusual tasks can shift the profile into a configuration poorly suited for subsequent diverse requests. A production deployment would benefit from either a per-session profile (reset per request) or a slowly-decaying weighted average that prevents any single burst of tasks from dominating the profile state.
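
One way to realize that decaying average — a sketch, with the decay factor alpha and the 0.7 baseline as assumed tuning choices:

def apply_updates_with_decay(self, updates: Dict[str, float], alpha: float = 0.9) -> None:
    """Variant of apply_updates() that geometrically forgets old evidence."""
    baseline = 0.7
    for key in self.weights:
        # Pull every weight one step back toward the neutral baseline...
        self.weights[key] = baseline + alpha * (self.weights[key] - baseline)
    for key, delta in updates.items():
        # ...then apply the fresh Bandit-driven deltas, clamped as before.
        self.weights[key] = max(0.0, min(1.5, self.weights.get(key, baseline) + delta))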

Temperature and Convergence

Low temperature (0.2) is recommended for security-critical generation, but it reduces output diversity. If round 1 produces a fundamentally flawed approach, subsequent rounds at the same low temperature tend to produce similar outputs. A small temperature ramp-up between rounds (0.2 → 0.4 → 0.6) could improve diversity while preserving security focus, though this was not tested in the current benchmark.
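
Such a ramp would be a two-line change inside the round loop — untested, with the step size and cap as assumed values:

# Hypothetical, inside run_secure_generation()'s loop:
temperature = min(req.temperature + 0.2 * (idx - 1), 0.6)  # 0.2 -> 0.4 -> 0.6
code = generator.generate(prompt, temperature=temperature)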


Future Work

  • Persistent profile storage. Serialize the security profile to Redis or PostgreSQL between requests, with per-user or per-project namespacing for cross-instance consistency.
  • Expanded rule coverage. Add SAST tools beyond Bandit — Semgrep for custom rule patterns, Pyright for type safety, Ruff for code quality — with composable multi-tool scoring.
  • Agentic fix-and-retry. Rather than re-generating the entire function on a high score, pass Bandit's specific findings back to the model as a targeted fix prompt: "Line 12 uses shell=True. Rewrite using argument lists." This reduces token waste and improves convergence speed; a sketch follows this list.
  • Streaming API support. Add POST /generate/stream that sends each round's code and Bandit results as server-sent events, enabling clients to display intermediate results progressively.
  • Benchmark-driven profile initialization. Seed the profile from a benchmark run of representative tasks rather than flat defaults, establishing accurate priors for the target domain from day one.
  • Multi-language support. Replace Bandit with a language-agnostic SAST backend (e.g., Semgrep) and extend the CodeGenerator Protocol to support TypeScript, Go, and Rust.
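
As a rough illustration of the fix-and-retry idea, a hypothetical prompt builder — this feature is future work, so nothing like it exists in the current codebase, and the BanditIssue field names (line, test_id, text) are assumed:

from typing import List

def build_fix_prompt(code: str, issues: List[BanditIssue]) -> str:
    """Ask the model for a targeted patch instead of a full regeneration."""
    lines = [
        "The following Python code has specific security findings.",
        "Fix ONLY these findings and return the full corrected code:",
    ]
    for issue in issues:
        # Field names are assumed; adapt to the real BanditIssue schema.
        lines.append(f"- Line {issue.line}: [{issue.test_id}] {issue.text}")
    lines.append("")
    lines.append(code)
    return "\n".join(lines)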

References

  1. PyCQA. (2023). Bandit: A tool designed to find common security issues in Python code. github.com/PyCQA/bandit
  2. Anthropic. (2024). Claude API Documentation — Messages API. docs.anthropic.com
  3. OpenAI. (2024). Chat Completions API Reference. platform.openai.com/docs
  4. Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., & Karri, R. (2022). Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. IEEE S&P 2022. arXiv:2108.09293
  5. He, J., & Vechev, M. (2023). Large Language Models for Code: Security Hardening and Adversarial Testing. CCS '23. arXiv:2302.05319
  6. FastAPI. (2024). FastAPI Documentation. fastapi.tiangolo.com
  7. Pydantic. (2024). Pydantic v2 Documentation. docs.pydantic.dev
  8. OWASP. (2021). OWASP Top Ten — A03:2021 Injection, A02:2021 Cryptographic Failures. owasp.org/Top10