AI Security for Executives Part 7: System Prompt Leakage
Keep secrets out of your LLM!
Executive Summary
Let's cut to the chase: DON'T PUT SECRETS IN A SYSTEM PROMPT!
A system prompt is a set of instructions given to a large language model before it begins interacting with users. It defines how the LLM should behave, respond, and approach different types of requests. System prompts are not normally revealed to the user, but they are difficult to secure and vulnerable to hacking. They aren't secrets.
System prompt leakage can reveal business rules and security boundaries that enable attacks. While the leaked prompt text itself isn't inherently dangerous, it exposes operational logic that attackers could exploit.
Executive actions needed:
Store secrets outside of the LLM. Verify this in your systems. Believe it or not, it's conceptually possible (if insane) to store an API key or other secret in a system prompt; after all, a prompt is just text. Do not under any circumstances let your teams or vendors do anything so crazy.
Keep security independent of the LLM. Implement deterministic authorization checks and security controls that operate independently of LLM decisions.
Monitor, monitor, monitor. Log all LLM system inputs and outputs, with automated alerts for prompt-extraction attempts and unusual patterns.
A Scenario
The following scenario is fictional but realistic, and it illustrates the point.
Peninsula Trust Bank's Fraud Detection department noticed the pattern during a review of alerts. Small personal loans, all under $2,000, were being approved at three times the normal rate. Nothing individually suspicious, but the aggregate trend had triggered the team's monitoring systems.
"It's probably nothing," thought Jake Morrison, junior fraud analyst. "But I'll run the correlations anyway."
After three weeks of investigation, Morrison noticed every fraudulent loan correlated precisely with customer service chatbot interactions. Worse: the money was being deposited into accounts and drained within 24 hours of loan approval.
The breakthrough came when Morrison was reviewing system logs at 2 AM, fueled by too much coffee and stubborn curiosity. Buried in thousands of chat transcripts, he found evidence that someone had successfully extracted the bank's AI customer service system prompt eight weeks before the loan spike began.
That system prompt contained a single, fatal line: "Loans under $2,000 can be auto-approved without manager intervention if customer account is in good standing and request occurs during normal business hours."
Total amount of fraudulent loans: $340k. The bank was lucky to have the diligent Morrison.
About This Series
This series is written for C-suite executives who are making critical AI investment decisions while the security implications are still emerging.
I structured this series around the Open Web Application Security Project (OWASP) Top 10 for Large Language Model Applications, because that list represents the consensus view of leading security researchers on the most critical AI risks.
This series provides an educational overview rather than specific security advice, since AI security is a rapidly evolving field requiring expert consultation. The goal is to give you the knowledge to ask the right questions of your teams and vendors.
Executive Action Plan
Don't store secrets in any prompt; store secrets outside of the LLM
Surely your IT technical leads wouldn't make this heinous mistake, but make sure. Schedule pressure can drive poor technical decisions. Some examples of what not to store in any prompt: database connection strings or API credentials, business approval thresholds or transaction limits, user role definitions, and details about the system architecture.
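A minimal sketch of the right pattern, in Python and with hypothetical names (DB_CONNECTION_STRING is an assumed environment variable; the bank is the one from the scenario above): secrets and thresholds come from the environment or a secrets manager, and the prompt carries nothing worth stealing.

import os

# Secrets and business rules live outside the model: environment variables,
# a secrets manager, or a configuration service -- never the prompt itself.
DB_CONNECTION = os.environ["DB_CONNECTION_STRING"]  # injected at deploy time
LOAN_AUTO_APPROVE_LIMIT = 2000  # enforced in application code, not stated in the prompt

# The system prompt carries only tone and task guidance, nothing an attacker
# profits from reading.
SYSTEM_PROMPT = (
    "You are a polite customer-service assistant for Peninsula Trust Bank. "
    "Answer product questions and route loan requests to the loan workflow. "
    "You cannot approve, deny, or promise any transaction."
)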
Layer security: your LLM is not a security tool
LLMs are probabilistic systems. They generate responses based on statistical patterns, not deterministic logic. This makes them fundamentally unsuitable for security decisions that require consistent, predictable outcomes. Instead of relying on the LLM to enforce security policies, implement traditional security controls that operate independently of the LLM.
Access control should be managed by your identity and authorization systems, not by instructing the LLM about who can see what data. Least privilege principles must be enforced at the infrastructure level. If an LLM shouldn't access certain databases or APIs, remove that access entirely. Authentication and authorization decisions should flow through dedicated security services that validate every request.
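To make that concrete, here is a minimal Python sketch with hypothetical names, reusing the $2,000 limit from the scenario: the approval decision is a deterministic function in application code, and the chatbot's output is treated as untrusted input to it.

from dataclasses import dataclass

@dataclass
class LoanRequest:
    customer_id: str
    amount: float

def authorize_loan(request: LoanRequest, account_in_good_standing: bool) -> bool:
    """Deterministic rule: the same inputs always produce the same answer.
    The LLM never sees or enforces this logic; it only gathers the request."""
    if not account_in_good_standing:
        return False
    if request.amount >= 2000:
        return False  # at or above the limit, a human manager must approve
    return True

Even if the prompt leaks or the chatbot is manipulated, every approval still flows through this check and through your identity and authorization systems, not through the model's goodwill.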
Monitor, monitor, monitor
Log all LLM system inputs and outputs: Every query, every response, every interaction needs to be captured and stored.
Alert patterns to watch for:
Repeated queries about system behavior or capabilities
Attempts to extract prompts through social engineering or technical manipulation
Unusual patterns in AI decision-making that might indicate compromised instructions
Have a human from Ops or IT audit your LLM's inputs and outputs on a regular basis. Automated monitoring catches obvious attacks; human review catches the subtle patterns that indicate sophisticated adversaries testing your boundaries. This technology is simply too new to "set it and forget it," as they say.
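A minimal sketch of what that logging and alerting can look like, assuming Python's standard logging module and a deliberately crude keyword heuristic; in practice these records would feed your SIEM and the patterns would be tuned over time.

import logging
import re

logger = logging.getLogger("llm_audit")

# Crude indicators of prompt-extraction attempts; extend and tune for your environment.
EXTRACTION_PATTERNS = [
    r"ignore (all|your) (previous|prior) instructions",
    r"(reveal|repeat|print) (your|the) (system )?prompt",
    r"what are your (hidden )?instructions",
]

def log_and_screen(user_id: str, prompt: str, response: str) -> None:
    """Capture every exchange and raise an alert on suspicious inputs."""
    logger.info("user=%s prompt=%r response=%r", user_id, prompt, response)
    for pattern in EXTRACTION_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            logger.warning("possible prompt-extraction attempt user=%s pattern=%r",
                           user_id, pattern)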
In Summary
System prompts define how an LLM should interact. They are intended to stay hidden, but they aren't secrets. A sound plan, in the most straightforward form I know, is this:
Keep secrets or confidential data out of system prompts.
Do not depend on the LLM for security. LLMs cannot be trusted for security because of their probabilistic nature. Implement independent, layered security controls.
Rigorously monitor, and regularly audit, all user input to the LLM and the output it produces.
Appendix 1: Developer Guidelines: Technical Implementation
These notes are more developer-oriented; they are included here so that you know what to ask your teams and vendors for.
Separate sensitive data from system prompts: Move credentials, connection strings, business rules, and user roles to external systems accessed through secure APIs.
Avoid reliance on prompts for strict behavior control: Use system prompts for tone and formatting, not for security policy enforcement or access control decisions.
Implement guardrails outside the LLM: Content filtering, access control, and transaction validation should occur in deterministic systems that inspect AI inputs and outputs (a sketch follows this list).
Ensure security controls are enforced independently from the LLM: Authorization, authentication, and audit logging must function correctly even if the AI component is compromised or behaving unexpectedly.
Use multiple agents with least privilege for different access levels: Instead of one AI system with complex internal permissions, deploy multiple specialized systems with minimal necessary access to data and functionality.
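To make the guardrail idea above concrete, here is a hypothetical Python sketch: the LLM proposes an action as structured data, and deterministic code decides whether that action is allowed before anything executes. The action names, schema, and limits are illustrative assumptions, not any particular product's API.

from typing import Any

ALLOWED_ACTIONS = {"answer_question", "open_loan_request", "escalate_to_human"}

def validate_llm_action(action: dict[str, Any]) -> dict[str, Any]:
    """Deterministic guardrail: the LLM proposes, this code disposes.
    Anything outside the allow-list is downgraded to human escalation."""
    if action.get("name") not in ALLOWED_ACTIONS:
        return {"name": "escalate_to_human", "reason": "unrecognized action"}
    if action.get("name") == "open_loan_request":
        try:
            amount = float(action.get("amount", 0))
        except (TypeError, ValueError):
            return {"name": "escalate_to_human", "reason": "malformed amount"}
        if amount <= 0 or amount >= 2000:
            return {"name": "escalate_to_human", "reason": "amount outside auto-approval range"}
    return action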
Appendix 2: Glossary
System Prompt: Hidden instructions that guide LLM behavior and responses, similar to a detailed job description that users never see but that shapes every LLM interaction.
Prompt Injection: Attacks that manipulate LLM instructions by embedding malicious commands in user inputs, causing the LLM to ignore its original instructions and follow attacker-supplied directions.
Deterministic Controls: Security measures that produce identical outputs for identical inputs, providing predictable and auditable results unlike probabilistic AI systems.
Probabilistic Systems: AI systems (for example, LLMs) that generate likely responses based on statistical patterns rather than guaranteed outcomes, making them unsuitable for critical security decisions that require consistent enforcement.
Large Language Model (LLM): An AI system trained on vast amounts of text data to predict and generate human-like text by learning statistical patterns in language.
