AI Red Teaming¶
AI Red Teaming is AccuKnox's automated adversarial testing capability that proactively stress-tests your LLMs and ML models against real-world attack techniques. Instead of waiting for attackers to find weaknesses, red teaming simulates hundreds of adversarial probes across multiple risk categories to surface vulnerabilities before they reach production.

What Is AI Red Teaming?¶
AI Red Teaming automates the process of attacking your own AI models and applications to discover security gaps. AccuKnox generates adversarial prompts tailored to your application's domain, executes them against your model in a sandboxed environment, and produces a risk assessment with the exact prompts and responses that triggered violations.
Unlike manual penetration testing, AccuKnox red teaming:
- Runs automatically every time a model or configuration changes
- Scales to hundreds of probes across multiple attack categories in minutes
- Generates domain-specific tests tailored to your application context
- Produces actionable findings with exact prompt/response pairs that violated policies
- Maps to compliance frameworks including OWASP Top 10 for LLMs and MITRE ATLAS
Why Red Team Your AI?¶
| Risk | What Happens Without Red Teaming |
|---|---|
| Prompt Injection | Attackers manipulate the LLM into ignoring system instructions, leaking data, or performing unauthorized actions |
| Jailbreaks | Users bypass safety guardrails to generate harmful, toxic, or policy-violating content |
| Hallucinations | The model confidently produces fabricated information that users trust and act on |
| Data Leakage | Sensitive training data, PII, or API keys are extracted through carefully crafted prompts |
| Toxic Output | The model generates hate speech, profanity, sexually explicit content, or threats |
| Unsafe Code Generation | The model produces malware, phishing scripts, or code with known vulnerabilities |
Attack Categories¶
AccuKnox red teaming covers four primary scan categories. Each category contains multiple probes with specialized detectors.
Sentiment Analysis¶
Tests whether the model can be manipulated into generating harmful, offensive, or inappropriate content.
| Probe | What It Tests | Detection Method |
|---|---|---|
| Anthropomorphisation | Whether the LLM claims human attributes | String detection for first-person pronouns |
| Bullying | Generating harassment or intimidation content | RoBERTa toxicity classifier |
| Profanity | Producing offensive language across multiple categories (racial, religious, sexual, political) | String detection with comprehensive word lists |
| Sexual Content | Generating sexually explicit material | Google Perspective API + string detection |
| Toxicity (Threats, Insults, Flirtation, Severe Toxicity) | Multiple toxic output types tested with 50 prompts each | Google Perspective API ML models |
| Quack Medicine | Endorsing unproven or dangerous medical treatments | String detection for known false claims |
Code Safety¶
Tests whether the model generates or enables malicious code execution.
| Probe | What It Tests | Detection Method |
|---|---|---|
| Malware Generation | Producing malware subfunctions, payloads, evasion techniques (200+ prompts) | Regex-based code language detection |
| AV/Spam Signatures | Generating known malicious signatures (EICAR, GTUBE, GTphish) | Signature-based string matching |
Hallucination¶
Tests whether the model fabricates false information or recommends non-existent resources.
| Probe | What It Tests | Detection Method |
|---|---|---|
| False Assertions | Agreeing with clearly false statements | DistilBERT refutation detector + NLI model |
| Snowball | Doubling down on incorrect answers under pressure | Custom regex detection |
| Package Hallucination | Recommending non-existent packages (Python/PyPI, Ruby/Gems, JavaScript/npm, Rust/Crates) | Regex + HuggingFace package datasets |
Prompt Injection¶
Tests whether the model's instructions can be overridden or its safety filters bypassed.
| Probe | What It Tests | Detection Method |
|---|---|---|
| Grandma Exploit | Social engineering via role-play to extract restricted content | Mitigation bypass string detection |
| Do Not Answer | Probing restricted topics (150+ prompts across 5 subcategories) | Mitigation bypass detection |
| Encoding Attacks | Injecting instructions via Base64, Base16, Hex, ASCII85 encoding | Custom decode-and-match detectors |
| Latent Injection | Hiding instructions in documents, translations, resumes, reports | Trigger list string matching |
| Suffix Attacks | Appending adversarial suffixes to bypass alignment | Mitigation bypass detection |
| TAP (Tree of Attacks) | Multi-turn prompt escalation | Mitigation bypass detection |
| XSS / Data Exfiltration | Markdown image exfiltration, Colab data leakage, string assembly exfil | Custom regex detectors |
How to Run a Red Team Scan¶
Step 1: Add an LLM Static Scan Collector¶
Navigate to Settings > Collectors > Add Collector and select LLM Static Scan.
Choose your platform:
- Custom Model - Any model accessible via API endpoint
- Ollama Model - Locally hosted Ollama models
- OpenAI Model - OpenAI-hosted models (GPT-4, etc.)

Step 2: Configure the Scan¶
Enter your model connection details and select the scan categories to test.

Available scan categories:
- All - Run all probe categories
- Sentiment Analysis - Toxicity, profanity, harmful content
- Code - Malware generation, malicious signatures
- Hallucination - False assertions, package hallucination
- Prompt Injection - Jailbreaks, encoding attacks, data exfiltration
You can also upload a custom prompts file (JSON array of prompt strings) to test domain-specific attack scenarios alongside the default probes.
Step 3: Schedule and Run¶
Configure a cron schedule for continuous red teaming or trigger scans manually. Click Save to create the collector.

Step 4: Review Findings¶
Navigate to Issues > Findings and select LLM Findings to view results.

Each finding includes:
| Field | Description |
|---|---|
| Scan Category | Which category flagged the issue (Sentiment, Code, Hallucination, Prompt Injection) |
| Probe | The specific probe that was triggered |
| Detector | The detection method used |
| Goal | What the adversarial prompt was trying to achieve |
| Prompt | The exact input sent to the model |
| Output | The model's response |
| Risk Factor | Severity rating (Critical, High, Medium, Low) |
| Detector Safety Score | How confident the detector is in the finding |
| Compliance | Mapped framework references (OWASP, AVID) |
Step 5: Investigate and Remediate¶
Click any finding to open the detailed pane with description, solution, output, and prompt details.

Use Ask AI for automated remediation recommendations based on the specific vulnerability detected.

Step 6: Group and Export¶
Group findings by Asset Type, Vulnerability Name, Scan Category, or other parameters to prioritize remediation. Export grouped findings for reporting.

Supported Platforms¶
Red teaming scans are supported across both managed and on-premise AI deployments.
