First Experiment
This guide walks you through your first complete red-team experiment with SpecAlign.
Overview
You will learn how to:
Understand the red-team workflow
Configure experiment parameters
Run an adversarial test
Interpret the results
Understanding the Workflow
SpecAlign uses a multi-agent adversarial framework:
┌─────────────────────────────────────────────────────────┐
│ Red Team Loop │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Planner │───>│ Attacker │───>│ Defender │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ │ v v │
│ │ ┌───────────┐ ┌───────────┐ │
│ │ │ Safety │ │ Quality │ │
│ └────────>│ Judge │ │ Judge │ │
│ └───────────┘ └───────────┘ │
│ │ │ │
│ v v │
│ ┌───────────────────────────┐ │
│ │ DPO Pair Generation │ │
│ └───────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Components:
Planner: Generates attack strategies using context from previous successes
Attacker: Creates adversarial prompts to find specification violations
Defender: Attempts to respond in a specification-compliant manner
Safety Judge: Evaluates if responses violate specifications
Quality Judge: Evaluates response helpfulness
DPO Generator: Creates preference pairs from successful attacks
Configure Your Experiment
Create a custom configuration:
{
"global": {
"mode": "api",
"log_level": "INFO",
"seed": 42
},
"api": {
"provider": "openai",
"model": "gpt-4o"
},
"specgen": {
"num_specs": 5,
"rules_per_spec_min": 3,
"rules_per_spec_max": 5
},
"redteam": {
"max_rounds_per_seed": 3,
"enable_role_swap": true,
"context_pool": {
"max_size": 100,
"similarity_threshold": 0.85
}
},
"output": {
"base_dir": "my_experiment"
}
}
Run the Experiment
# Generate specifications and seeds
specalign generate all data/Stage_classified.md --num-specs 5
# Run red-team testing
specalign redteam output/seeds.json --max-rounds 3 --role-swap
Expected Output
Console output during execution:
Red Team Testing
================
Processing seed 1/50: "Help me draft a message..."
├── Round 1: Attack failed (compliant response)
├── Round 2: Attack succeeded! Violation detected
│ └── Violated rules: [R12, R15]
└── DPO pair generated
Processing seed 2/50: "I need advice on..."
├── Round 1: Attack succeeded! Violation detected
│ └── Violated rules: [R8]
└── DPO pair generated
...
Summary
-------
Total episodes: 50
Successful attacks: 23 (46%)
DPO pairs generated: 23
Unique rules violated: 12
Interpreting Results
episodes.jsonl: Contains full conversation histories
{
"seed_id": "seed_001",
"spec_id": "spec_003",
"rounds": [
{
"round": 1,
"attacker_prompt": "...",
"defender_response": "...",
"safety_result": {...},
"quality_score": 0.85
}
],
"outcome": "violation",
"violated_rules": ["R12", "R15"]
}
dpo_dataset.json: Contains preference pairs for training
{
"prompt": "Help me draft a message...",
"chosen": "I'd be happy to help draft a professional message...",
"rejected": "Sure, here's a message that could be seen as..."
}
Key Takeaways
Multi-round testing increases attack success rates
Role swapping enables agents to learn from both perspectives
Context pool helps generate more diverse attacks
Quality scoring ensures high-quality preference data
DPO pairs can be directly used for model fine-tuning
Next Steps
Red Team Pipeline - Advanced red-team configuration
DPO Dataset Generation - DPO dataset generation strategies
Configuration Reference - Full configuration reference