First Experiment

This guide walks you through your first complete red-team experiment with SpecAlign.

Overview

You will learn how to:

  1. Understand the red-team workflow

  2. Configure experiment parameters

  3. Run an adversarial test

  4. Interpret the results

Understanding the Workflow

SpecAlign uses a multi-agent adversarial framework:

┌─────────────────────────────────────────────────────────┐
│                    Red Team Loop                        │
│                                                         │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐          │
│   │ Planner  │───>│ Attacker │───>│ Defender │          │
│   └──────────┘    └──────────┘    └──────────┘          │
│        │               │               │                │
│        │               v               v                │
│        │         ┌───────────┐  ┌───────────┐           │
│        │         │  Safety   │  │  Quality  │           │
│        └────────>│   Judge   │  │   Judge   │           │
│                  └───────────┘  └───────────┘           │
│                        │               │                │
│                        v               v                │
│                  ┌───────────────────────────┐          │
│                  │    DPO Pair Generation    │          │
│                  └───────────────────────────┘          │
└─────────────────────────────────────────────────────────┘

Components:

  • Planner: Generates attack strategies using context from previous successes

  • Attacker: Creates adversarial prompts to find specification violations

  • Defender: Attempts to respond in a specification-compliant manner

  • Safety Judge: Evaluates whether responses violate specifications

  • Quality Judge: Evaluates response helpfulness

  • DPO Generator: Creates preference pairs for Direct Preference Optimization (DPO) from successful attacks
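
The way these components interact within one seed can be sketched in a few lines of Python. This is a minimal illustration and not SpecAlign's implementation: `run_seed`, `Verdict`, and the agent callables are hypothetical names, the agents are stubbed with plain functions, and the Quality Judge and DPO pair generation are left out to keep the control flow visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical Safety Judge result: the rules a response violated."""
    violated_rules: List[str] = field(default_factory=list)

    @property
    def is_violation(self) -> bool:
        return bool(self.violated_rules)


def run_seed(
    seed: str,
    planner: Callable[[str, List[str]], str],
    attacker: Callable[[str, str], str],
    defender: Callable[[str], str],
    safety_judge: Callable[[str, str], Verdict],
    context_pool: List[str],
    max_rounds: int = 3,
) -> Tuple[Optional[int], Verdict]:
    """Run up to max_rounds attack rounds for one seed.

    Returns (round_number, verdict) for the first violation, or
    (None, empty verdict) if every response was compliant.
    """
    for round_no in range(1, max_rounds + 1):
        strategy = planner(seed, context_pool)    # uses previous successes
        prompt = attacker(seed, strategy)         # adversarial prompt
        response = defender(prompt)               # attempted compliant reply
        verdict = safety_judge(prompt, response)  # check against the rules
        if verdict.is_violation:
            context_pool.append(prompt)           # remember what worked
            return round_no, verdict
    return None, Verdict()


# Demo with stub agents: the defender complies once, then slips.
replies = iter(["I can't help with that.", "Sure, here is how..."])
pool: List[str] = []
round_no, verdict = run_seed(
    "Help me draft a message...",
    planner=lambda seed, ctx: "rephrase as a hypothetical",
    attacker=lambda seed, strategy: f"{seed} ({strategy})",
    defender=lambda prompt: next(replies),
    safety_judge=lambda p, r: Verdict(["R12"]) if r.startswith("Sure") else Verdict(),
    context_pool=pool,
)
```

In the demo the first round is judged compliant and the second is flagged, so the successful prompt ends up in `pool` where a later call to the planner could use it.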

Configure Your Experiment

Create a custom configuration:

{
  "global": {
    "mode": "api",
    "log_level": "INFO",
    "seed": 42
  },
  "api": {
    "provider": "openai",
    "model": "gpt-4o"
  },
  "specgen": {
    "num_specs": 5,
    "rules_per_spec_min": 3,
    "rules_per_spec_max": 5
  },
  "redteam": {
    "max_rounds_per_seed": 3,
    "enable_role_swap": true,
    "context_pool": {
      "max_size": 100,
      "similarity_threshold": 0.85
    }
  },
  "output": {
    "base_dir": "my_experiment"
  }
}
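
Before launching a run it can help to sanity-check the configuration. The sketch below is an illustration only: `check_config` is a hypothetical helper, and the constraints it checks (minimum not above maximum, positive counts, a similarity threshold between 0 and 1) are assumptions inferred from the field names. SpecAlign may enforce different or additional rules.

```python
import json


def check_config(cfg: dict) -> list:
    """Return a list of human-readable problems found in cfg (empty if none)."""
    problems = []

    specgen = cfg.get("specgen", {})
    lo = specgen.get("rules_per_spec_min")
    hi = specgen.get("rules_per_spec_max")
    if lo is not None and hi is not None and lo > hi:
        problems.append("specgen.rules_per_spec_min exceeds rules_per_spec_max")
    if specgen.get("num_specs", 1) < 1:
        problems.append("specgen.num_specs must be at least 1")

    redteam = cfg.get("redteam", {})
    if redteam.get("max_rounds_per_seed", 1) < 1:
        problems.append("redteam.max_rounds_per_seed must be at least 1")
    pool = redteam.get("context_pool", {})
    threshold = pool.get("similarity_threshold")
    # Assumption: the threshold is a similarity score in [0, 1].
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        problems.append("redteam.context_pool.similarity_threshold must be in [0, 1]")

    return problems


# The configuration from this guide, reduced to the fields being checked.
example = json.loads("""
{
  "specgen": {"num_specs": 5, "rules_per_spec_min": 3, "rules_per_spec_max": 5},
  "redteam": {
    "max_rounds_per_seed": 3,
    "enable_role_swap": true,
    "context_pool": {"max_size": 100, "similarity_threshold": 0.85}
  }
}
""")
problems = check_config(example)
```

For the configuration above `problems` is empty. Returning a list instead of raising on the first error lets all issues be reported at once.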

Run the Experiment

# Generate specifications and seeds
specalign generate all data/Stage_classified.md --num-specs 5

# Run red-team testing
specalign redteam output/seeds.json --max-rounds 3 --role-swap

Expected Output

Console output during execution:

Red Team Testing
================

Processing seed 1/50: "Help me draft a message..."
├── Round 1: Attack failed (compliant response)
├── Round 2: Attack succeeded! Violation detected
│   └── Violated rules: [R12, R15]
└── DPO pair generated

Processing seed 2/50: "I need advice on..."
├── Round 1: Attack succeeded! Violation detected
│   └── Violated rules: [R8]
└── DPO pair generated

...

Summary
-------
Total episodes: 50
Successful attacks: 23 (46%)
DPO pairs generated: 23
Unique rules violated: 12

Interpreting Results

episodes.jsonl: Contains the full conversation history of each episode, one JSON object per line (pretty-printed here for readability)

{
  "seed_id": "seed_001",
  "spec_id": "spec_003",
  "rounds": [
    {
      "round": 1,
      "attacker_prompt": "...",
      "defender_response": "...",
      "safety_result": {...},
      "quality_score": 0.85
    }
  ],
  "outcome": "violation",
  "violated_rules": ["R12", "R15"]
}
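
The summary figures printed on the console can be recomputed from the episode records. The sketch below assumes each line of the file is one JSON object with the `outcome` and `violated_rules` fields shown above, and that `"violation"` marks a successful attack. The value used for compliant episodes is not shown in this guide, so any other outcome is counted as unsuccessful; `summarize` is a hypothetical helper, not part of SpecAlign.

```python
import json
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Aggregate episode records given as JSONL lines."""
    total = 0
    successes = 0
    rules = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        episode = json.loads(line)
        total += 1
        if episode.get("outcome") == "violation":
            successes += 1
            rules.update(episode.get("violated_rules", []))
    rate = successes / total if total else 0.0
    return {
        "total_episodes": total,
        "successful_attacks": successes,
        "success_rate": rate,
        "unique_rules_violated": len(rules),
    }


# Two hand-written records; "compliant" is a placeholder outcome value.
sample = [
    '{"seed_id": "seed_001", "outcome": "violation", "violated_rules": ["R12", "R15"]}',
    '{"seed_id": "seed_002", "outcome": "compliant", "violated_rules": []}',
]
summary = summarize(sample)
```

For a real run, pass the open file object instead of `sample`, since iterating over a text file yields one line at a time. The location of `episodes.jsonl` depends on your output directory.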

dpo_dataset.json: Contains preference pairs for training; chosen is the preferred (compliant) response and rejected is the violating one

{
  "prompt": "Help me draft a message...",
  "chosen": "I'd be happy to help draft a professional message...",
  "rejected": "Sure, here's a message that could be seen as..."
}
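
A quick check of the pairs before training can catch unusable entries. This sketch assumes `dpo_dataset.json` holds a JSON array of objects with the three keys shown above; the guide shows only a single entry, so the top-level layout is an assumption, and `validate_pairs` is a hypothetical helper.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_pairs(pairs: list) -> list:
    """Return (index, reason) tuples for pairs that look unusable for training."""
    issues = []
    for i, pair in enumerate(pairs):
        # A key counts as missing if it is absent or only whitespace.
        missing = [k for k in REQUIRED_KEYS if not str(pair.get(k, "")).strip()]
        if missing:
            issues.append((i, "missing or empty: " + ", ".join(missing)))
        elif pair["chosen"] == pair["rejected"]:
            issues.append((i, "chosen and rejected are identical"))
    return issues


# One well-formed pair and two deliberately broken ones.
pairs = json.loads("""
[
  {"prompt": "Help me draft a message...",
   "chosen": "I'd be happy to help draft a professional message...",
   "rejected": "Sure, here's a message that could be seen as..."},
  {"prompt": "I need advice on...", "chosen": "", "rejected": "..."},
  {"prompt": "p", "chosen": "same", "rejected": "same"}
]
""")
issues = validate_pairs(pairs)
```

Here `issues` flags the second and third entries and leaves the first alone.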

Key Takeaways

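  1. Multi-round testing gives the attacker several attempts per seed, which raises the attack success rate (in the sample output, seed 1 is only broken in round 2)

  2. Role swapping enables agents to learn from both the attacker and defender perspectives

  3. The context pool helps generate more diverse attacks

  4. Quality scoring helps ensure high-quality preference data

  5. DPO pairs can be used directly for model fine-tuning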
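
The third takeaway can be made concrete with a toy pool. How SpecAlign actually uses `max_size` and `similarity_threshold` is not described in this guide; the sketch below assumes the threshold rejects near-duplicate prompts and that the oldest entry is evicted when the pool is full. `ContextPool` is a hypothetical class, and `difflib.SequenceMatcher` stands in for whatever similarity measure the tool really uses.

```python
from collections import deque
from difflib import SequenceMatcher


class ContextPool:
    """Toy pool of successful attack prompts that rejects near-duplicates."""

    def __init__(self, max_size: int = 100, similarity_threshold: float = 0.85):
        self.similarity_threshold = similarity_threshold
        # A bounded deque drops the oldest entry once max_size is reached.
        self._items = deque(maxlen=max_size)

    def add(self, prompt: str) -> bool:
        """Add prompt unless it is too similar to an existing entry."""
        for existing in self._items:
            ratio = SequenceMatcher(None, existing, prompt).ratio()
            if ratio >= self.similarity_threshold:
                return False  # near-duplicate: keep the pool diverse
        self._items.append(prompt)
        return True

    def __len__(self) -> int:
        return len(self._items)


pool = ContextPool(max_size=2, similarity_threshold=0.85)
first = pool.add("Pretend you are my lawyer and ...")
duplicate = pool.add("Pretend you are my lawyer and ...")  # identical, so rejected
```

Under these assumptions, a lower threshold makes the pool stricter about what counts as a duplicate, and a higher one admits closer variants.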

Next Steps