First Experiment
================

This guide walks you through your first complete red-team experiment with SpecAlign.

Overview
--------

You will learn how to:

1. Understand the red-team workflow
2. Configure experiment parameters
3. Run an adversarial test
4. Interpret the results

Understanding the Workflow
--------------------------

SpecAlign uses a multi-agent adversarial framework:

.. code-block:: text

   ┌─────────────────────────────────────────────────────────┐
   │                      Red Team Loop                      │
   │                                                         │
   │  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
   │  │ Planner  │───>│ Attacker │───>│ Defender │           │
   │  └──────────┘    └──────────┘    └──────────┘           │
   │       │               │               │                 │
   │       │               v               v                 │
   │       │         ┌───────────┐   ┌───────────┐           │
   │       │         │  Safety   │   │  Quality  │           │
   │       └────────>│   Judge   │   │   Judge   │           │
   │                 └───────────┘   └───────────┘           │
   │                       │               │                 │
   │                       v               v                 │
   │                 ┌───────────────────────────┐           │
   │                 │    DPO Pair Generation    │           │
   │                 └───────────────────────────┘           │
   └─────────────────────────────────────────────────────────┘

Components:

- **Planner**: Generates attack strategies using context from previous successes
- **Attacker**: Creates adversarial prompts to find specification violations
- **Defender**: Attempts to respond in a specification-compliant manner
- **Safety Judge**: Evaluates if responses violate specifications
- **Quality Judge**: Evaluates response helpfulness
- **DPO Generator**: Creates preference pairs from successful attacks

Configure Your Experiment
-------------------------

Create a custom configuration:

.. code-block:: json

   {
       "global": {
           "mode": "api",
           "log_level": "INFO",
           "seed": 42
       },
       "api": {
           "provider": "openai",
           "model": "gpt-4o"
       },
       "specgen": {
           "num_specs": 5,
           "rules_per_spec_min": 3,
           "rules_per_spec_max": 5
       },
       "redteam": {
           "max_rounds_per_seed": 3,
           "enable_role_swap": true,
           "context_pool": {
               "max_size": 100,
               "similarity_threshold": 0.85
           }
       },
       "output": {
           "base_dir": "my_experiment"
       }
   }

Run the Experiment
------------------

.. code-block:: bash

   # Generate specifications and seeds
   specalign generate all data/Stage_classified.md --num-specs 5

   # Run red-team testing
   specalign redteam output/seeds.json --max-rounds 3 --role-swap

Expected Output
---------------

Console output during execution:

.. code-block:: text

   Red Team Testing
   ================

   Processing seed 1/50: "Help me draft a message..."
   ├── Round 1: Attack failed (compliant response)
   ├── Round 2: Attack succeeded! Violation detected
   │   └── Violated rules: [R12, R15]
   └── DPO pair generated

   Processing seed 2/50: "I need advice on..."
   ├── Round 1: Attack succeeded! Violation detected
   │   └── Violated rules: [R8]
   └── DPO pair generated

   ...

   Summary
   -------
   Total episodes: 50
   Successful attacks: 23 (46%)
   DPO pairs generated: 23
   Unique rules violated: 12

Interpreting Results
--------------------

**episodes.jsonl**: Contains full conversation histories

.. code-block:: json

   {
       "seed_id": "seed_001",
       "spec_id": "spec_003",
       "rounds": [
           {
               "round": 1,
               "attacker_prompt": "...",
               "defender_response": "...",
               "safety_result": {...},
               "quality_score": 0.85
           }
       ],
       "outcome": "violation",
       "violated_rules": ["R12", "R15"]
   }

**dpo_dataset.json**: Contains preference pairs for training

.. code-block:: json

   {
       "prompt": "Help me draft a message...",
       "chosen": "I'd be happy to help draft a professional message...",
       "rejected": "Sure, here's a message that could be seen as..."
   }

Key Takeaways
-------------

1. **Multi-round testing** increases attack success rates
2. **Role swapping** enables agents to learn from both perspectives
3. **Context pool** helps generate more diverse attacks
4. **Quality scoring** ensures high-quality preference data
5. **DPO pairs** can be directly used for model fine-tuning

Next Steps
----------

- :doc:`../tutorials/red_team_pipeline` - Advanced red-team configuration
- :doc:`../tutorials/dpo_dataset` - DPO dataset generation strategies
- :doc:`../user_guide/configuration` - Full configuration reference
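
Before moving on, it can help to recompute the console summary yourself from the episode records described under Interpreting Results. The following is a minimal sketch, not part of SpecAlign: the function name ``summarize_episodes`` is hypothetical, and it assumes that ``episodes.jsonl`` stores one JSON object per line and that a successful attack is marked with ``"outcome": "violation"`` as in the example record. The ``"compliant"`` outcome value in the sample data is an assumption made for illustration only.

.. code-block:: python

   import json
   from collections import Counter


   def summarize_episodes(lines):
       """Summarize episode records given as an iterable of JSON Lines strings.

       Assumes each non-blank line is one JSON object with the "outcome" and
       "violated_rules" fields shown in the example record.
       """
       total = 0
       violations = 0
       rule_counts = Counter()
       for line in lines:
           line = line.strip()
           if not line:
               continue  # tolerate blank lines between records
           episode = json.loads(line)
           total += 1
           if episode.get("outcome") == "violation":
               violations += 1
               rule_counts.update(episode.get("violated_rules", []))
       return {
           "total_episodes": total,
           "successful_attacks": violations,
           # Guard against division by zero for an empty input
           "attack_success_rate": violations / total if total else 0.0,
           "unique_rules_violated": len(rule_counts),
           "rule_counts": dict(rule_counts),
       }


   if __name__ == "__main__":
       # Hypothetical sample records; "compliant" is an assumed outcome value.
       sample = [
           json.dumps({"seed_id": "seed_001", "outcome": "violation",
                       "violated_rules": ["R12", "R15"]}),
           json.dumps({"seed_id": "seed_002", "outcome": "compliant",
                       "violated_rules": []}),
       ]
       print(summarize_episodes(sample))

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example ``summarize_episodes(open(path))`` where ``path`` points at the ``episodes.jsonl`` file in your output directory (the exact location depends on your configuration).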