Red Team Pipeline ================= Complete guide to running adversarial red-team testing. Overview -------- In this tutorial, you will learn: - How the multi-agent red-team system works - How to configure the adversarial pipeline - How to run and monitor red-team sessions - How to analyze attack success patterns Prerequisites ------------- - SpecAlign installed with API credentials configured - Generated seeds file (see :doc:`spec_generation`) - Understanding of the red-team workflow Complete Example ---------------- .. code-block:: python """ Red Team Pipeline Example This script demonstrates the complete adversarial testing workflow, from loading seeds to generating DPO preference pairs. """ import json from pathlib import Path from specalign.core import ( RedTeamOrchestrator, ContextPool, create_planner, create_dual_agent, create_judges, ) from specalign.config import load_config from specalign.utils.logging import setup_logging, ProgressTracker def setup_components(config: dict): """ Initialize all red-team components. Args: config: Configuration dictionary Returns: Tuple of initialized components """ print("Step 1: Initializing components...") # Context pool for storing successful attacks context_pool = ContextPool( max_size=config['redteam']['context_pool']['max_size'], similarity_threshold=config['redteam']['context_pool']['similarity_threshold'] ) print("✓ Context pool initialized") # Planner for generating attack strategies planner = create_planner( config=config, context_pool=context_pool ) print("✓ Planner initialized") # Dual-role agent (can act as attacker or defender) agent = create_dual_agent(config) print("✓ Dual-role agent initialized") # Safety and quality judges safety_judge, quality_judge = create_judges(config) print("✓ Judges initialized") return context_pool, planner, agent, safety_judge, quality_judge def load_seeds(seeds_path: str) -> list: """Load seed prompts from file.""" print("\nStep 2: Loading seeds...") with open(seeds_path, 'r') as f: seeds = json.load(f) print(f"✓ Loaded {len(seeds)} seeds") return seeds def run_red_team_loop( orchestrator: RedTeamOrchestrator, seeds: list, max_rounds: int = 5 ) -> list: """ Execute the red-team adversarial loop. Args: orchestrator: The RedTeamOrchestrator instance seeds: List of seed prompts max_rounds: Maximum rounds per seed Returns: List of episode results """ print("\nStep 3: Running red-team loop...") print("-" * 50) episodes = [] tracker = ProgressTracker(total=len(seeds), desc="Processing seeds") for seed in seeds: # Run episode for this seed episode = orchestrator.run_episode( seed=seed, max_rounds=max_rounds, enable_role_swap=True ) episodes.append(episode) # Log result status = "✓ Attack succeeded" if episode['success'] else "✗ Attack failed" print(f"Seed {seed['id']}: {status}") if episode['success']: print(f" └── Violated rules: {episode['violated_rules']}") tracker.update(1) tracker.close() return episodes def construct_dpo_pairs(episodes: list) -> list: """ Construct DPO preference pairs from successful attacks. Args: episodes: List of episode results Returns: List of DPO preference pairs """ print("\nStep 4: Constructing DPO pairs...") dpo_pairs = [] for episode in episodes: if not episode['success']: continue # Get the successful attack round attack_round = episode['attack_round'] # Construct preference pair pair = { 'prompt': attack_round['attacker_prompt'], 'chosen': episode['compliant_response'], # Generated compliant response 'rejected': attack_round['defender_response'], # Violating response 'spec_id': episode['spec_id'], 'violated_rules': episode['violated_rules'] } dpo_pairs.append(pair) print(f"✓ Generated {len(dpo_pairs)} DPO pairs") return dpo_pairs def analyze_results(episodes: list): """Print analysis of red-team results.""" print("\nStep 5: Analyzing results...") print("-" * 50) total = len(episodes) successful = sum(1 for e in episodes if e['success']) # Collect violated rules all_violated = [] for e in episodes: if e['success']: all_violated.extend(e['violated_rules']) from collections import Counter rule_counts = Counter(all_violated) print(f"Total episodes: {total}") print(f"Successful attacks: {successful} ({100*successful/total:.1f}%)") print(f"Unique rules violated: {len(rule_counts)}") print("\nTop violated rules:") for rule, count in rule_counts.most_common(5): print(f" {rule}: {count} times") def save_outputs(episodes: list, dpo_pairs: list, output_dir: str): """Save all outputs to files.""" print("\nStep 6: Saving outputs...") output_path = Path(output_dir) output_path.mkdir(parents=True, exist_ok=True) # Save episodes episodes_file = output_path / "episodes.jsonl" with open(episodes_file, 'w') as f: for ep in episodes: f.write(json.dumps(ep) + '\n') print(f"✓ Saved episodes to {episodes_file}") # Save DPO pairs dpo_file = output_path / "dpo_dataset.json" with open(dpo_file, 'w') as f: json.dump(dpo_pairs, f, indent=2) print(f"✓ Saved DPO pairs to {dpo_file}") def main(): """Run the complete red-team pipeline.""" print("=" * 50) print("Red Team Pipeline Example") print("=" * 50) # Load configuration config = load_config("config.json") setup_logging(config['global']['log_level']) # Initialize components context_pool, planner, agent, safety_judge, quality_judge = setup_components(config) # Create orchestrator orchestrator = RedTeamOrchestrator( config=config, planner=planner, agent=agent, safety_judge=safety_judge, quality_judge=quality_judge, context_pool=context_pool ) # Load seeds and run seeds = load_seeds("output/seeds.json") episodes = run_red_team_loop(orchestrator, seeds, max_rounds=5) # Process results dpo_pairs = construct_dpo_pairs(episodes) analyze_results(episodes) save_outputs(episodes, dpo_pairs, "output") print("\n" + "=" * 50) print("✓ Red-team pipeline complete!") print("=" * 50) if __name__ == "__main__": main() Expected Output --------------- .. code-block:: text ================================================== Red Team Pipeline Example ================================================== Step 1: Initializing components... ✓ Context pool initialized ✓ Planner initialized ✓ Dual-role agent initialized ✓ Judges initialized Step 2: Loading seeds... ✓ Loaded 100 seeds Step 3: Running red-team loop... -------------------------------------------------- Seed seed_001: ✓ Attack succeeded └── Violated rules: ['R12', 'R15'] Seed seed_002: ✗ Attack failed Seed seed_003: ✓ Attack succeeded └── Violated rules: ['R8'] ... Processing seeds: 100%|██████████| 100/100 Step 4: Constructing DPO pairs... ✓ Generated 42 DPO pairs Step 5: Analyzing results... -------------------------------------------------- Total episodes: 100 Successful attacks: 42 (42.0%) Unique rules violated: 15 Top violated rules: R12: 18 times R8: 12 times R15: 9 times R3: 7 times R21: 5 times Step 6: Saving outputs... ✓ Saved episodes to output/episodes.jsonl ✓ Saved DPO pairs to output/dpo_dataset.json ================================================== ✓ Red-team pipeline complete! ================================================== CLI Alternative --------------- .. code-block:: bash # Run red-team with default settings specalign redteam output/seeds.json # With custom parameters specalign redteam output/seeds.json \ --max-rounds 5 \ --role-swap \ --max-seeds 100 \ --output output/ Key Takeaways ------------- 1. **RedTeamOrchestrator** coordinates all components in the adversarial loop 2. **Context Pool** stores successful attacks for strategy improvement 3. **Planner** uses historical successes to generate better attack strategies 4. **Role swapping** allows agents to learn from both attacker and defender perspectives 5. **DPO pairs** are automatically generated from successful attacks for alignment training Next Steps ---------- - :doc:`dpo_dataset` - Advanced DPO dataset construction strategies - :doc:`../user_guide/configuration` - Fine-tune red-team parameters - :doc:`../api_reference/core` - Core module API reference