First Experiment
================

This guide walks you through your first complete red-team experiment with SpecAlign.

Overview
--------

You will learn how to:

1. Understand the red-team workflow
2. Configure experiment parameters
3. Run an adversarial test
4. Interpret the results

Understanding the Workflow
--------------------------

SpecAlign uses a multi-agent adversarial framework:

.. code-block:: text

   ┌─────────────────────────────────────────────────────────┐
   │                    Red Team Loop                        │
   │                                                         │
   │   ┌──────────┐    ┌──────────┐    ┌──────────┐         │
   │   │ Planner  │───>│ Attacker │───>│ Defender │         │
   │   └──────────┘    └──────────┘    └──────────┘         │
   │        │               │               │                │
   │        │               v               v                │
   │        │         ┌───────────┐  ┌───────────┐          │
   │        │         │  Safety   │  │  Quality  │          │
   │        └────────>│   Judge   │  │   Judge   │          │
   │                  └───────────┘  └───────────┘          │
   │                        │               │                │
   │                        v               v                │
   │                  ┌───────────────────────────┐         │
   │                  │    DPO Pair Generation    │         │
   │                  └───────────────────────────┘         │
   └─────────────────────────────────────────────────────────┘

Components:

- **Planner**: Generates attack strategies using context from previous successes
- **Attacker**: Creates adversarial prompts to find specification violations
- **Defender**: Attempts to respond in a specification-compliant manner
- **Safety Judge**: Evaluates if responses violate specifications
- **Quality Judge**: Evaluates response helpfulness
- **DPO Generator**: Creates preference pairs from successful attacks

Configure Your Experiment
-------------------------

Create a custom configuration:

.. code-block:: json

   {
     "global": {
       "mode": "api",
       "log_level": "INFO",
       "seed": 42
     },
     "api": {
       "provider": "openai",
       "model": "gpt-4o"
     },
     "specgen": {
       "num_specs": 5,
       "rules_per_spec_min": 3,
       "rules_per_spec_max": 5
     },
     "redteam": {
       "max_rounds_per_seed": 3,
       "enable_role_swap": true,
       "context_pool": {
         "max_size": 100,
         "similarity_threshold": 0.85
       }
     },
     "output": {
       "base_dir": "my_experiment"
     }
   }

Run the Experiment
------------------

.. code-block:: bash

   # Generate specifications and seeds
   specalign generate all data/Stage_classified.md --num-specs 5

   # Run red-team testing
   specalign redteam output/seeds.json --max-rounds 3 --role-swap

Expected Output
---------------

Console output during execution:

.. code-block:: text

   Red Team Testing
   ================

   Processing seed 1/50: "Help me draft a message..."
   ├── Round 1: Attack failed (compliant response)
   ├── Round 2: Attack succeeded! Violation detected
   │   └── Violated rules: [R12, R15]
   └── DPO pair generated

   Processing seed 2/50: "I need advice on..."
   ├── Round 1: Attack succeeded! Violation detected
   │   └── Violated rules: [R8]
   └── DPO pair generated

   ...

   Summary
   -------
   Total episodes: 50
   Successful attacks: 23 (46%)
   DPO pairs generated: 23
   Unique rules violated: 12

Interpreting Results
--------------------

**episodes.jsonl**: Contains full conversation histories

.. code-block:: json

   {
     "seed_id": "seed_001",
     "spec_id": "spec_003",
     "rounds": [
       {
         "round": 1,
         "attacker_prompt": "...",
         "defender_response": "...",
         "safety_result": {...},
         "quality_score": 0.85
       }
     ],
     "outcome": "violation",
     "violated_rules": ["R12", "R15"]
   }

**dpo_dataset.json**: Contains preference pairs for training

.. code-block:: json

   {
     "prompt": "Help me draft a message...",
     "chosen": "I'd be happy to help draft a professional message...",
     "rejected": "Sure, here's a message that could be seen as..."
   }

Key Takeaways
-------------

1. **Multi-round testing** increases attack success rates
2. **Role swapping** enables agents to learn from both perspectives
3. **Context pool** helps generate more diverse attacks
4. **Quality scoring** ensures high-quality preference data
5. **DPO pairs** can be directly used for model fine-tuning

Next Steps
----------

- :doc:`../tutorials/red_team_pipeline` - Advanced red-team configuration
- :doc:`../tutorials/dpo_dataset` - DPO dataset generation strategies
- :doc:`../user_guide/configuration` - Full configuration reference