Skip to content

Agent Simulation

Drive your agent through realistic multi-turn conversations without writing test transcripts by hand. Three LLMs are in play:

  • Your agent — the target under test (a hosted Orq agent, a callback, or an Orq deployment).
  • User simulator — plays a persona pursuing a scenario goal, turn by turn.
  • Judge — scores whether the goal was met and whether any rules were broken.

Requires the simulation extra and an ORQ_API_KEY:

pip install "evaluatorq[simulation]"
export ORQ_API_KEY=...

Requires the simulation extra, the openai package, and an OPENAI_API_KEY:

pip install "evaluatorq[simulation]" openai
export OPENAI_API_KEY=sk-...
sequenceDiagram
    participant U as User simulator
    participant A as Agent under test
    participant J as Judge

    U->>A: next user turn
    A-->>U: agent reply
    loop until max_turns or stop condition
        U->>A: follow-up turn
        A-->>U: response
    end
    U->>J: full transcript + scenario
    Note over J: scores goal_achieved / criteria_met

Generate from a one-line description

The fastest start: generate_and_simulate() synthesizes the personas, scenarios, and opening messages from a short description of your agent — no hand-written Persona(...) / Scenario(...).

Point it at a hosted Orq agent with target="agent:<key>" (the agent key from AI Studio → Agents). The simulator and judge LLMs route through Orq by default.

import asyncio

from evaluatorq.simulation import generate_and_simulate


async def main():
    results = await generate_and_simulate(
        evaluation_name="support-agent-sim",
        target="agent:my-support-agent",     # hosted Orq agent, routed via ORQ_API_KEY
        agent_description=(
            "Customer support agent for an e-commerce store; "
            "handles refunds, orders, and product questions."
        ),
        num_personas=3,
        num_scenarios=4,                     # → 12 persona × scenario simulations
        max_turns=6,
        evaluator_names=["goal_achieved", "criteria_met"],
    )

    passed = sum(r.goal_achieved for r in results)
    print(f"Pass rate: {passed}/{len(results)}")


if __name__ == "__main__":
    asyncio.run(main())

Pass sim_model= to route the simulator and judge through OpenAI directly. Use target_callback= for the agent under test.

import asyncio

from openai import AsyncOpenAI

from evaluatorq.contracts import Message
from evaluatorq.simulation import generate_and_simulate

client = AsyncOpenAI()

SYSTEM = "You are a customer support agent for Acme Corp. Be concise and helpful."


async def openai_agent(messages: list[Message]) -> str:
    history = [{"role": "system", "content": SYSTEM}]
    history += [{"role": m.role, "content": m.content or ""} for m in messages]
    resp = await client.chat.completions.create(model="gpt-4o-mini", messages=history)
    return resp.choices[0].message.content or ""


async def main():
    results = await generate_and_simulate(
        evaluation_name="support-agent-sim-openai",
        target_callback=openai_agent,
        agent_description=(
            "Customer support agent for an e-commerce store; "
            "handles refunds, orders, and product questions."
        ),
        num_personas=3,
        num_scenarios=4,
        sim_model="gpt-4o-mini",             # simulator + judge on OpenAI directly
        max_turns=6,
        evaluator_names=["goal_achieved", "criteria_met"],
        upload_results=False,
    )

    passed = sum(r.goal_achieved for r in results)
    print(f"Pass rate: {passed}/{len(results)}")


if __name__ == "__main__":
    asyncio.run(main())

agent_description drives generation; num_personas × num_scenarios is how many conversations run. The simulator and judge LLMs resolve their provider by precedence: if ORQ_API_KEY is set they route through the Orq AI Router; otherwise they fall back to OpenAI via OPENAI_API_KEY (an explicitly passed client always wins). See Configuration.

Seed by archetype

The middle ground between "just give me five" and specifying every trait: name the archetype, and generate_persona() / generate_scenario() fill the rest. You get back real Persona / Scenario objects to inspect, tweak, and pass to simulate().

import asyncio

from evaluatorq.simulation import generate_persona, generate_scenario, simulate


async def main():
    persona = await generate_persona(
        "angry customer",
        agent_description="e-commerce support agent",
    )
    scenario = await generate_scenario("disputes a refund denial")

    results = await simulate(
        evaluation_name="seeded-simulation",
        target="agent:my-support-agent",
        personas=[persona],
        scenarios=[scenario],
        max_turns=6,
        evaluator_names=["goal_achieved", "criteria_met"],
    )
    print(f"Goal achieved: {results[0].goal_achieved}")


if __name__ == "__main__":
    asyncio.run(main())

Batch forms generate_personas([...]) / generate_scenarios([...]) take a list of seeds and return one object each.

Full control: hand-build personas

When you want exact personas and pass/fail criteria, build them yourself and call simulate(). A persona is who is talking (patience, assertiveness, tone); a scenario is what they want plus the criteria the agent must (or must not) satisfy.

A persona requires its core traits — name, patience, assertiveness, politeness, technical_level, communication_style, and background. Only emotional_arc and cultural_context default (to None). A scenario needs just name and goal; everything else, including criteria, is optional.

Pass target="agent:<key>" (the agent key from AI Studio → Agents) to route to a hosted Orq agent.

import asyncio

from evaluatorq.simulation import simulate
from evaluatorq.simulation.types import (
    CommunicationStyle, Criterion, EmotionalArc, Persona, Scenario, StartingEmotion,
)


async def main():
    persona = Persona(
        name="Impatient Customer",
        patience=0.2, assertiveness=0.8, politeness=0.4, technical_level=0.3,
        communication_style=CommunicationStyle.terse,
        background="Received the wrong item and wants a refund urgently",
        emotional_arc=EmotionalArc.escalating,
    )
    scenario = Scenario(
        name="Wrong Item Refund",
        goal="Get a full refund for the wrong item received",
        context="Ordered headphones but received a phone case instead",
        starting_emotion=StartingEmotion.frustrated,
        criteria=[
            Criterion(description="Agent asks for order details", type="must_happen"),
            Criterion(description="Agent acknowledges the mistake", type="must_happen"),
            Criterion(description="Agent blames the customer", type="must_not_happen"),
        ],
    )

    results = await simulate(
        evaluation_name="basic-simulation-example",
        target="agent:my-support-agent",    # hosted Orq agent, routed via ORQ_API_KEY
        personas=[persona],
        scenarios=[scenario],
        max_turns=6,
        evaluator_names=["goal_achieved", "criteria_met"],
    )

    result = results[0]
    score = result.goal_completion_score or 0.0
    print(f"Goal achieved: {result.goal_achieved}  score={score:.2f}")
    for msg in result.messages:
        who = "User" if msg.role == "user" else "Agent"
        print(f"{who}: {msg.content}")


if __name__ == "__main__":
    asyncio.run(main())

Use target_callback= with any async function that maps the conversation to your agent's reply. Pass sim_model= to run the simulator and judge on OpenAI directly. Set upload_results=False for a local-only run.

import asyncio

from openai import AsyncOpenAI

from evaluatorq.contracts import Message
from evaluatorq.simulation import simulate
from evaluatorq.simulation.types import CommunicationStyle, Criterion, Persona, Scenario

client = AsyncOpenAI()

SYSTEM = "You are a customer support agent for Acme Corp. Be concise and helpful."


async def openai_agent(messages: list[Message]) -> str:
    """Your agent under test — a raw OpenAI model."""
    history = [{"role": "system", "content": SYSTEM}]
    history += [{"role": m.role, "content": m.content or ""} for m in messages]
    resp = await client.chat.completions.create(model="gpt-4o-mini", messages=history)
    return resp.choices[0].message.content or ""


async def main():
    persona = Persona(
        name="Impatient Customer",
        patience=0.2, assertiveness=0.8, politeness=0.4, technical_level=0.3,
        communication_style=CommunicationStyle.terse,
        background="Received the wrong item and wants a refund urgently",
    )
    scenario = Scenario(
        name="Wrong Item Refund",
        goal="Get a full refund for the wrong item received",
        criteria=[
            Criterion(description="Agent asks for order details", type="must_happen"),
        ],
    )

    results = await simulate(
        evaluation_name="openai-agent-simulation",
        target_callback=openai_agent,        # your OpenAI agent
        personas=[persona],
        scenarios=[scenario],
        sim_model="gpt-4o-mini",             # simulator + judge on OpenAI directly
        max_turns=6,
        evaluator_names=["goal_achieved", "criteria_met"],
        upload_results=False,                # local-only run, no Orq experiment
    )

    result = results[0]
    score = result.goal_completion_score or 0.0
    print(f"Goal achieved: {result.goal_achieved}  score={score:.2f}")


if __name__ == "__main__":
    asyncio.run(main())

One persona × one scenario yields one SimulationResult with goal_achieved, goal_completion_score, turn_count, rules_broken, and the full message transcript.

The target_callback is the only structural difference from the Orq path — personas, scenarios, criteria, and the result shape are identical. Swap the callback body for any HTTP/LLM agent.

View results in the local dashboard

Run eq dashboard to browse saved red-team and simulation reports together; eq redteam ui / eq sim ui open the legacy Streamlit views for one surface.

Where to next