evaluatorq¶

Run LLM evaluations, red-team agents, and simulate multi-turn conversations in Python — against any agent, with the Orq AI platform as optional infrastructure.

Get Started View on GitHub

Install¶

pip install evaluatorqSuccessfully installed evaluatorq

Optional extras

pip install "evaluatorq[redteam]" adds adversarial red teaming · pip install "evaluatorq[simulation]" adds multi-turn agent simulation.

What it does¶

Evaluations

Run jobs over inline data or Orq datasets in parallel; score with custom or built-in evaluators; gate CI on pass/fail.

Getting Started
Agent simulation

A user-simulator LLM drives your agent across multi-turn conversations while a judge LLM scores whether it met its goals.

Agent simulation
Red teaming

Adaptive adversarial attacks mapped to the OWASP LLM Top 10 and Agentic Security Initiative, with auto-discovered tool and memory attack surfaces.

Red teaming

Works with LangGraph, OpenAI Agents SDK, PydanticAI, CrewAI, a plain async function, or an Orq deployment. The Orq platform is optional: it stores results and, when ORQ_API_KEY is set, routes the attacker and judge LLMs by default — but you can bring your own and run entirely on OpenAI.

Quick look¶

import asyncio

from evaluatorq import (
    DataPoint,
    evaluatorq,
    job,
    string_contains_evaluator,
)


@job("greet")
async def greet_job(data: DataPoint, _row: int) -> str:
    name = str(data.inputs.get("name", ""))
    return f"Hello, {name}!"


async def main():
    data = [
        DataPoint(inputs={"name": "Ada"}, expected_output="Hello, Ada!"),
        DataPoint(inputs={"name": "Lin"}, expected_output="Hello, Lin!"),
    ]
    await evaluatorq(
        "smoke-test",
        data=data,
        jobs=[greet_job],
        evaluators=[string_contains_evaluator()],
        print_results=True,
    )


asyncio.run(main())

print_results=True renders a summary and a per-evaluator score panel:

EVALUATION RESULTS

Summary:
╭──────────────────────┬───────╮
│ Metric               │ Value │
├──────────────────────┼───────┤
│ Total Data Points    │ 2     │
│ Failed Data Points   │ 0     │
│ Total Jobs           │ 2     │
│ Failed Jobs          │ 0     │
│ Success Rate         │ 100%  │
╰──────────────────────┴───────╯

Detailed Results:
╭──────────────────┬───────╮
│ Evaluators       │ greet │
├──────────────────┼───────┤
│ string-contains  │ 1.00  │
╰──────────────────┴───────╯

Where to next¶

Getting Started — your first evaluation in five minutes.
Examples — runnable scripts across every capability.
Custom Evaluators & Frameworks — extend the registries.
API Reference — the full public API.
Roadmap — what's planned next.