Changelog
All notable changes to evaluatorq are documented here.
[1.3.0] - unreleased
Notable defaults
- EVALUATORQ_SPAN_MAX_TEXT_CHARS defaults to capturing all message content (no truncation) in both the Python and TypeScript tracing layers. Set the env var to a positive integer (canonical: 8192) to cap span text at that many characters (marker: "... [truncated]"); -1, 0, or unset all mean capture all. The cap applies uniformly to input and output message content. (RES-715 introduced an 8192 default; RES-899 reverts to capture-all and unifies the TS path, which previously hardcoded a separate 2000-char cap.)
- loguru is now a core dependency (previously gated behind the [redteam] extra). This slightly widens the install footprint for non-redteam consumers but unifies the logging stack across the package.
- openai (>=1.92.0) is now a core dependency (previously gated behind the [redteam] extra). The new llm_jury() evaluator imports it at package load, so every base install pulls it in; this widens the base footprint for users who only call evaluate(), in exchange for llm_jury() working without an extra.
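The cap semantics described above can be sketched roughly as follows. This is an illustration only, not the package's implementation: resolve_max_chars and truncate_for_span_sketch are hypothetical names, and two behaviours are assumptions (unparsable or other non-positive values fall back to capture-all, and the marker is appended after the capped text).

```python
from __future__ import annotations

import os

TRUNCATION_MARKER = "... [truncated]"  # marker text as quoted in the changelog


def resolve_max_chars(raw: str | None) -> int | None:
    """Return the character cap, or None for capture-all (unset, 0, or -1)."""
    if raw is None or raw.strip() == "":
        return None
    try:
        value = int(raw)
    except ValueError:
        return None  # assumption: unparsable values mean capture-all
    return value if value > 0 else None


def truncate_for_span_sketch(text: str, max_chars: int | None) -> str:
    """Cap text at max_chars characters and append the marker when it was cut."""
    if max_chars is None or len(text) <= max_chars:
        return text
    return text[:max_chars] + TRUNCATION_MARKER


# Read the cap once from the environment; None means no truncation.
cap = resolve_max_chars(os.environ.get("EVALUATORQ_SPAN_MAX_TEXT_CHARS"))
```

With the variable unset, cap is None and every message passes through unchanged; exporting EVALUATORQ_SPAN_MAX_TEXT_CHARS=8192 would cap each input and output message at 8192 characters in this sketch.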
Breaking Changes
- red_team() parameter renamed: config= → llm_config=. The old config= keyword still works in 1.3.0 but emits a DeprecationWarning and will be removed in 1.4.0.
- LLMConfig flat fields removed: attack_model, evaluator_model, adversarial_temperature, adversarial_max_tokens, llm_call_timeout_ms, llm_kwargs. They are replaced by role-based attacker / evaluator sub-configs (LLMCallConfig).
- wrap_simulation_agent() no longer accepts the evaluators= kwarg. Evaluators are wired through evaluatorq() directly (the framework that consumes the job); callers passing evaluators=[...] will now get a TypeError and should move the list onto their evaluatorq(..., evaluators=...) call instead (RES-594).
- simulate() and generate_and_simulate() no longer accept agent_key=. The single target= parameter now selects the target: "agent:<key>" or a bare "<key>" (hosted Orq agent via the Responses router), "deployment:<key>" (legacy deployment), an AgentTarget, or a callable. Callers passing agent_key=... get a TypeError; migrate to target="deployment:<key>" (or target="agent:<key>"). The eq sim simulate / eq sim run CLI drops its matching --agent-key flag; use --target deployment:<key>.
- simulate() and generate_and_simulate() now default upload_results=True. With the move to evaluatorq-native execution, the framework's upload is the canonical persistence path; the previous False default left runs with no record anywhere. Set upload_results=False explicitly to suppress (RES-594).
Migration (red_team() and LLMConfig):
# Before
red_team(target, config=LLMConfig(attack_model="gpt-4o", evaluator_model="gpt-4o-mini"))
# After
from evaluatorq.redteam.contracts import LLMCallConfig, LLMConfig
red_team(
    target,
    llm_config=LLMConfig(
        attacker=LLMCallConfig(model="gpt-4o"),
        evaluator=LLMCallConfig(model="gpt-4o-mini"),
    ),
)
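For the agent_key= removal listed above, the selector strings accepted by target= can be pictured with a small sketch. Both helpers below are hypothetical and not part of evaluatorq; they only mirror the documented rules (a bare key or "agent:<key>" selects a hosted agent, "deployment:<key>" selects a legacy deployment). How the real router treats malformed selectors is an assumption here.

```python
from __future__ import annotations


def classify_target_selector(target: str) -> tuple[str, str]:
    """Split a target= string into (kind, key). Hypothetical helper."""
    kind, sep, key = target.partition(":")
    if not sep:
        # A bare "<key>" selects a hosted Orq agent.
        return ("agent", target)
    if kind in ("agent", "deployment") and key:
        return (kind, key)
    # Assumption: anything else is rejected.
    raise ValueError(f"unrecognised target selector: {target!r}")


def migrate_agent_key(agent_key: str) -> str:
    """Map a removed agent_key= value onto the deployment: selector."""
    return f"deployment:{agent_key}"
```

Under these assumptions, a call that used agent_key="support-bot" would become target=migrate_agent_key("support-bot"), that is, target="deployment:support-bot".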
- AgentTarget relocated: moved from evaluatorq.redteam.backends.base to evaluatorq.contracts. Importing it from the old path now raises ImportError. The Backend ABC stays in evaluatorq.redteam.backends.base. AgentContext, ToolInfo, MemoryStoreInfo, and KnowledgeBaseInfo also moved to evaluatorq.contracts but, unlike AgentTarget, their old import path evaluatorq.redteam.contracts still works (re-exported, same class objects, isinstance unaffected). Only AgentTarget's old path is a hard break.
Migration:
# Before
from evaluatorq.redteam.backends.base import AgentTarget
# After
from evaluatorq.contracts import AgentTarget
- AgentTarget unified on respond(messages): respond(messages: list[Message]) -> AgentResponse is now the abstract method every target implements. send_prompt(prompt: str) -> AgentResponse is retained as a concrete back-compat shim on the ABC; it wraps the prompt in a single user message and calls respond. Custom targets that previously implemented only send_prompt must implement respond instead.
Migration (bare custom subclass):
# Before: only send_prompt was abstract
from evaluatorq.contracts import AgentResponse, AgentTarget

class MyTarget(AgentTarget):
    async def send_prompt(self, prompt: str) -> AgentResponse:
        return AgentResponse(text=await my_llm_call(prompt))

    def new(self) -> "MyTarget":
        return MyTarget()

# After: respond is the abstract method; send_prompt is a free shim on the ABC
from evaluatorq.contracts import AgentResponse, AgentTarget, Message

class MyTarget(AgentTarget):
    async def respond(self, messages: list[Message]) -> AgentResponse:
        # Use the latest turn as the prompt; content may be None
        prompt = messages[-1].content or ""
        return AgentResponse(text=await my_llm_call(prompt))

    def new(self) -> "MyTarget":
        return MyTarget()
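The relationship between the abstract respond and the send_prompt shim can be sketched without installing the package. Message, AgentResponse, and AgentTargetSketch below are simplified stand-ins written for this example, not the real evaluatorq.contracts classes, and EchoTarget is a made-up target.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Message:  # stand-in for evaluatorq.contracts.Message
    role: str
    content: str | None = None


@dataclass
class AgentResponse:  # stand-in for evaluatorq.contracts.AgentResponse
    text: str


class AgentTargetSketch(ABC):
    @abstractmethod
    async def respond(self, messages: list[Message]) -> AgentResponse:
        """Subclasses implement this single method."""

    async def send_prompt(self, prompt: str) -> AgentResponse:
        # Back-compat shim: wrap the prompt in one user message and delegate.
        return await self.respond([Message(role="user", content=prompt)])


class EchoTarget(AgentTargetSketch):
    async def respond(self, messages: list[Message]) -> AgentResponse:
        return AgentResponse(text=f"echo: {messages[-1].content or ''}")


reply = asyncio.run(EchoTarget().send_prompt("hello"))
```

Because the shim lives on the base class, a subclass that implements only respond still answers send_prompt calls in this sketch.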
- OrqResponsesTarget is now stateless: __call__, _previous_response_id threading, _accumulated_usage, and get_usage() are removed. Conversation continuity is the caller's responsibility: pass the full transcript to respond each turn. Pass the target to simulate(target=...) (auto-routes to the target-agent path) or simulate(target_agent=...) instead of relying on __call__. Per-call token usage is reported on the returned AgentResponse.usage.
- ORQAgentTarget last-user contract: respond(messages) forwards only the last user message to the ORQ agents endpoint (server-side state is held via task_id) and raises ValueError if messages[-1].role != "user". The endpoint, task_id threading, and usage accumulation are unchanged.
- ChatMessage alias removed: the RES-596 deprecated alias ChatMessage = Message is gone. Import Message from evaluatorq.contracts (the public evaluatorq.simulation.ChatMessage re-export is also removed).
- Simulation TargetAgent Protocol removed: the simulation runner consumes the canonical AgentTarget ABC from evaluatorq.contracts. The evaluatorq.simulation.TargetAgent / evaluatorq.simulation.runner.TargetAgent exports are replaced by AgentTarget.
Migration:
# Before
from evaluatorq.simulation.types import ChatMessage
from evaluatorq.simulation import TargetAgent
# After
from evaluatorq.contracts import Message # ChatMessage was an alias of Message
from evaluatorq.contracts import AgentTarget # replaces the simulation TargetAgent Protocol
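With stateless targets the caller keeps the transcript and hands the whole list to respond on every turn. The loop below is a minimal sketch of that pattern; Message and CountingTarget are stand-ins invented for the example (the stand-in returns a plain string rather than an AgentResponse), and the last-message check only mirrors the ORQAgentTarget contract described above.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class Message:  # stand-in for evaluatorq.contracts.Message
    role: str
    content: str


class CountingTarget:
    """Stateless stand-in: everything it knows comes from the transcript."""

    async def respond(self, messages: list[Message]) -> str:
        if messages[-1].role != "user":
            raise ValueError("last message must be a user turn")
        user_turns = sum(1 for m in messages if m.role == "user")
        return f"reply {user_turns}"


async def run_conversation(target: CountingTarget, prompts: list[str]) -> list[Message]:
    transcript: list[Message] = []
    for prompt in prompts:
        transcript.append(Message("user", prompt))
        reply = await target.respond(transcript)  # full transcript every turn
        transcript.append(Message("assistant", reply))
    return transcript


transcript = asyncio.run(run_conversation(CountingTarget(), ["hi", "and then?"]))
```

The second reply reflects both user turns even though the target holds no state of its own, which is the property the stateless design relies on.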
- CallableTarget forwards the full transcript: the wrapped callable now receives the entire conversation as a list[Message] (previously only the last user turn as a str), so stateless callables retain context across multi-turn attacks. The callable signature changes from (prompt: str) to (messages: list[Message]), and usage_fn from (prompt: str, response: str) to (messages: list[Message], response: str). The former last-turn-must-be-user guard is dropped (matching the other stateless targets). Callables that need OpenAI chat-completion dicts can call Message.to_chat_completion() per element.
Migration:
from evaluatorq.contracts import Message
from evaluatorq.integrations.callable_integration import CallableTarget
# Before
target = CallableTarget(lambda prompt: my_agent(prompt))
# After — read the last turn off the transcript
target = CallableTarget(lambda messages: my_agent(messages[-1].content or ""))
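A callable that wants the whole conversation rather than only the last turn might look like the sketch below. The Message class is a stand-in, and the dict shape returned by its to_chat_completion() is an assumption about what the real method produces; my_agent is a placeholder that does not call any model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Message:  # stand-in for evaluatorq.contracts.Message
    role: str
    content: str | None = None

    def to_chat_completion(self) -> dict:
        # Assumption: the real method yields an OpenAI-style chat message dict.
        return {"role": self.role, "content": self.content or ""}


def my_agent(messages: list[Message]) -> str:
    """Callable with the new (messages) signature."""
    payload = [m.to_chat_completion() for m in messages]
    # A real agent would send `payload` to a chat-completions endpoint here.
    return f"saw {len(payload)} messages, last: {payload[-1]['content']}"
```

Passing my_agent to CallableTarget in place of the lambda would give the callable the full transcript each turn.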
New Features
- llm_jury(): LLM-as-a-jury evaluator for evaluatorq(evaluators=[...]). A single judge or a panel rates a target output against criteria; verdicts can be boolean (default), labeled categorical (labels= + passing_labels=), or numeric (verdict_kind="numeric" + threshold=). The panel consensus rule is selectable via aggregator=: "mode" (default) or "majority" (strict >50%) for categorical, "mean_std" (default) / "median" / "min" / "max" for numeric, or a custom Callable[[list[JuryVote]], ...]. Uses structured generation (tiered .parse → json_object fallback) and resolves the LLM client lazily on first scorer call, so declaring an evaluator never requires credentials. The Responses-API path is deferred (RES-972). (RES-848)
- OWASP_LLM_TOP_10 and OWASP_ASI_TOP_10: public list[str] constants exported from evaluatorq.redteam. Pass them to red_team(categories=OWASP_LLM_TOP_10) to run a full framework sweep without spelling out individual category codes (RES-815).
- simulate() and generate_and_simulate() accept a new opt-in upload_results= flag (default False when introduced in RES-598; RES-594 later changed the default to True, see Breaking Changes). When set to True, results are uploaded to the Orq platform after the run, surfacing as an experiment when ORQ_API_KEY is configured. Upload errors are logged but never fail the call (RES-598).
- LLMCallConfig: per-role LLM configuration with model, temperature, max_tokens, timeout_ms, extra_kwargs, and client fields
- LLMConfig: now role-based via attacker: LLMCallConfig and evaluator: LLMCallConfig; retry, cleanup, and target-agent timeout settings retained at top level
- LLMCallConfig exported from the evaluatorq.redteam public API
- OpenAIModelTarget.send_prompt now enforces timeout_ms via asyncio.wait_for
- Evaluator role config (temperature, max_tokens, timeout_ms, extra_kwargs, client) fully propagated through OWASPEvaluator, create_dynamic_evaluator, and create_owasp_evaluator
- simulate() and generate_and_simulate() accept new evaluation_description= and path= parameters, mirroring evaluatorq() and forwarded straight to it (RES-598).
- simulate() and generate_and_simulate() now run on top of evaluatorq(): persona × scenario datapoints are materialised, executed via a single evaluatorq job, and scored via adapted evaluators. This brings auto-upload, OTel tracing, the results table, CI gating, and dataset-id support to the simulation entry points "for free". The bespoke parallelism loop was removed; simulation/upload.py is kept as a standalone helper for direct callers but is no longer invoked from simulate() (RES-594).
- simulate() accepts a new dataset_id= parameter. When set, simulation datapoints are streamed from the named Orq dataset (each row's inputs must already match a simulation input shape) instead of being passed inline. Mutually exclusive with datapoints and personas / scenarios (RES-594).
- simulate() and generate_and_simulate() accept a new exit_on_failure= parameter, default True, matching evaluatorq()'s framework default. Score-based failures exit via sys.exit(1); dropped jobs raise RuntimeError. Pass exit_on_failure=False for interactive / exploratory runs where you want failures surfaced as warnings + error metadata instead of a non-zero exit (RES-594).
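The difference between the "mode", "majority", and "mean_std" consensus rules named for llm_jury() can be illustrated with plain Python. The functions below are hypothetical and operate on bare labels and scores rather than JuryVote objects; tie-breaking, the None result when no strict majority exists, and the use of the population standard deviation are all assumptions, not documented behaviour.

```python
from __future__ import annotations

import statistics
from collections import Counter


def aggregate_mode(labels: list[str]) -> str:
    """Most common label; ties go to the label seen first (assumption)."""
    return Counter(labels).most_common(1)[0][0]


def aggregate_majority(labels: list[str]) -> str | None:
    """Label held by strictly more than half the votes, else None (assumption)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count * 2 > len(labels) else None


def aggregate_mean_std(scores: list[float]) -> tuple[float, float]:
    """Mean and population standard deviation of numeric votes (assumption)."""
    return statistics.fmean(scores), statistics.pstdev(scores)


votes = ["pass", "fail", "pass", "unsure"]
mode_verdict = aggregate_mode(votes)          # plurality is enough
majority_verdict = aggregate_majority(votes)  # 2 of 4 is not more than 50%
```

On this panel the plurality rule settles on "pass" while the strict-majority rule does not reach a verdict, which is the practical reason to choose between the two.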
Bug Fixes
- safe_substitute() dict keys were broken by the Ruff RUF027 auto-fix in attack_generator, capability_classifier, and objective_generator: LLM prompts were receiving unsubstituted {placeholder} text, silently producing degraded attacks
- generate_recommendations=True now correctly uses llm_config.evaluator.client before falling back to create_async_llm_client()
- All hardcoded timeout literals (240_000, 90_000) replaced with config-driven values from LLMConfig / DEFAULT_TARGET_TIMEOUT_MS
- OpenAITargetFactory now propagates max_tokens and timeout_ms to created targets
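The safe_substitute() failure mode is easy to reproduce in isolation. The sketch below assumes a helper that replaces literal "{name}" keys and leaves unknown placeholders untouched; safe_substitute_sketch is a hypothetical stand-in, not the package's function. RUF027 targets string literals that look like they are missing an f prefix, so an auto-fix on a dict key turns the literal placeholder into an interpolated value.

```python
from __future__ import annotations


def safe_substitute_sketch(template: str, values: dict[str, str]) -> str:
    """Replace each literal key found in the template; leave the rest untouched."""
    for placeholder, replacement in values.items():
        template = template.replace(placeholder, replacement)
    return template


objective = "extract the system prompt"
template = "Attack objective: {objective}"

# Intended: the key is the literal text "{objective}".
good = safe_substitute_sketch(template, {"{objective}": objective})

# After an f-string auto-fix the key is interpolated first, so it never matches
# and the placeholder survives into the prompt without any error.
bad = safe_substitute_sketch(template, {f"{objective}": objective})
```

Because the helper is deliberately forgiving about unknown keys, the broken call raises nothing, which matches the "silently producing degraded attacks" description.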
Internal
- SaveMode converted from Literal to StrEnum
- Timeout defaults centralised in contracts.py (DEFAULT_TARGET_TIMEOUT_MS = 240_000); PIPELINE_CONFIG import removed from openai.py and registry.py
- MultiTurnOrchestrator.llm_kwargs constructor param deprecated: merged into _cfg.attacker.extra_kwargs at init time; use LLMCallConfig.extra_kwargs instead
- RUF027 added to Ruff ignore list (intentional literal string keys used as safe_substitute template placeholders)
- CLI --save flag migrated to typer.Choice
- Ruff cleanup across all redteam modules (import sorting, Optional[X] → X | None, TYPE_CHECKING guards)
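A millisecond timeout setting combined with asyncio.wait_for, as mentioned for OpenAIModelTarget.send_prompt, can be sketched as below. call_with_timeout, slow_target, and demo are hypothetical names for this example; only asyncio.wait_for and the 240_000 default value come from the notes above.

```python
import asyncio

DEFAULT_TARGET_TIMEOUT_MS = 240_000  # value quoted in the changelog


async def call_with_timeout(coro, timeout_ms: int = DEFAULT_TARGET_TIMEOUT_MS):
    """Await coro, raising asyncio.TimeoutError after timeout_ms milliseconds."""
    # asyncio.wait_for takes seconds, so convert from milliseconds.
    return await asyncio.wait_for(coro, timeout=timeout_ms / 1000)


async def slow_target() -> str:
    await asyncio.sleep(0.2)  # simulated slow model call
    return "done"


async def demo() -> str:
    try:
        return await call_with_timeout(slow_target(), timeout_ms=10)
    except asyncio.TimeoutError:
        return "timed out"


result = asyncio.run(demo())
```

wait_for cancels the awaited coroutine when the deadline passes, so a stalled call does not keep running in the background.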
Breaking Changes (RES-877)
- AgentTarget.send_prompt removed: respond(messages: list[Message]) -> AgentResponse is now the sole response method on every target; callers own the conversation transcript. Migrate target.send_prompt("x") to target.respond([Message(role="user", content="x")]).
- OpenAIModelTarget, VercelAISdkTarget, and OpenAIAgentTarget are now stateless: per-instance _history is gone. Multi-turn conversation state is owned by the red-team orchestrator, not the target.
- evaluatorq.redteam.ErrorInfo renamed to RunError: update any imports or isinstance checks that reference the old name.
Migration:
# Before
response = await target.send_prompt("Hello")
# After
from evaluatorq.contracts import Message
response = await target.respond([Message(role="user", content="Hello")])
New Features (RES-877)
- AgentResponseError: a per-response error marker exposed on AgentResponse.error; used by the orchestrator to exclude failed turns from the replayed transcript.
- turns_to_messages(turns, *, skip_errors=False): helper exported from evaluatorq.redteam.contracts that converts a list of completed turns into a flat list[Message], optionally dropping turns whose response carries an AgentResponseError.
- classify_error_type(error, *, existing_type=None): exported from evaluatorq.redteam.contracts; infers a coarse error_type (content_filter, rate_limit, timeout, network_error, server_error, client_error, or unknown) from an error string. Shared by the orchestrator and report converters. On a per-response AgentResponseError, the orchestrator records an unmatched (unknown) result as target_error, so that field never carries unknown.
- Tool-call fidelity on replay: the transcript replayed to a target now preserves assistant tool_calls and tool results across turns (OpenAIModelTarget as OpenAI chat params, VercelAISdkTarget as AI SDK CoreMessage tool-call / tool-result parts, OpenAIAgentTarget as Responses-API function_call / function_call_output items), so multi-turn tool-using agents see their prior tool context. VercelAISdkTarget accepts message_format="v5" (default) or "v4" to match the endpoint's AI SDK version (input / output: {type,value} vs args / result). Errored turns recorded by the orchestrator now carry a classified AgentResponseError.error_type instead of a flat target_error.
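The flattening that turns_to_messages performs can be pictured with the sketch below. The Turn dataclass is a hypothetical shape invented for the example (the real turn objects carry more than a prompt, a response, and an error string), and turns_to_messages_sketch is not the exported helper; it only mirrors the documented skip_errors behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Message:  # stand-in for evaluatorq.contracts.Message
    role: str
    content: str


@dataclass
class Turn:  # hypothetical shape of a completed turn
    prompt: str
    response: str
    error: str | None = None


def turns_to_messages_sketch(
    turns: list[Turn], *, skip_errors: bool = False
) -> list[Message]:
    """Flatten turns into user/assistant messages, optionally dropping errored turns."""
    messages: list[Message] = []
    for turn in turns:
        if skip_errors and turn.error is not None:
            continue  # errored turns are left out of the replayed transcript
        messages.append(Message("user", turn.prompt))
        messages.append(Message("assistant", turn.response))
    return messages


turns = [Turn("a", "b"), Turn("c", "", error="rate_limit"), Turn("d", "e")]
replay = turns_to_messages_sketch(turns, skip_errors=True)
```

Dropping the errored turn keeps the replayed transcript free of empty assistant replies that the target never actually produced.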
Internal (RES-899)
- Unified tracing layer: the generic OTel span-recording helpers previously duplicated across redteam/tracing.py and simulation/tracing.py now live in a single evaluatorq.common.tracing module (truncate_for_span, capture_message_content, record_token_usage, record_llm_response, record_llm_input/output, set_span_attrs, get_trace_context_headers). Domain-specific span builders (with_redteam_span, with_simulation_span, with_llm_span) stay in their domain modules and import the shared helpers. The common module never imports from redteam, simulation, or openresponses.
Changed (RES-899)
- Span PII gate env var renamed to EVALUATORQ_CAPTURE_MESSAGE_CONTENT (default true), replacing the previous OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT. The same name now gates both the Python and TypeScript simulation/red-team tracing layers. Set false / 0 to keep raw prompt and response text off spans (token usage, model, finish reason, and latency are still recorded).
- Span text truncation defaults to capture-all in both Python and TypeScript. EVALUATORQ_SPAN_MAX_TEXT_CHARS is unset by default (no truncation); set a positive integer (canonical: 8192) to cap input and output message content, with the shared "... [truncated]" marker. -1 / 0 / unset all mean capture all. The TypeScript path previously hardcoded a separate 2000-char cap with a "…" marker; both are gone.
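The content gate can be read as a simple boolean switch on the environment. The sketch below is an assumption-laden illustration: capture_message_content_enabled and build_span_attrs are hypothetical names, the attribute keys are made up, and treating the value case-insensitively (with anything other than false / 0 meaning enabled) is a guess at the parsing rules.

```python
import os


def capture_message_content_enabled(env=os.environ) -> bool:
    """True unless EVALUATORQ_CAPTURE_MESSAGE_CONTENT is set to false or 0."""
    raw = env.get("EVALUATORQ_CAPTURE_MESSAGE_CONTENT")
    if raw is None:
        return True  # documented default
    return raw.strip().lower() not in {"false", "0"}


def build_span_attrs(prompt: str, model: str, env=os.environ) -> dict:
    """Always record metadata; attach raw text only when the gate is open."""
    attrs = {"model": model}
    if capture_message_content_enabled(env):
        attrs["input"] = prompt
    return attrs
```

Passing a plain dict as env makes the gate easy to exercise in tests without touching the process environment.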
Fixed (RES-899)
- retry_statuses augments the default set again: passing a custom set (e.g. {429}) no longer silently drops the built-in 429 + 5xx retries; the custom statuses are added to the defaults, not substituted for them. (This restores the intended RES-897 review behavior, which was lost when #150 merged without the fix.)
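The "added to the defaults, not substituted" rule is a set union. In the sketch below, DEFAULT_RETRY_STATUSES and effective_retry_statuses are hypothetical names, and reading "5xx" as every status from 500 to 599 is an assumption about the built-in set.

```python
from __future__ import annotations

# Assumption: the built-in set is 429 plus every 5xx status.
DEFAULT_RETRY_STATUSES = frozenset({429, *range(500, 600)})


def effective_retry_statuses(custom: set[int] | None = None) -> frozenset[int]:
    """Union the caller's statuses with the defaults instead of replacing them."""
    return DEFAULT_RETRY_STATUSES | frozenset(custom or ())


with_timeout_retry = effective_retry_statuses({408})
```

Here a caller who adds 408 still gets retries on 429 and 503, which is the behaviour the fix restores.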