
evaluatorq

EvaluatorQ Python - An evaluation framework for LLM applications.

DataPointInput = DataPoint | DataPointDict module-attribute

Type alias for DataPoint that accepts both model instances and dicts.

EvaluationResultCellValue = str | int | float | dict[str, str | float | dict[str, str | float]] module-attribute

EvaluatorqResult = list[DataPointResult] module-attribute

Type alias for evaluation results

Job = Callable[[DataPoint, int], Awaitable[dict[str, Any]]] module-attribute

Job function type - returns a dict with 'name' and 'output' keys

Output = str | int | float | bool | dict[str, Any] | None module-attribute

Output type alias

Scorer = Callable[[ScorerParameter], Awaitable[EvaluationResult | dict[str, Any]]] module-attribute

DataPoint

Bases: BaseModel

A data point for evaluation.

Parameters:

Name Type Description Default
inputs

The inputs to pass to the job.

required
expected_output

The expected output of the data point. Evaluators compare the output of the job against this value.

required

DataPointDict

Bases: _DataPointDictRequired

Dict representation of a DataPoint for type checking.

DataPointResult

Bases: BaseModel

DatasetIdInput

Bases: BaseModel

Input for fetching a dataset from Orq platform.

DeploymentResponse dataclass

Response from a deployment invocation.

content instance-attribute

The text content of the response

raw instance-attribute

The raw response from the API

EvaluationResult

Bases: BaseModel

EvaluationResultCell

Bases: BaseModel

Evaluator

Bases: TypedDict

EvaluatorParams

Bases: BaseModel

Parameters for running an evaluation.

Parameters:

Name Type Description Default
data

The data to evaluate. A DatasetIdInput to fetch from Orq platform, an ExperimentInput to replay an experiment's recorded responses (requires inference=False), or a list of DataPoint instances/awaitables.

required
jobs

The jobs to run on the data.

required
evaluators

The evaluators to use. If not provided, only jobs will run.

required
parallelism

Number of jobs to run in parallel. Defaults to 1 (sequential).

required
print_results

Whether to print results table to console. Defaults to True. Also accepts "print" as an alias.

required
description

Optional description for the evaluation run.

required
path

Optional path (e.g. "MyProject/MyFolder") to place the experiment in a specific project and folder on the Orq platform.

required
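The parallelism setting described above caps how many jobs run at once. How the library enforces that cap is not stated on this page; the following is only a minimal sketch of one common approach, using an asyncio.Semaphore. The names run_with_parallelism and fake_job are hypothetical and are not part of evaluatorq.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_parallelism(
    factories: list[Callable[[], Awaitable[T]]], parallelism: int = 1
) -> list[T]:
    """Run coroutine factories with at most `parallelism` in flight at once."""
    semaphore = asyncio.Semaphore(parallelism)

    async def guarded(factory: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:  # waits while `parallelism` tasks are already running
            return await factory()

    # gather() returns results in the order the factories were given
    return await asyncio.gather(*(guarded(f) for f in factories))


async def fake_job(row: int) -> dict:
    # Hypothetical job body; the sleep stands in for a model call.
    await asyncio.sleep(0.01)
    return {"name": "fake-job", "output": row * 2}


results = asyncio.run(
    run_with_parallelism([lambda r=r: fake_job(r) for r in range(4)], parallelism=2)
)
```

With parallelism=1 the same helper degenerates to sequential execution, which matches the documented default.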

inference = True class-attribute instance-attribute

When False, skip generation and evaluate the pre-recorded response in each row's messages column instead of running jobs.

EvaluatorScore

Bases: BaseModel

ExperimentInput

Bases: BaseModel

Input for sourcing pre-recorded responses from an Orq experiment.

Used with inference=False to re-run evaluators against the responses an earlier experiment already produced, without regenerating them.

experiment_id instance-attribute

The experiment ID to load responses from. Read it off the experiment URL in the Orq UI (/experiments/<experiment_id>). The API refers to experiments as "spreadsheets", so you will also see this ID in /v2/spreadsheets/<id> routes.

run_id = None class-attribute instance-attribute

A specific run ID (a "manifest" in the API). When omitted, the latest run is used. Every execution of an experiment creates a new run; open it from the experiment's run history to read its ID from the URL.

JobResult

Bases: BaseModel

JobReturn

Bases: TypedDict

Job return structure

MessageDict

Bases: TypedDict

Chat message structure compatible with Orq SDK.

ScorerParameter

Bases: TypedDict

Parameters passed to a scorer function.

data: The data point being evaluated.
output: The output produced by the job for the data point.
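To make the scorer shape concrete, here is a minimal sketch of an async function that receives a mapping with the two documented keys, data and output, and returns a plain dict. ScorerParameterSketch and length_within_limit are hypothetical stand-ins, and the "value" and "explanation" keys in the returned dict are assumptions; this page does not list the fields of EvaluationResult.

```python
import asyncio
from typing import Any, TypedDict


class ScorerParameterSketch(TypedDict):
    """Stand-in for ScorerParameter with only the two documented keys."""

    data: Any    # the data point being evaluated
    output: Any  # the output produced by the job for that data point


async def length_within_limit(params: ScorerParameterSketch) -> dict[str, Any]:
    # Hypothetical scorer: passes when the output is a string of at most 20 characters.
    output = params["output"]
    passed = isinstance(output, str) and len(output) <= 20
    # Assumed result keys; check the EvaluationResult model for the real field names.
    return {"value": passed, "explanation": f"length check passed: {passed}"}


result = asyncio.run(length_within_limit({"data": None, "output": "short answer"}))
```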

ThreadConfig

Bases: TypedDict

Thread configuration for conversation tracking.

exact_match_evaluator(*, case_insensitive=False, name='exact-match')

Creates an evaluator that checks whether the output exactly matches the expected output. The comparison target is data.expected_output from the dataset.

Parameters:

Name Type Description Default
case_insensitive bool

Whether the comparison should be case-insensitive

False
name str

Optional name for the evaluator

'exact-match'

Returns:

Type Description
Evaluator

An Evaluator that checks if output exactly matches expected output

Example

# Basic usage (case-sensitive)
evaluator = exact_match_evaluator()

# With case-insensitive matching
loose_evaluator = exact_match_evaluator(case_insensitive=True)
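The effect of case_insensitive can be illustrated with a small standalone function. This is only a sketch of the comparison as described above, not the library's implementation: the helper name exact_match is hypothetical, and converting both values with str() is an assumption, since this page does not say how non-string outputs or surrounding whitespace are handled.

```python
def exact_match(output: object, expected: object, case_insensitive: bool = False) -> bool:
    # Assumption: both sides are compared as strings.
    left, right = str(output), str(expected)
    if case_insensitive:
        left, right = left.lower(), right.lower()
    return left == right


strict = exact_match("Paris", "paris")                        # case-sensitive comparison
loose = exact_match("Paris", "paris", case_insensitive=True)  # case folded before comparing
```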

invoke(key, inputs=None, context=None, metadata=None, thread=None, messages=None) async

Invoke an Orq deployment and return just the text content. This is a convenience wrapper around deployment() for simple use cases.

Parameters:

Name Type Description Default
key str

The deployment key (name)

required
inputs dict[str, object] | None

Input variables for the deployment template

None
context dict[str, object] | None

Context attributes for routing

None
metadata dict[str, object] | None

Metadata to attach to the request

None
thread ThreadConfig | None

Thread configuration for conversation tracking. Must include 'id' key.

None
messages list[MessageDict] | None

Chat messages for conversational deployments

None

Returns:

Type Description
str

The text content of the response

Example

# In a job
@job("my-job")
async def my_job(data, row):
    return await invoke("summarizer", inputs=data.inputs)

job(name, fn=None)

job(
    name: str,
) -> Callable[
    [
        Callable[
            [DataPoint, int], Awaitable[Output] | Output
        ]
    ],
    Job,
]
job(
    name: str,
    fn: Callable[
        [DataPoint, int], Awaitable[Output] | Output
    ],
) -> Job

Helper function/decorator to create a named job that ensures the job name is preserved even when errors occur during execution.

This wrapper:

- Automatically formats the return value as {"name": ..., "output": ...}
- Attaches the job name to errors for better error tracking
- Can be used as a decorator (@job("name")) or function (job("name", fn))

Parameters:

Name Type Description Default
name str

The name of the job

required
fn Callable[[DataPoint, int], Awaitable[Output] | Output] | None

The job function that returns the output (optional when used as decorator)

None

Returns:

Type Description
Job | Callable[[Callable[[DataPoint, int], Awaitable[Output] | Output]], Job]

A Job function that always includes the job name

Example
# As a decorator:
@job("text-analyzer")
async def analyze_text(data: DataPoint, row: int):
    return {"length": len(data.inputs["text"])}

# As a function wrapper:
my_job = job("my-job", async_function)

# With lambda for simple cases:
uppercase_job = job("uppercase", lambda data, row: data.inputs["text"].upper())
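The three behaviors listed for the wrapper can be sketched without the library. The code below is a hypothetical stand-in named named_job, not the source of job(); in particular, storing the name on the exception as a job_name attribute is an assumption about how "attaches the job name to errors" might work. A SimpleNamespace with an inputs attribute stands in for a DataPoint.

```python
import asyncio
import inspect
from types import SimpleNamespace
from typing import Any, Awaitable, Callable, Optional

JobFn = Callable[[Any, int], Any]  # sync or async function of (data, row)


def named_job(name: str, fn: Optional[JobFn] = None):
    """Hypothetical stand-in for job(); works as decorator or plain wrapper."""

    def wrap(inner: JobFn) -> Callable[[Any, int], Awaitable[dict[str, Any]]]:
        async def run(data: Any, row: int) -> dict[str, Any]:
            try:
                result = inner(data, row)
                if inspect.isawaitable(result):  # accept async functions too
                    result = await result
            except Exception as exc:
                exc.job_name = name  # assumed mechanism for tagging the error
                raise
            return {"name": name, "output": result}

        return run

    # Without fn, return the decorator; with fn, wrap it immediately.
    return wrap if fn is None else wrap(fn)


upper = named_job("uppercase", lambda data, row: data.inputs["text"].upper())


@named_job("length")
async def length(data, row):
    return len(data.inputs["text"])


point = SimpleNamespace(inputs={"text": "hi"})  # stand-in for a DataPoint
wrapped_output = asyncio.run(upper(point, 0))
```

Because the wrapper always returns a coroutine function, a synchronous lambda and an async def end up with the same calling convention.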

string_contains_evaluator(case_insensitive=True, name='string-contains')

Creates an evaluator that checks whether the output contains the expected output. The comparison target is data.expected_output from the dataset.

Parameters:

Name Type Description Default
case_insensitive bool

Whether the comparison should be case-insensitive

True
name str

Optional name for the evaluator

'string-contains'

Returns:

Type Description
Evaluator

An Evaluator that checks if output contains expected output

Example

# Basic usage
evaluator = string_contains_evaluator()

# With case-sensitive matching
strict_evaluator = string_contains_evaluator(case_insensitive=False)

# With custom name
my_evaluator = string_contains_evaluator(name="my-contains-check")
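Note that this evaluator defaults to case-insensitive matching, the opposite of exact_match_evaluator. The substring check can be sketched as below; contains_expected is a hypothetical helper, and converting both values with str() is an assumption, since the page does not describe how non-string outputs are treated.

```python
def contains_expected(output: object, expected: object, case_insensitive: bool = True) -> bool:
    # Assumption: both sides are compared as strings.
    haystack, needle = str(output), str(expected)
    if case_insensitive:
        haystack, needle = haystack.lower(), needle.lower()
    return needle in haystack


default_check = contains_expected("The capital is PARIS.", "paris")
strict_check = contains_expected("The capital is PARIS.", "paris", case_insensitive=False)
```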