`evaluatorq`

EvaluatorQ Python - An evaluation framework for LLM applications.

`DataPointInput = DataPoint | DataPointDict` `module-attribute`

Type alias for data point inputs. Accepts either a DataPoint model instance or a plain dict (DataPointDict).

`EvaluationResultCellValue = str | int | float | dict[str, str | float | dict[str, str | float]]` `module-attribute`

`EvaluatorqResult = list[DataPointResult]` `module-attribute`

Type alias for evaluation results.

`Job = Callable[[DataPoint, int], Awaitable[dict[str, Any]]]` `module-attribute`

Job function type: an async callable that takes a DataPoint and an int (the `row` argument in the examples on this page) and returns a dict with 'name' and 'output' keys.

`Output = str | int | float | bool | dict[str, Any] | None` `module-attribute`

Type alias for the values a job function can return as its output.

`Scorer = Callable[[ScorerParameter], Awaitable[EvaluationResult | dict[str, Any]]]` `module-attribute`

`DataPoint`

Bases: BaseModel

A data point for evaluation.

Parameters:

Name	Type	Description	Default
`inputs`		The inputs to pass to the job.	required
`expected_output`		The expected output of the data point. Used for evaluation and comparing the output of the job.	required

`DataPointDict`

Bases: _DataPointDictRequired

Dict representation of a DataPoint for type checking.

`DataPointResult`

Bases: BaseModel

`DatasetIdInput`

Bases: BaseModel

Input for fetching a dataset from the Orq platform.

`DeploymentResponse` `dataclass`

Response from a deployment invocation.

`content` `instance-attribute`

The text content of the response.

`raw` `instance-attribute`

The raw response from the API.

`EvaluationResult`

Bases: BaseModel

`EvaluationResultCell`

Bases: BaseModel

`Evaluator`

Bases: TypedDict

`EvaluatorParams`

Bases: BaseModel

Parameters for running an evaluation.

Parameters:

Name	Description	Default
`data`	The data to evaluate. A DatasetIdInput to fetch from the Orq platform, an ExperimentInput to replay an experiment's recorded responses (requires inference=False), or a list of DataPoint instances/awaitables.	required
`jobs`	The jobs to run on the data.	required
`evaluators`	The evaluators to use. If not provided, only jobs will run.	optional
`parallelism`	Number of jobs to run in parallel. Defaults to 1 (sequential).	`1`
`print_results`	Whether to print the results table to the console. Defaults to True. Also accepts "print" as an alias.	`True`
`description`	Optional description for the evaluation run.	optional
`path`	Optional path (e.g. "MyProject/MyFolder") to place the experiment in a specific project and folder on the Orq platform.	optional
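To make the `parallelism` setting concrete, here is a minimal sketch of capping concurrent jobs using only `asyncio`. It is not evaluatorq's implementation: `run_with_parallelism` and `fake_job` are hypothetical names, and using a semaphore is an assumption about one common way to enforce such a limit.

```python
import asyncio


async def run_with_parallelism(items, worker, parallelism=1):
    """Run worker(item, row) for every item, at most `parallelism` at a time.

    Hypothetical helper for illustration; not part of evaluatorq.
    """
    semaphore = asyncio.Semaphore(parallelism)

    async def guarded(item, row):
        async with semaphore:  # waits while `parallelism` workers are active
            return await worker(item, row)

    # asyncio.gather returns results in the same order as its inputs
    return await asyncio.gather(
        *(guarded(item, row) for row, item in enumerate(items))
    )


# Track peak concurrency to check that the cap is respected.
stats = {"active": 0, "peak": 0}


async def fake_job(item, row):
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # stand-in for a slow model call
    stats["active"] -= 1
    return {"name": "fake-job", "output": item.upper()}


results = asyncio.run(
    run_with_parallelism(["a", "b", "c", "d"], fake_job, parallelism=2)
)
```

With `parallelism=1` the same helper would process the items strictly one after another, which matches the "sequential" default described in the table.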

`inference = True` `class-attribute` `instance-attribute`

When False, skip generation and evaluate the pre-recorded response in each row's messages column instead of running jobs.

`EvaluatorScore`

Bases: BaseModel

`ExperimentInput`

Bases: BaseModel

Input for sourcing pre-recorded responses from an Orq experiment.

Used with inference=False to re-run evaluators against the responses an earlier experiment already produced, without regenerating them.

`experiment_id` `instance-attribute`

The experiment ID to load responses from. Read it off the experiment URL in the Orq UI (/experiments/<experiment_id>). The API refers to experiments as "spreadsheets", so you will also see this ID in /v2/spreadsheets/<id> routes.

`run_id = None` `class-attribute` `instance-attribute`

A specific run ID (a "manifest" in the API). When omitted, the latest run is used. Every execution of an experiment creates a new run; open it from the experiment's run history to read its ID from the URL.

`JobResult`

Bases: BaseModel

`JobReturn`

Bases: TypedDict

Job return structure

`MessageDict`

Bases: TypedDict

Chat message structure compatible with Orq SDK.

`ScorerParameter`

Bases: TypedDict

Parameters passed to a scorer function.

- `data`: The data point being evaluated.
- `output`: The output produced by the job for the data point.
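As an illustration of how a scorer might consume these two keys, the sketch below defines local stand-ins for `DataPoint` and `ScorerParameter` so it runs without the library. The scorer name `within_length` and the `"value"` and `"explanation"` keys of the returned dict are assumptions made for this example; the fields evaluatorq actually expects are not listed in this reference.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, TypedDict


@dataclass
class DataPoint:  # stand-in for evaluatorq.DataPoint
    inputs: dict[str, Any]
    expected_output: Any = None


class ScorerParameter(TypedDict):  # stand-in with the two documented keys
    data: DataPoint
    output: Any


async def within_length(params: ScorerParameter) -> dict[str, Any]:
    """Pass when the job output is no longer than the input text."""
    source = params["data"].inputs["text"]
    output = str(params["output"])
    passed = len(output) <= len(source)
    return {
        "value": passed,  # assumed key name
        "explanation": f"{len(output)} chars vs {len(source)} in the input",  # assumed key name
    }


point = DataPoint(inputs={"text": "a fairly long sentence"}, expected_output=None)
result = asyncio.run(within_length({"data": point, "output": "short"}))
```

The scorer is a coroutine because the `Scorer` alias is declared as returning an `Awaitable`.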

`ThreadConfig`

Bases: TypedDict

Thread configuration for conversation tracking.

`exact_match_evaluator(*, case_insensitive=False, name='exact-match')`

Creates an evaluator that checks if the output exactly matches the expected output. Uses the data.expected_output from the dataset to compare against.

Parameters:

Name	Type	Description	Default
`case_insensitive`	`bool`	Whether the comparison should be case-insensitive	`False`
`name`	`str`	Optional name for the evaluator	`'exact-match'`

Returns:

Type	Description
`Evaluator`	An Evaluator that checks if output exactly matches expected output

Example

# Basic usage (case-sensitive)
evaluator = exact_match_evaluator()

# With case-insensitive matching
loose_evaluator = exact_match_evaluator(case_insensitive=True)
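The effect of `case_insensitive` can be sketched with a plain comparison function. This is a hypothetical stand-in, not the evaluator's actual code: the name `exact_match` and the choice to coerce both values with `str()` are assumptions made for illustration.

```python
def exact_match(output, expected, case_insensitive=False):
    """Return True when output and expected are the same string.

    Hypothetical stand-in for the comparison an exact-match evaluator performs.
    """
    left, right = str(output), str(expected)  # str() coercion is an assumption
    if case_insensitive:
        left, right = left.lower(), right.lower()
    return left == right


strict = exact_match("Paris", "paris")
loose = exact_match("Paris", "paris", case_insensitive=True)
```

Here `strict` is `False` and `loose` is `True`, mirroring the case-sensitive default shown in the signature.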

`invoke(key, inputs=None, context=None, metadata=None, thread=None, messages=None)` `async`

Invoke an Orq deployment and return just the text content. This is a convenience wrapper around deployment() for simple use cases.

Parameters:

Name	Type	Description	Default
`key`	`str`	The deployment key (name)	required
`inputs`	`dict[str, object] \| None`	Input variables for the deployment template	`None`
`context`	`dict[str, object] \| None`	Context attributes for routing	`None`
`metadata`	`dict[str, object] \| None`	Metadata to attach to the request	`None`
`thread`	`ThreadConfig \| None`	Thread configuration for conversation tracking. Must include 'id' key.	`None`
`messages`	`list[MessageDict] \| None`	Chat messages for conversational deployments	`None`

Returns:

Type	Description
`str`	The text content of the response

Example

# In a job
@job("my-job")
async def my_job(data, row):
    return await invoke("summarizer", inputs=data.inputs)

`job(name, fn=None)`

job(
    name: str,
) -> Callable[
    [
        Callable[
            [DataPoint, int], Awaitable[Output] | Output
        ]
    ],
    Job,
]

job(
    name: str,
    fn: Callable[
        [DataPoint, int], Awaitable[Output] | Output
    ],
) -> Job

Helper function/decorator to create a named job that ensures the job name is preserved even when errors occur during execution.

This wrapper:

- Automatically formats the return value as {"name": ..., "output": ...}
- Attaches the job name to errors for better error tracking
- Can be used as a decorator (@job("name")) or function (job("name", fn))

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the job	required
`fn`	`Callable[[DataPoint, int], Awaitable[Output] \| Output] \| None`	The job function that returns the output (optional when used as decorator)	`None`

Returns:

Type	Description
`Job \| Callable[[Callable[[DataPoint, int], Awaitable[Output] \| Output]], Job]`	A Job function that always includes the job name

Example

# As a decorator:
@job("text-analyzer")
async def analyze_text(data: DataPoint, row: int):
    return {"length": len(data.inputs["text"])}

# As a function wrapper:
my_job = job("my-job", async_function)

# With lambda for simple cases:
uppercase_job = job("uppercase", lambda data, row: data.inputs["text"].upper())
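The three behaviours described for this helper can be sketched as a small wrapper. The code below is a simplified stand-in written for illustration, not evaluatorq's source: the `job_name` attribute set on the exception is a hypothetical way of attaching the name to errors, and plain dicts are used in place of `DataPoint` so the block runs on its own.

```python
import asyncio
import inspect


def job(name, fn=None):
    """Simplified stand-in for a named-job helper; not evaluatorq's source."""

    def wrap(func):
        async def wrapper(data, row):
            try:
                result = func(data, row)
                if inspect.isawaitable(result):  # accept sync and async functions
                    result = await result
            except Exception as exc:
                exc.job_name = name  # hypothetical attribute for error tracking
                raise
            return {"name": name, "output": result}

        return wrapper

    # job("name") returns a decorator; job("name", fn) wraps fn directly
    return wrap if fn is None else wrap(fn)


@job("text-length")
async def text_length(data, row):
    return len(data["text"])


shout = job("shout", lambda data, row: data["text"].upper())

length_result = asyncio.run(text_length({"text": "hello"}, 0))
shout_result = asyncio.run(shout({"text": "hello"}, 1))
```

Both call styles produce the same dict shape, for example `{"name": "text-length", "output": 5}` for `length_result`.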

`string_contains_evaluator(case_insensitive=True, name='string-contains')`

Creates an evaluator that checks if the output contains the expected output. Uses the data.expected_output from the dataset to compare against.

Parameters:

Name	Type	Description	Default
`case_insensitive`	`bool`	Whether the comparison should be case-insensitive	`True`
`name`	`str`	Optional name for the evaluator	`'string-contains'`

Returns:

Type	Description
`Evaluator`	An Evaluator that checks if output contains expected output

Example

# Basic usage
evaluator = string_contains_evaluator()

# With case-sensitive matching
strict_evaluator = string_contains_evaluator(case_insensitive=False)

# With custom name
my_evaluator = string_contains_evaluator(name="my-contains-check")
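A substring check with the same default can be sketched as follows. As before, this is a hypothetical stand-in rather than the library's code: the function name `string_contains` and the `str()` coercion are assumptions. Note that, per the signature above, this check is case-insensitive unless told otherwise.

```python
def string_contains(output, expected, case_insensitive=True):
    """Return True when expected occurs somewhere inside output.

    Hypothetical stand-in for the check a string-contains evaluator performs.
    """
    haystack, needle = str(output), str(expected)  # str() coercion is an assumption
    if case_insensitive:
        haystack, needle = haystack.lower(), needle.lower()
    return needle in haystack


default_check = string_contains("The capital is PARIS.", "Paris")
strict_check = string_contains("The capital is PARIS.", "Paris", case_insensitive=False)
```

Here `default_check` is `True` and `strict_check` is `False`, since the strict variant no longer ignores the difference between `PARIS` and `Paris`.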
