evaluatorq
EvaluatorQ Python - An evaluation framework for LLM applications.
DataPointInput = DataPoint | DataPointDict module-attribute
Type alias for DataPoint that accepts both model instances and dicts.
EvaluationResultCellValue = str | int | float | dict[str, str | float | dict[str, str | float]] module-attribute
EvaluatorqResult = list[DataPointResult] module-attribute
Type alias for evaluation results.
Job = Callable[[DataPoint, int], Awaitable[dict[str, Any]]] module-attribute
Job function type. Returns a dict with 'name' and 'output' keys.
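A minimal sketch of a function with the shape the Job alias describes. It assumes a stand-in `FakeDataPoint` dataclass in place of the real `DataPoint` model so that it runs without the library installed; the job name `"word-count"` and the `text` input key are illustrative only.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeDataPoint:
    """Stand-in for DataPoint (assumption: only these two fields matter here)."""
    inputs: dict[str, Any]
    expected_output: Any = None


async def word_count_job(data: FakeDataPoint, row: int) -> dict[str, Any]:
    # A raw Job returns a dict that carries both its name and its output.
    return {"name": "word-count", "output": len(data.inputs["text"].split())}


result = asyncio.run(word_count_job(FakeDataPoint(inputs={"text": "a b c"}), 0))
print(result)
```

The second argument is the row index of the data point, which this sketch ignores.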
Output = str | int | float | bool | dict[str, Any] | None module-attribute
Output type alias.
Scorer = Callable[[ScorerParameter], Awaitable[EvaluationResult | dict[str, Any]]] module-attribute
DataPoint
Bases: BaseModel
A data point for evaluation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | | The inputs to pass to the job. | required |
| `expected_output` | | The expected output of the data point. The evaluators compare the job's output against it. | required |
DataPointDict
Bases: _DataPointDictRequired
Dict representation of a DataPoint for type checking.
DataPointResult
Bases: BaseModel
DatasetIdInput
Bases: BaseModel
Input for fetching a dataset from the Orq platform.
DeploymentResponse dataclass
EvaluationResult
Bases: BaseModel
EvaluationResultCell
Bases: BaseModel
Evaluator
Bases: TypedDict
EvaluatorParams
Bases: BaseModel
Parameters for running an evaluation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | | The data to evaluate. A DatasetIdInput to fetch from the Orq platform, an ExperimentInput to replay an experiment's recorded responses (requires inference=False), or a list of DataPoint instances/awaitables. | required |
| `jobs` | | The jobs to run on the data. | required |
| `evaluators` | | The evaluators to use. If not provided, only jobs will run. | optional |
| `parallelism` | | Number of jobs to run in parallel. A value of 1 runs them sequentially. | `1` |
| `print_results` | | Whether to print the results table to the console. Also accepts "print" as an alias. | `True` |
| `description` | | Optional description for the evaluation run. | optional |
| `path` | | Optional path (e.g. "MyProject/MyFolder") to place the experiment in a specific project and folder on the Orq platform. | optional |
inference = True class-attribute instance-attribute
When False, skip generation and evaluate the pre-recorded response in each row's messages column instead of running jobs.
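To illustrate what a `parallelism` limit means in practice, here is a sketch that caps the number of coroutines in flight with `asyncio.Semaphore`. It is not the library's implementation; `run_bounded` and `fake_job` are hypothetical names used only for this example.

```python
import asyncio


async def run_bounded(coros, parallelism: int = 1):
    """Run coroutines with at most `parallelism` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(parallelism)

    async def guarded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_job(i: int) -> int:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)  # record the highest observed concurrency
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i * 2

    results = await run_bounded([fake_job(i) for i in range(6)], parallelism=2)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

With `parallelism=1` the same helper degenerates to sequential execution, which matches the documented default.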
EvaluatorScore
Bases: BaseModel
ExperimentInput
Bases: BaseModel
Input for sourcing pre-recorded responses from an Orq experiment.
Used with inference=False to re-run evaluators against the responses an earlier experiment already produced, without regenerating them.
experiment_id instance-attribute
The experiment ID to load responses from. Read it off the experiment URL in the Orq UI (/experiments/<experiment_id>). The API refers to experiments as "spreadsheets", so you will also see this ID in /v2/spreadsheets/<id> routes.
run_id = None class-attribute instance-attribute
A specific run ID (a "manifest" in the API). When omitted, the latest run is used. Every execution of an experiment creates a new run; open it from the experiment's run history to read its ID from the URL.
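Since the experiment ID is read from a URL of the form `/experiments/<experiment_id>`, a small helper can extract it. This is a sketch under assumptions: `experiment_id_from_url` is a hypothetical helper, not part of the library, and the host and ID in the example are placeholders.

```python
from urllib.parse import urlparse


def experiment_id_from_url(url: str) -> str:
    """Return the path segment that follows 'experiments' in an experiment URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    index = segments.index("experiments")  # raises ValueError if the segment is absent
    return segments[index + 1]


# Hypothetical host and ID, for illustration only.
example_id = experiment_id_from_url("https://orq.example/experiments/exp_123?tab=runs")
print(example_id)
```

The resulting string is what the `experiment_id` field expects; leaving `run_id` unset selects the latest run.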
JobResult
Bases: BaseModel
JobReturn
Bases: TypedDict
Job return structure.
MessageDict
Bases: TypedDict
Chat message structure compatible with Orq SDK.
ScorerParameter
Bases: TypedDict
Parameters passed to a scorer function.
- data: The data point being evaluated.
- output: The output produced by the job for the data point.
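A scorer receives one mapping with the `data` and `output` keys and returns a result asynchronously. The sketch below assumes a `SimpleNamespace` stand-in for the data point, and the returned keys `value` and `explanation` are placeholders: the fields of EvaluationResult are not listed on this page, so check the model before relying on them.

```python
import asyncio
from types import SimpleNamespace
from typing import Any


async def matches_expected_length(params: dict[str, Any]) -> dict[str, Any]:
    data = params["data"]      # the data point being evaluated
    output = params["output"]  # what the job produced for that data point
    same_length = len(str(output)) == len(str(data.expected_output))
    # Assumption: "value" and "explanation" are placeholder keys, not confirmed field names.
    return {
        "value": same_length,
        "explanation": f"output has {len(str(output))} characters",
    }


# SimpleNamespace stands in for a DataPoint so the sketch runs without the library.
point = SimpleNamespace(inputs={"text": "hi"}, expected_output="HELLO")
score = asyncio.run(matches_expected_length({"data": point, "output": "hello"}))
print(score)
```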
ThreadConfig
Bases: TypedDict
Thread configuration for conversation tracking.
exact_match_evaluator(*, case_insensitive=False, name='exact-match')
Creates an evaluator that checks whether the output exactly matches the expected output. The comparison uses data.expected_output from the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `case_insensitive` | `bool` | Whether the comparison should be case-insensitive | `False` |
| `name` | `str` | Optional name for the evaluator | `'exact-match'` |
Returns:
| Type | Description |
|---|---|
| `Evaluator` | An Evaluator that checks whether the output exactly matches the expected output |
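The comparison this evaluator describes can be sketched as a plain function. This is an illustration of the documented semantics, not the library's code: `exact_match` is a hypothetical name, and converting both values to strings before comparing is an assumption.

```python
def exact_match(output, expected, *, case_insensitive: bool = False) -> bool:
    """Compare the string forms of output and expected (assumption: values are stringified)."""
    left, right = str(output), str(expected)
    if case_insensitive:
        left, right = left.casefold(), right.casefold()
    return left == right


strict = exact_match("Paris", "paris")
relaxed = exact_match("Paris", "paris", case_insensitive=True)
print(strict, relaxed)
```

Note that the default here is case-sensitive, mirroring `case_insensitive=False` in the signature.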
invoke(key, inputs=None, context=None, metadata=None, thread=None, messages=None) async
Invoke an Orq deployment and return just the text content. This is a convenience wrapper around deployment() for simple use cases.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `key` | `str` | The deployment key (name) | required |
| `inputs` | `dict[str, object] \| None` | Input variables for the deployment template | `None` |
| `context` | `dict[str, object] \| None` | Context attributes for routing | `None` |
| `metadata` | `dict[str, object] \| None` | Metadata to attach to the request | `None` |
| `thread` | `ThreadConfig \| None` | Thread configuration for conversation tracking. Must include the 'id' key. | `None` |
| `messages` | `list[MessageDict] \| None` | Chat messages for conversational deployments | `None` |
Returns:
| Type | Description |
|---|---|
| `str` | The text content of the response |
Example
# In a job
@job("my-job")
async def my_job(data, row):
    return await invoke("summarizer", inputs=data.inputs)
job(name, fn=None)
Helper function/decorator that creates a named job and ensures the job name is preserved even when errors occur during execution.
This wrapper:
- Automatically formats the return value as {"name": ..., "output": ...}
- Attaches the job name to errors for better error tracking
- Can be used as a decorator (@job("name")) or as a function (job("name", fn))
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The name of the job | required |
| `fn` | `Callable[[DataPoint, int], Awaitable[Output] \| Output] \| None` | The job function that returns the output (optional when used as a decorator) | `None` |
Returns:
| Type | Description |
|---|---|
| `Job \| Callable[[Callable[[DataPoint, int], Awaitable[Output] \| Output]], Job]` | A Job function that always includes the job name, or, when fn is omitted, a decorator that produces one |
Example
from evaluatorq import DataPoint, job

# As a decorator:
@job("text-analyzer")
async def analyze_text(data: DataPoint, row: int):
    return {"length": len(data.inputs["text"])}

# As a function wrapper (async_function is any existing async job function):
my_job = job("my-job", async_function)

# With a lambda for simple cases:
uppercase_job = job("uppercase", lambda data, row: data.inputs["text"].upper())
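The three behaviours listed for this wrapper can be sketched in a few lines. This is a hypothetical re-implementation for illustration, not the library's source: `named_job` is an invented name, and storing the job name in a `job_name` attribute on the exception is an assumption about how the name might be attached.

```python
import asyncio
import functools
import inspect
from typing import Any, Callable, Optional


def named_job(name: str, fn: Optional[Callable[..., Any]] = None):
    """Hypothetical sketch of a named-job wrapper usable as decorator or function."""

    def wrap(func: Callable[..., Any]):
        @functools.wraps(func)
        async def wrapped(data: Any, row: int) -> dict[str, Any]:
            try:
                result = func(data, row)
                if inspect.isawaitable(result):  # accept sync and async functions
                    result = await result
            except Exception as exc:
                # Assumption: the name is attached as an attribute on the exception.
                exc.job_name = name
                raise
            return {"name": name, "output": result}

        return wrapped

    # With fn given, wrap immediately; otherwise return a decorator.
    return wrap(fn) if fn is not None else wrap


@named_job("shout")
async def shout(data, row):
    return data["text"].upper()


def failing(data, row):
    raise ValueError("boom")


ok = asyncio.run(shout({"text": "hi"}, 0))

try:
    asyncio.run(named_job("broken", failing)({}, 0))
except ValueError as exc:
    failed_name = exc.job_name

print(ok, failed_name)
```

Returning either the wrapped function or the decorator from the same call is what makes both `@job("name")` and `job("name", fn)` possible with one signature.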
string_contains_evaluator(case_insensitive=True, name='string-contains')
Creates an evaluator that checks whether the output contains the expected output. The comparison uses data.expected_output from the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `case_insensitive` | `bool` | Whether the comparison should be case-insensitive | `True` |
| `name` | `str` | Optional name for the evaluator | `'string-contains'` |
Returns:
| Type | Description |
|---|---|
| `Evaluator` | An Evaluator that checks whether the output contains the expected output |
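A sketch of the substring check described here, written as a plain function. `contains_expected` is a hypothetical name, and converting both values to strings is an assumption; the library's own handling of non-string outputs may differ.

```python
def contains_expected(output, expected, *, case_insensitive: bool = True) -> bool:
    """Check whether the string form of expected occurs inside the string form of output."""
    haystack, needle = str(output), str(expected)
    if case_insensitive:
        haystack, needle = haystack.casefold(), needle.casefold()
    return needle in haystack


default_check = contains_expected("The capital is PARIS.", "paris")
strict_check = contains_expected("The capital is PARIS.", "paris", case_insensitive=False)
print(default_check, strict_check)
```

Unlike the exact-match evaluator, the documented default here is case-insensitive, so the first call succeeds without any extra argument.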