Getting Started
Your first evaluation in five minutes. No dataset, no deployment — just local data and a local scorer.
Install
The mental model
An evaluation has three parts:
- DataPoint: one row of input plus its expected_output.
- @job: an async function that turns a DataPoint into an output (your model call, your agent, or, as here, a trivial transform).
- Evaluator: a scorer that compares the output against the expectation and returns pass/fail.
evaluatorq(...) runs every job over every datapoint in parallel and applies each evaluator to the results.
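To make that fan-out concrete, here is a minimal, library-free sketch of the loop. It illustrates the mental model only and is not evaluatorq's implementation: Row, run_all, upper, and contains are hypothetical names, and the semaphore-based concurrency cap is an assumption about how a parallelism limit might behave.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class Row:
    """Hypothetical stand-in for a DataPoint: inputs plus an expectation."""
    inputs: dict[str, Any]
    expected_output: str


async def run_all(rows, jobs, scorers, parallelism=3):
    """Run every job over every row, then apply every scorer to each output."""
    limit = asyncio.Semaphore(parallelism)  # assumed: caps concurrent job calls

    async def run_one(job, row, index):
        async with limit:
            output = await job(row, index)
        # Each scorer returns True (pass) or False (fail) for this output.
        return {"row": index, "output": output,
                "passed": [score(output, row) for score in scorers]}

    tasks = [run_one(job, row, i) for job in jobs for i, row in enumerate(rows)]
    return await asyncio.gather(*tasks)  # results come back in task order


async def upper(row, _index):
    # Trivial job: uppercase the "text" input.
    return str(row.inputs.get("text", "")).upper()


def contains(output, row):
    # Trivial scorer: pass when the expectation appears in the output.
    return row.expected_output in output


rows = [
    Row({"text": "hello world"}, "HELLO"),
    Row({"text": "python is great"}, "JAVA"),  # deliberately failing row
]
results = asyncio.run(run_all(rows, [upper], [contains]))
```

The second row is built to fail, so results holds one passing and one failing entry. The real library's result object may look quite different.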
flowchart LR
D["DataPoint"]
J["@job"]
O["output"]
E["Evaluator"]
P["pass / fail"]
D --> J --> O --> E --> P
A first evaluation
import asyncio

from evaluatorq import DataPoint, evaluatorq, job, string_contains_evaluator


@job("uppercase-converter")
async def uppercase_job(data: DataPoint, _row: int) -> str:
    # Trivial transform standing in for a model or agent call.
    return str(data.inputs.get("text", "")).upper()


async def run():
    # Local data: each DataPoint pairs an input with its expected output.
    data = [
        DataPoint(inputs={"text": "hello world"}, expected_output="HELLO"),
        DataPoint(inputs={"text": "python is great"}, expected_output="PYTHON"),
        DataPoint(inputs={"text": "evaluatorq rocks"}, expected_output="EVALUATORQ"),
    ]
    return await evaluatorq(
        "simple-local-eval",
        data=data,
        jobs=[uppercase_job],
        evaluators=[string_contains_evaluator()],
        parallelism=3,
        print_results=True,
    )


if __name__ == "__main__":
    asyncio.run(run())
Run it:
print_results=True renders a pass/fail table in the terminal. In this example, string_contains_evaluator() checks whether the job output contains the expected_output, so HELLO WORLD satisfies an expected output of HELLO. Wire that pass/fail signal into CI to gate on quality regressions.
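The containment check and the CI gate can be sketched in plain Python. This is an illustration under stated assumptions, not evaluatorq's code: contains_expected and exit_code are hypothetical helpers, the check is assumed to be case-sensitive, and the list of booleans stands in for whatever result object evaluatorq(...) actually returns.

```python
def contains_expected(output: str, expected: str) -> bool:
    """Pass when the expected text appears anywhere in the output (assumed case-sensitive)."""
    return expected in output


def exit_code(passed: list[bool]) -> int:
    """Map pass/fail flags to a process exit code: 0 if all passed, 1 otherwise."""
    return 0 if all(passed) else 1


# Outputs the uppercase job would produce for the three example rows.
outputs = ["HELLO WORLD", "PYTHON IS GREAT", "EVALUATORQ ROCKS"]
expected = ["HELLO", "PYTHON", "EVALUATORQ"]

passed = [contains_expected(o, e) for o, e in zip(outputs, expected)]
code = exit_code(passed)
```

In a CI script, passing that code to sys.exit(...) would fail the build as soon as any row fails.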
Where to next
- Agent Simulation: score multi-turn conversations.
- Red Teaming: adversarial security testing.
- Configuration: API keys and environment variables for Orq/OpenAI backends.
- Examples: datasets, structured scoring, integrations.
- API Reference: evaluatorq, DataPoint, job, evaluators.