Overview
Once you’ve created the test suite, you can run it with the run() function. This will run all the tests in the suite against your model. We support three different ways to produce outputs when you run the suite:
- Stock Model: We have a set of models on our platform that you can use, from providers like OpenAI, Anthropic, Meta, etc.
- Function: You can provide us a function that takes in the input to the model (and optionally, files and context) and returns the output of your custom model.
- Provide Outputs: You can provide us a list of input/output pairs, and we will run the evaluation against these outputs directly.
1. Running with Stock Model
The code below evaluates how gpt-4o-mini performs on the tests you’ve defined.
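A minimal sketch, assuming `suite` is the test suite object created in the previous section and that run() accepts the model name via the model parameter (the exact import path and call style may differ in your SDK version):

```python
# Sketch: run the test suite against a stock model.
# `suite` is assumed to be the test suite created earlier in the docs.
run = suite.run(model="gpt-4o-mini")
```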
2. Running with Custom Function
Basic Function
You can also provide a custom function - this can contain any RAG pipelines, prompt chains, agentic behavior, etc. For example, here’s a naive model that produces output in the style of a pirate.
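The sketch below assumes the SDK calls your function with the test input as a single string and that the function is passed to run() in place of a model name; both are assumptions rather than a confirmed signature:

```python
# Sketch: a naive "pirate" model as a plain Python function.
# Assumption: the function receives the test input as a string and
# returns the model output as a string.
def pirate_model(input_under_test: str) -> str:
    # A real function could run a RAG pipeline, prompt chain, or agent here.
    return f"Arr matey, ye be askin': {input_under_test}. Here be me answer!"

# Assumption: the function is supplied through the same parameter used
# for stock models.
run = suite.run(model=pirate_model)
```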
Function with Context and Files
If you’re using the context and files, you probably want them available to your model function. You can do this by adding them as parameters to your function.
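A sketch of the pirate function extended with context and files; the parameter names and the dictionary shapes of `files` and `context` shown here are assumptions:

```python
def pirate_model_with_context(input_under_test: str, files: dict, context: dict) -> str:
    # Fold the provided context and attached file names into the response.
    # The dictionary shapes assumed here may differ from your SDK's types.
    context_summary = "; ".join(str(value) for value in context.values())
    file_names = ", ".join(files.keys())
    return (
        f"Arr, considerin' the context ({context_summary}) "
        f"and yer files ({file_names}): {input_under_test}"
    )
```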
3. Provide Outputs Directly
You can also provide us a list of the outputs you want to evaluate against. This is useful if you’ve already generated the outputs in some form.
NOTE: If you are using this method, the input_under_test field in the QuestionAnswerPair must match the input_under_test field in the test suite. Likewise, if you are using either the context or file features, both the context and files must also match.
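A sketch using QuestionAnswerPair; the import path, the name of the output field, and the way the pairs are supplied to run() are assumptions, while input_under_test must match the suite exactly as noted above:

```python
# Sketch: evaluate pre-generated outputs directly.
# Import of QuestionAnswerPair is omitted; its path is not documented here.
qa_pairs = [
    QuestionAnswerPair(
        input_under_test="What is the capital of France?",  # must match the suite
        llm_output="The capital of France is Paris.",        # output field name assumed
    ),
    QuestionAnswerPair(
        input_under_test="Who wrote Hamlet?",
        llm_output="Hamlet was written by William Shakespeare.",
    ),
]

# Assumption: the pairs are supplied through the same parameter used for models.
run = suite.run(model=qa_pairs)
```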
4. Provide Custom Operators
You can pass in custom operators to evaluate model outputs using your own criteria.
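As a rough illustration only, a custom operator might be a plain function that inspects an output and returns a pass/fail result; the signature, return type, and the way the operator is registered with run() below are all assumptions, not the SDK’s actual API:

```python
import re

def contains_no_email(output: str) -> bool:
    # Hypothetical operator: pass only if the output contains no email address.
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None

# Registering the operator via a keyword argument on run() is an assumption.
run = suite.run(model="gpt-4o-mini", custom_operators=[contains_no_email])
```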
Run Options
There are other parameters you can pass to the run() function to control its behavior, in addition to the model parameter. If you set wait_for_completion=True, the function will block until the run is complete (by default, it returns as soon as the run has started, not when it’s complete). You can also pass a run_name parameter to uniquely identify the run - this is useful if you’re starting many runs of the same test suite and need a way to disambiguate them.
Finally, you can also pass a RunParameters object to the run() function to control more aspects of the run. Some options include:
- eval_model: The model to use as the LLM-as-judge.
- parallelism: The number of tests to run at once.
- heavyweight_factor: Run the auto eval multiple times and take the mode of the results.
- max_output_tokens: If using the first model option above, control the max output tokens. Ignored if outputs are provided directly or a function is used.
- system_prompt: If using the first model option above, provide a system prompt to the model.
- except_on_error: Raise an exception if the run fails.
- custom_parameters: Custom parameters to pass to the model. These are shown on the run result page, even when running with a function.
Examples
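For instance, a sketch that combines the options above; the RunParameters field names are those listed, but the import path and the keyword used to pass the object to run() are assumptions:

```python
# Sketch: start a run with explicit options and run parameters.
# Import of RunParameters is omitted; its path is not documented here.
parameters = RunParameters(
    eval_model="gpt-4o",                      # model to use as the LLM-as-judge
    parallelism=4,                            # number of tests to run at once
    heavyweight_factor=3,                     # repeat the auto eval, take the mode
    max_output_tokens=512,                    # only applies when using a stock model
    system_prompt="You are a helpful assistant.",
    except_on_error=True,                     # raise an exception if the run fails
    custom_parameters={"temperature": 0.2},   # shown on the run result page
)

run = suite.run(
    model="gpt-4o-mini",
    wait_for_completion=True,                 # block until the run is complete
    run_name="nightly-regression",            # disambiguate repeated runs
    run_parameters=parameters,                # keyword name is an assumption
)
```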
After a Run is Complete
Once a run is complete, you can access the results on the Run object: the results of each test are available in the test_results property, along with the top-line pass rate, the URL of the run, and other information.
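A sketch of reading these values; test_results is named above, while the exact names of the pass-rate and URL properties are assumptions:

```python
# Sketch: inspect the Run object once the run has completed.
print(run.url)        # link to the run (property name assumed)
print(run.pass_rate)  # top-line pass rate (property name assumed)

for test_result in run.test_results:
    # The fields available on each test result are not documented here,
    # so we just print the object.
    print(test_result)
```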