Once you have created a test suite, you will want to use it to actually test your model. There are a few ways to do this.
If you want to run a test suite against a stock model (e.g. base gpt-4o), you can do so by clicking the “Start Run” button on the upper-right of the test-suite page.
When run, it will generate outputs from the model you choose and then evaluate your checks against those outputs.
Note: By default, we only make a subset of models available to users. If you’re interested in using additional models, including those hosted on Bedrock and Azure, please reach out to the Vals team.
If you have the outputs from your model collected already, and just want to run the tests against them, you can do so by choosing the “Upload CSV” option.
The CSV should have two columns, Question and Answer. The Question column should match the inputs in your test suite, and the Answer column should contain your model’s output for each of those questions. An easy way to create this file is to use “Export test questions” and then modify the CSV.
The system will then run the checks you’ve defined against the outputs in the CSV. Optionally, you can provide a “model name” to indicate which model was used to produce the outputs.
Note: If you are using context or files in your test suite, uploading an Input/Output CSV may not be an option.
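As a rough sketch of that workflow, the Python snippet below reads an exported questions file, calls a placeholder call_model() function for each question, and writes a Question/Answer CSV ready to upload. The file names, the call_model() function, and the assumption that the exported file uses a Question column header are illustrative, not part of the platform.

```python
import csv

def call_model(question: str) -> str:
    # Placeholder: replace with a call to your own model or API.
    return "model output for: " + question

# Read the questions exported from the test suite
# (assumed here to be saved as questions.csv with a Question column).
with open("questions.csv", newline="", encoding="utf-8") as f:
    questions = [row["Question"] for row in csv.DictReader(f)]

# Write the upload file: a Question column that matches the test suite
# inputs, and an Answer column with your model's outputs.
with open("outputs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Question", "Answer"])
    writer.writeheader()
    for question in questions:
        writer.writerow({"Question": question, "Answer": call_model(question)})
```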
There are a few parameters that you can set to control how the run works. You can provide a “name” for the run, which is useful for keeping track of many different runs.

Model Parameters:
Choose which model will be used to generate the outputs for the run.
Evaluation Parameters:
- Evaluation model: By default, we evaluate with GPT-4o. However, using this dropdown, you can also choose another model for evaluation (such as Mistral, Llama, or Claude).
- Confidence score (enabled by default): Compute and display a confidence score (high or low) for each check.
- Run summary (enabled by default): Create a summary of the run.
- Right answer comparison (enabled by default): Run the right answer comparison.

If you have the model defined locally in a Python script, you can use the SDK to produce the outputs. See the SDK docs for more information.
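To give a sense of what a model “defined locally in a Python script” can look like, here is a minimal sketch that wraps an LLM call in a plain function mapping an input string to an output string. The model choice and prompt handling are placeholder assumptions, and the exact way to plug such a function into the SDK is covered in the SDK docs rather than here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def my_model(test_input: str) -> str:
    """Locally defined model: takes a test question, returns an answer.

    Swap in whatever your application actually does (RAG, agents,
    locally hosted weights, post-processing, etc.).
    """
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": test_input}],
    )
    return response.choices[0].message.content

# A function like this produces the outputs that are then evaluated against
# your checks -- see the SDK docs for how to register and run it.
```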