Once you have created a test suite, you will want to use it to actually test your model. There are a few ways to do this.
If you want to run a test suite against a stock model (e.g. base gpt-4o), you can do so by clicking the “Start Run” button on the upper-right of the test-suite page.
The first set of options allows you to control how the evaluation works.
The second set of options controls the model that is being tested. The dropdown allows you to choose the model that is used to produce the outputs. You can also control the additional parameters for the model.
When run, it will pull outputs from the model you choose - then run the checks against those outputs.
Note: By default, we only make a subset of models available to users. If you’re interested in using additional models, including those hosted on Bedrock and Azure, please reach out to the Vals team.
If you have already collected the outputs from your model and just want to run the checks against them, you can do so by choosing the “Upload CSV” option.
The CSV should have two columns: Input and Output. The Input column should match the inputs you have in your test suite, and the Output column should contain your model’s output for each of these inputs.
The system will then run the checks you’ve defined against the outputs in the CSV. Optionally, you can provide a “model name” to help you specify what model was used to produce the outputs.
Note: If you are using context or files in your test suite, uploading an Input/Output CSV may not be an option.
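As a rough illustration, the short Python sketch below produces a CSV in the expected shape. The Input and Output column headers are the required part; the example rows and the file name outputs.csv are made up for demonstration.

```python
import csv

# Hypothetical example rows - the "Input" values should exactly match the
# inputs defined in your test suite, and "Output" holds your model's answers.
rows = [
    {"Input": "What is the capital of France?", "Output": "Paris."},
    {"Input": "Summarize the refund policy.", "Output": "Refunds are issued within 30 days..."},
]

# "outputs.csv" is an arbitrary file name; upload the resulting file via the
# "Upload CSV" option on the test-suite page.
with open("outputs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Input", "Output"])
    writer.writeheader()
    writer.writerows(rows)
```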
If you have the model defined locally in a Python script, you can use the SDK to produce the outputs. See the SDK docs for more information.
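In broad strokes, this means wrapping your model in a function that maps an input string to an output string and handing that function to the SDK. The sketch below is only an assumption of what that looks like - the SDK import path and run call are shown as commented placeholders because the real names and signatures live in the SDK docs.

```python
# Minimal sketch of a locally defined model. Only the function below is real
# code; the SDK calls are hypothetical placeholders - see the SDK docs.
def my_model(input_text: str) -> str:
    # Replace this with your own model logic (local weights, a custom chain,
    # a call to an internal service, etc.).
    return "model output for: " + input_text

# Something along these lines, per the SDK docs (names are illustrative only):
# from vals.sdk import Suite            # hypothetical import path
# suite = Suite.from_id("my-suite-id")  # hypothetical: load your test suite
# run = suite.run(model=my_model)       # hypothetical: run checks on outputs
```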
There are a few parameters that you can set to control how the run works.
Evaluation Parameters: This includes the evaluation model. By default, we do evaluation with GPT-4o. However, using this dropdown, you can also choose to use another model for evaluation (such as Mistral, Llama, or Claude).
Model Parameters: These are the same options described above - the model used to produce the outputs and any additional parameters for it.
You can also provide a “name” for the run - this is useful for keeping track of many different runs.