Once you have created a test suite, you will want to use it to actually test your model. There are a few ways to do this.

Running against a base LLM

If you want to run a test suite against a stock model (e.g. base gpt-4o), you can do so by clicking the “Start Run” button in the upper right of the test-suite page.

The first set of options allows you to control how the evaluation works.

The second set of options controls the model being tested. The dropdown lets you choose the model used to produce the outputs, and you can also adjust additional parameters for that model.

When run, the system will pull outputs from the model you chose and then run the checks against those outputs.
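
Conceptually, a run boils down to the loop sketched below (the function and argument names here are illustrative, not the platform’s API): generate an output for each test input, then score that output with every check.

```python
# Illustrative sketch of what a run does, not the platform's actual code.
def run_suite(test_cases, model_under_test, checks):
    """test_cases: list of {"input": ...}; checks: dict mapping check name -> check function."""
    results = []
    for case in test_cases:
        # Pull an output from the model selected in the run options.
        output = model_under_test(case["input"])
        # Run every check defined in the test suite against that output.
        check_results = {name: check(case["input"], output) for name, check in checks.items()}
        results.append({"input": case["input"], "output": output, "checks": check_results})
    return results
```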

Run by uploading Input/Output CSV

If you have already collected the outputs from your model and just want to run the checks against them, you can do so by choosing the “Upload CSV” option.

The CSV should have two columns: Input and Output. The Input column should match the inputs in your test suite, and the Output column should contain your model’s output for each of those inputs.
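
If you are producing this file from a script, something like the following (standard-library Python; the rows are made up for illustration) generates a CSV in the expected shape:

```python
import csv

# Example rows: Input values must match the inputs in your test suite, and
# Output values are whatever your model produced for them.
rows = [
    {"Input": "What is the capital of France?", "Output": "Paris is the capital of France."},
    {"Input": "Summarize the refund policy.", "Output": "Refunds are available within 30 days of purchase."},
]

with open("outputs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Input", "Output"])
    writer.writeheader()
    writer.writerows(rows)
```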

The system will then run the checks you’ve defined against the outputs in the CSV. Optionally, you can provide a “model name” to help you identify how the outputs in the CSV were produced.

Run using the SDK

If you have the model defined locally in a Python script, you can use the SDK to produce the outputs. See the SDK docs for more information.
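
Because the exact interface lives in the SDK docs, the snippet below is only a shape sketch: `sdk`, `Client`, and `run_test_suite` are hypothetical placeholder names, and the only real code is the locally defined model function that the SDK would call to produce outputs.

```python
# Hypothetical sketch: the SDK calls below are placeholders, not real package
# or method names; consult the SDK docs for the actual interface.

def my_model(prompt: str) -> str:
    # Your locally defined model. This stub just echoes the prompt so the
    # example stays self-contained.
    return f"Echo: {prompt}"

# A real run would register this callable with the SDK so the platform can pull
# outputs from it and then evaluate your checks, roughly along these lines:
#
#   client = sdk.Client(api_key="...")
#   client.run_test_suite(suite_id="my-suite", model=my_model, run_name="local-model-v1")
```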

Parameters

There are a few parameters that you can set to control how the run works.

Evaluation Parameters:

  • Evaluation Model: As part of our evaluation suites, we use LLMs under the hood for checks such as “includes”. By default, evaluation is done with GPT-4. However, using this dropdown, you can also choose Llama 2 (70B) or Mistral.
  • Parallelism: This controls how many tests run at any one time. Set it to a higher number to make the run finish faster; set it to a lower number if your model has limited capacity.
  • Heavyweight Factor: Runs each check multiple times to reduce variance in the results. For example, if it were set to 5, each check would be run 5 times and the final result would be the mode (see the sketch after this list).
  • Run Confidence Evaluation: Whether to compute a confidence score (high or low) for each check.
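
As a rough illustration of what the Parallelism and Heavyweight Factor settings mean (not the platform’s actual implementation), the sketch below repeats each check `heavyweight_factor` times and keeps the mode of the results, while a thread pool caps how many evaluations run concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mode

def run_check_repeatedly(check, test_input, output, heavyweight_factor=5):
    # Repeat the (possibly LLM-backed) check and keep the most common result
    # to reduce run-to-run variance.
    results = [check(test_input, output) for _ in range(heavyweight_factor)]
    return mode(results)

def run_all_checks(checks, cases, parallelism=4, heavyweight_factor=5):
    # The thread pool size plays the role of the Parallelism setting: it bounds
    # how many check evaluations are in flight at once.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [
            pool.submit(run_check_repeatedly, check, case["input"], case["output"], heavyweight_factor)
            for case in cases
            for check in checks
        ]
        return [f.result() for f in futures]
```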

Model Parameters:

  • Model: The model that is used to produce the outputs.
  • Temperature: The temperature of the model being tested.
  • Max Tokens: The maximum number of tokens the model being tested can output.
  • System Prompt: The system prompt to be passed to the model being tested.
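
These four settings map directly onto the arguments of a typical chat-completions request. As an illustration only (using an OpenAI-style client here as an assumption, not a statement about the platform’s internals):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o",      # Model: the model used to produce the outputs
    temperature=0.2,     # Temperature of the model being tested
    max_tokens=512,      # Max Tokens the model may generate for each output
    messages=[
        {"role": "system", "content": "You are a helpful support assistant."},  # System Prompt
        {"role": "user", "content": "How do I reset my password?"},             # one test-suite input
    ],
)
print(response.choices[0].message.content)
```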

You can also provide a “name” for the run; this is useful for keeping track of many different runs.