Using the Vals Platform from Your Code
Setup
Make sure you have the Vals Python package installed.
Then create an API key and set it as an environment variable. This guide assumes you are familiar with the basic concepts of Test Suites, Tests, Checks, etc. If not, see the Test Suite Creation page.
Creating a Test Suite with the SDK
In the SDK, every construct is generally represented as a Python object (constructed with Pydantic). To create a test suite, you can first create a `Suite` object, then call `create()`. For example:
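Here is a minimal sketch. The import paths and the check operator name are assumptions based on the SDK's naming conventions; consult the SDK reference for the exact schema.

```python
import asyncio

# Assumed import paths - adjust to match your installed SDK version.
from vals.sdk.suite import Suite
from vals.sdk.types import Check, Test


async def main():
    suite = Suite(
        title="My First Suite",
        tests=[
            Test(
                input_under_test="What is the capital of France?",
                # "equals" is an illustrative operator name.
                checks=[Check(operator="equals", criteria="Paris")],
            ),
        ],
    )
    await suite.create()  # registers the suite on the Vals platform
    print(suite.id)


asyncio.run(main())
```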
This creates a simple test suite with a single test.
Note on Async: All SDK functions are asynchronous, so you will need to call them from an asynchronous context. See the async docs for more information.
Tests with files
Our system also supports using files as test inputs. For example, you may want to test a model’s ability to answer questions about a contract, or extract information from an image. To add files to a test, you can do the following:
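A sketch, building on the imports above; the `files_under_test` field name and the check operator are assumptions:

```python
# Files are referenced by local path and uploaded when the suite is created.
# "files_under_test" is an assumed field name - check the Test model.
test = Test(
    input_under_test="Summarize the indemnification clause in this contract.",
    files_under_test=["data/contract.pdf"],
    checks=[Check(operator="includes", criteria="indemnification")],
)
```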
Both the model and the operators will have access to the file content.
Adding Context
We also support adding arbitrary information to the input of each test, in addition to the input and the files. For example, you may want to provide a chat history to the model, provide information about the user who asked the question, specify where in an application the question was asked, etc.
You can provide this with the `context` parameter of the `Test`:
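For example (a sketch; the keys inside `context` are entirely up to you):

```python
# context accepts arbitrary, JSON-serializable data passed alongside the input.
test = Test(
    input_under_test="What did I ask about earlier?",
    context={
        "message_history": [
            {"role": "user", "content": "Tell me about QSBS."},
            {"role": "assistant", "content": "QSBS stands for Qualified Small Business Stock..."},
        ],
        "user_tier": "enterprise",
    },
    checks=[Check(operator="grammar")],
)
```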
Adding Tags
You can also add tags to a test. These tags are searchable in the test suite and run results, and you can see a performance breakdown by tag.
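For example (a sketch; `tags` as the field name is an assumption):

```python
# Tags are plain strings that become searchable facets on the platform.
test = Test(
    input_under_test="Draft a termination letter for an at-will employee.",
    tags=["hr", "drafting"],
    checks=[Check(operator="grammar")],
)
```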
Adding Global Checks
If you want certain checks to be run on every test, you can add them to the suite with the `global_checks` parameter. For example, this is how you would check the grammar of every test by default.
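A sketch, reusing the types from the first example:

```python
# (inside an async context)
suite = Suite(
    title="Suite with Global Checks",
    # Applied to every test in the suite, in addition to each test's own checks.
    global_checks=[Check(operator="grammar")],
    tests=[
        Test(input_under_test="Explain QSBS in one paragraph.", checks=[]),
        Test(input_under_test="Summarize the attached memo.", checks=[]),
    ],
)
await suite.create()
```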
Advanced Check Modifiers
Each check has a set of modifiers that can be used to change its behavior:
- optional: The check is not counted towards the final pass rate
- severity: Allows you to weight some checks higher than others
- examples: Allows you to provide in-context examples of outputs that should pass or fail
- extractor: Allows you to extract items from the output before the check is evaluated
- conditional: Allows you to only run the check if another check evaluates to true
- category: Allows you to override the default category of the check (correctness, formatting, etc.). This is similar to tags, but at a more granular level.
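Here is a sketch showing a subset of these modifiers on a single check. The `CheckModifiers` helper, its import path, and its field names mirror the list above but are assumptions; consult the SDK reference for the exact schema.

```python
from vals.sdk.types import CheckModifiers  # assumed import path

check = Check(
    operator="includes",
    criteria="termination clause",
    modifiers=CheckModifiers(
        optional=True,          # not counted towards the final pass rate
        severity=2,             # weight this check more heavily than others
        category="formatting",  # overrides the default category
    ),
)
```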
Downloading / Pulling a Test Suite
If a test suite is already on the platform, you can pull it locally to edit or save it. Just copy the suite ID from the test suite page (or from the last portion of the test suite URL), then call `Suite.from_id`:
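For example (a sketch; the printed property names are assumptions):

```python
# (inside an async context)
# Copy the suite ID from the platform UI or the last segment of the suite URL.
suite = await Suite.from_id("your-suite-id-here")
print(suite.title)
for test in suite.tests:
    print(test.input_under_test)
```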
Updating a Test Suite
You can also update the test suite that you have locally. For example, let’s say you want to change the global checks of a suite. You can do this as follows:
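A sketch, assuming the suite exposes an `update()` method that mirrors `create()`:

```python
# (inside an async context)
suite = await Suite.from_id("your-suite-id-here")

# Change the suite locally...
suite.global_checks = [Check(operator="grammar")]

# ...then push the change back to the platform.
await suite.update()  # assumed method name
```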
Running a Test Suite
Once you’ve created the test suite, you can run it with the `run()` function. This will run all the tests in the suite against your model. We support three different ways to produce outputs as you run the suite:
- Stock Model: We have a set of models on our platform that you can use, from the likes of OpenAI, Anthropic, etc.
- Function: You can provide us a function that takes in the input to the model (and optionally, files and context) and returns the output of your custom model.
- Provide Outputs: You can provide us a list of input/output pairs, and we will run the evaluation against these outputs directly.
1. Running with stock model
The code below will evaluate how `gpt-4o-mini` performs on the tests you’ve defined.
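For example (a sketch; `run.url` is an assumed property name):

```python
# (inside an async context; `suite` was created or pulled earlier)
run = await suite.run(model="gpt-4o-mini", wait_for_completion=True)
print(run.url)
```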
2a. Running with function (basic)
You can also provide a custom function - this can contain any RAG pipelines, prompt chains, agentic behavior, etc. For example, here’s a naive model that produces output in the style of a pirate.
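A sketch; passing the function via the `model` parameter is an assumption about the `run()` signature:

```python
# (inside an async context)
# The SDK calls this function once per test with the test's input and records
# whatever it returns as the model output.
async def pirate_model(input_under_test: str) -> str:
    # In a real setup this could be a RAG pipeline, prompt chain, agent, etc.
    return f"Arr matey! Ye asked: '{input_under_test}'. Here be me answer..."


run = await suite.run(model=pirate_model, wait_for_completion=True)
```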
2b. Running with function (with context and files)
If you’re using the context and files, you probably want them available to your model function. You can do this by adding them as parameters to your function.
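A sketch; the parameter names and the shapes of `files` and `context` are assumptions:

```python
# (inside an async context)
async def contract_model(input_under_test: str, files: dict, context: dict) -> str:
    # Assumed shapes: `files` maps filename -> file contents, and `context`
    # is the dict you set on the Test.
    history = context.get("message_history", [])
    file_names = ", ".join(files.keys())
    return f"Received {len(history)} prior messages and these files: {file_names}"


run = await suite.run(model=contract_model, wait_for_completion=True)
```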
3. Provide outputs directly
You can also provide us a list of the outputs you want to evaluate against. This is useful if you’ve precomputed the outputs in some way.
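A sketch; the import path, the output field name, and passing the list via the `model` parameter are assumptions:

```python
from vals.sdk.types import QuestionAnswerPair  # assumed import path

qa_pairs = [
    QuestionAnswerPair(
        input_under_test="What is the capital of France?",
        llm_output="The capital of France is Paris.",  # assumed field name
    ),
]

# (inside an async context)
run = await suite.run(model=qa_pairs, wait_for_completion=True)
```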
NOTE: If you are using this method, the `input_under_test` field in the `QuestionAnswerPair` must match the `input_under_test` field in the test suite. Likewise, if you are using either the context or file features, the context and files must also match.
Other Run Options
There are other parameters you can pass to the `run()` function to control its behavior, in addition to the `model` parameter. If you set `wait_for_completion=True`, the function will block until the run is complete (by default, it returns as soon as the run is started, not when it’s complete). You can also pass a `run_name` parameter to uniquely identify the run - this is useful if you’re starting many runs of the same test suite and need a way to disambiguate them.
Finally, you can also pass a `RunParameters` object to the `run()` function to control more aspects of the run. Some options include:
- `eval_model`: The model to use as the LLM-as-judge
- `parallelism`: The number of tests to run at once
- `heavyweight_factor`: Run the auto-eval multiple times and take the mode of the results
- `max_output_tokens`: If using the first model option above, controls the maximum number of output tokens. Ignored if outputs are provided directly or a function is used.
- `system_prompt`: If using the first model option above, provides a system prompt to the model.
Here’s an example:
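(A sketch; the import path and the keyword used to pass `RunParameters` to `run()` are assumptions.)

```python
from vals.sdk.types import RunParameters  # assumed import path

parameters = RunParameters(
    eval_model="gpt-4o",        # LLM used as the judge
    parallelism=4,              # number of tests run concurrently
    heavyweight_factor=3,       # run the auto-eval three times, take the mode
    max_output_tokens=512,      # only applies when using a stock model
    system_prompt="You are a helpful legal assistant.",
)

# (inside an async context)
run = await suite.run(
    model="gpt-4o-mini",
    run_name="nightly-regression",
    wait_for_completion=True,
    parameters=parameters,  # assumed keyword name
)
```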
After a run is complete
Once a run is complete, you can access the results in the `Run` object. You can access the results of each test via the `test_results` property, as well as the top-line pass rate, the URL, and other information.
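For example (a sketch; the property names on `Run` and its test results are assumptions based on the description above):

```python
# (inside an async context)
run = await suite.run(model="gpt-4o-mini", wait_for_completion=True)

print(f"Pass rate: {run.pass_rate}")
print(f"View in browser: {run.url}")

for result in run.test_results:
    print(result.input_under_test, result.llm_output)
```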
Loading a Test Suite from a file
Although it’s preferred to create a test suite with Python objects, a test suite can also be loaded from a local JSON file.
To create a test suite from a file, you can use the `Suite.from_file()` function.
Here is an example of what the test suite file looks like:
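The sketch below writes an illustrative file and loads it. The JSON schema shown is an assumption that mirrors the fields used earlier in this guide; export a suite from the platform to see the canonical format.

```python
import json

# Assumed schema: mirrors the Suite/Test/Check fields used earlier.
suite_dict = {
    "title": "Suite Loaded from File",
    "tests": [
        {
            "input_under_test": "What is the capital of France?",
            "checks": [{"operator": "equals", "criteria": "Paris"}],
        }
    ],
}

with open("suite.json", "w") as f:
    json.dump(suite_dict, f, indent=2)

# (inside an async context)
suite = await Suite.from_file("suite.json")
await suite.create()
```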
Using Golden Outputs
In addition to checks, the SDK also supports golden outputs - the notion of a “right answer” for each input. Here’s an example:
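(A sketch; `golden_output` as the field name and the operator that grades against it are assumptions.)

```python
test = Test(
    input_under_test="What is the statute of limitations for a written contract claim in California?",
    golden_output="Four years.",  # assumed field name for the known-correct answer
    checks=[
        # Illustrative operator that compares the model output to the golden output.
        Check(operator="matches_golden_output"),
    ],
)
```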