Setup
Make sure you have the Vals Python Package installed.
Creating a Test Suite with the SDK
In the SDK, every construct is generally represented as a Python object (constructed with Pydantic). To create a test suite, you first create a Suite object, then call create(). For example:
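A minimal sketch of suite creation. The import paths and the exact Test/Check field names are assumptions based on the descriptions in this guide:

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import Test, Check

async def main():
    suite = Suite(
        title="Example Suite",
        tests=[
            Test(
                input_under_test="What is the capital of France?",
                checks=[Check(operator="equals", criteria="Paris")],
            ),
        ],
    )
    # Persists the suite to the platform; all SDK calls are async.
    await suite.create()

asyncio.run(main())
```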
Note on Async: All SDK functions are asynchronous, so you will need to call them from an asynchronous context. See the async docs for more information.
Tests with files
Our system also supports testing files as input. For example, you may want to test a model’s ability to answer questions about a contract, or extract information from an image. To add files to a test, you can do the following:
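A hedged sketch of attaching files to a test; the files_under_test parameter name and check fields are assumptions based on the description above:

```python
from vals.sdk.suite import Suite
from vals.sdk.types import Test, Check

suite = Suite(
    title="Contract QA",
    tests=[
        Test(
            input_under_test="What is the termination clause in this contract?",
            # Paths to local files uploaded alongside the test input
            # (parameter name is an assumption).
            files_under_test=["contracts/example_contract.pdf"],
            checks=[Check(operator="includes", criteria="termination")],
        ),
    ],
)
```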
Adding Context
We also support adding arbitrary information to the input of each test, in addition to the input and the files. For example, you may want to provide a chat history to the model, provide information about the user who asked the question, specify where in an application the question was asked, etc. You can provide this with the context parameter of the Test:
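For example (a sketch; field names other than context are assumptions):

```python
from vals.sdk.types import Test, Check

test = Test(
    input_under_test="What is my current plan?",
    # Context values may be raw strings or JSON-serializable objects.
    context={
        "user": {"id": "u_123", "plan": "enterprise"},
        "chat_history": "User previously asked about billing.",
    },
    checks=[Check(operator="includes", criteria="enterprise")],
)
```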
NOTE: Context field values can be either raw strings or JSON objects. If it is a JSON object, it will be parsed correctly and pretty-printed in the UI.
Adding Tags
You can also add tags to a test. These tags are searchable in the test suite and run result, and you can see a performance breakdown by tag.
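A brief sketch, assuming Test accepts a tags parameter as a list of strings:

```python
from vals.sdk.types import Test, Check

test = Test(
    input_under_test="Summarize the attached contract.",
    # Tags are searchable in the suite and run results.
    tags=["contracts", "summarization"],
    checks=[Check(operator="grammar")],
)
```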
Adding Global Checks
If you want certain checks to be run on every test, you can add them to the suite with the global_checks parameter. For example, this is how you would check the grammar of every test by default.
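A sketch of a suite-level grammar check, assuming global_checks takes the same Check objects used on individual tests:

```python
from vals.sdk.suite import Suite
from vals.sdk.types import Test, Check

suite = Suite(
    title="Suite with Global Checks",
    # Applied to every test in the suite, in addition to each test's own checks.
    global_checks=[Check(operator="grammar")],
    tests=[
        Test(input_under_test="Explain QSBS in two sentences.", checks=[]),
    ],
)
```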
Advanced Check Modifiers
Each check has a set of modifiers that can be used to change its behavior (see the sketch after this list):
- severity: Allows you to weight some checks higher than others
- examples: Allows you to provide in-context examples of outputs that should pass or fail
- extractor: Allows you to extract items from the output before the check is evaluated
- conditional: Allows you to only run the check if another check evaluates to true
- category: Allows you to override the default category of the check (correctness, formatting, etc.). This is similar to tags, but operates at a more granular level.
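A hedged sketch of attaching modifiers to a check; the shape of the modifiers object (a CheckModifiers type with these field names) is an assumption:

```python
from vals.sdk.types import Check, CheckModifiers

check = Check(
    operator="includes",
    criteria="refund policy",
    modifiers=CheckModifiers(
        severity=2.0,           # weight this check more heavily than others
        category="correctness", # override the default category
    ),
)
```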
Downloading / Pulling a Test Suite
If a test suite is already in the platform, you can pull it locally to edit or save it. Just copy the suite ID from the test suite page (or from the last portion of the test suite URL), then call Suite.from_id:
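For example (a sketch; the suite ID is a placeholder):

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    # Replace with the ID copied from the test suite page or URL.
    suite = await Suite.from_id("your-suite-id")
    print(suite.title)
    for test in suite.tests:
        print(test.input_under_test)

asyncio.run(main())
```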
Updating a Test Suite
You can also update a test suite that you have pulled locally. For example, let’s say you want to change the global checks of a suite. You can do this as follows:
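A sketch of pulling a suite, modifying it, and pushing the change back, assuming an async update() method consistent with the patterns above:

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import Check

async def main():
    suite = await Suite.from_id("your-suite-id")
    # Change the checks applied to every test in the suite.
    suite.global_checks = [Check(operator="grammar")]
    # Push the local changes back to the platform.
    await suite.update()

asyncio.run(main())
```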
Running a Test Suite
Once you’ve created the test suite, you can run it with the run() function. This will run all the tests in the suite against your model. We support three different ways to produce outputs as you run the suite:
- Stock Model: We have a set of models on our platform that you can use, from the likes of OpenAI, Anthropic, Meta, etc.
- Function: You can provide a function that takes in the input to the model (and optionally, files and context) and returns the output of your custom model.
- Provide Outputs: You can provide a list of input/output pairs, and we will run the evaluation against these outputs directly.
1. Running with stock model
The code below will evaluate how gpt-4o-mini performs on the tests you’ve defined:
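A minimal sketch, assuming run() accepts the model name as a string:

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    suite = await Suite.from_id("your-suite-id")
    # Runs every test in the suite against a stock model hosted on the platform.
    run = await suite.run(model="gpt-4o-mini")
    print(run.url)

asyncio.run(main())
```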
2a. Running with function (basic)
You can also provide a custom function - this can contain any RAG pipelines, prompt chains, agentic behavior, etc. For example, here’s a naive model that produces output in the style of a pirate:
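A sketch of the pirate example. The expected signature (a function from the input string to an output string, passed as the model argument) is an assumption:

```python
import asyncio

from vals.sdk.suite import Suite

def pirate_model(input_under_test: str) -> str:
    # Any custom logic can live here: RAG pipelines, prompt chains, agents, etc.
    return f"Arr matey! Ye asked: {input_under_test}. The answer be buried treasure."

async def main():
    suite = await Suite.from_id("your-suite-id")
    # Pass the function in place of a stock model name.
    run = await suite.run(model=pirate_model)
    print(run.url)

asyncio.run(main())
```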
2b. Running with function (with context and files)
If you’re using context and files, you probably want them available to your model function. You can do this by adding them as parameters to your function:
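A hedged sketch; the parameter names (files, context) and their types are assumptions based on the description above:

```python
import asyncio

from vals.sdk.suite import Suite

def model_with_context(input_under_test: str, files: dict, context: dict) -> str:
    # `files` and `context` mirror what was attached to each Test
    # (types are assumptions).
    user = context.get("user", {})
    return f"Answering for the {user.get('plan', 'unknown')} plan: {input_under_test}"

async def main():
    suite = await Suite.from_id("your-suite-id")
    run = await suite.run(model=model_with_context)
    print(run.url)

asyncio.run(main())
```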
3. Provide outputs directly
You can also provide a list of the outputs you want to evaluate against. This is useful if you’ve already generated the outputs in some form.
NOTE: If you are using this method, the input_under_test field in the QuestionAnswerPair must match the input_under_test field in the test suite. Likewise, if you are using either the context or file features, both the context and files must also match.
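A sketch consistent with the note above. The llm_output field name and passing the pairs via the model argument are assumptions:

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import QuestionAnswerPair

async def main():
    suite = await Suite.from_id("your-suite-id")
    qa_pairs = [
        QuestionAnswerPair(
            # Must match the input_under_test of the corresponding test in the suite.
            input_under_test="What is the capital of France?",
            llm_output="Paris",  # field name is an assumption
        ),
    ]
    # Evaluation runs directly against the provided outputs.
    run = await suite.run(model=qa_pairs)
    print(run.url)

asyncio.run(main())
```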
4. Provide custom operators
You can pass in custom operators to evaluate model outputs using your own criteria.
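A heavily hedged sketch: the custom_operators argument and the operator signature shown here are hypothetical, intended only to illustrate supplying your own evaluation logic.

```python
import asyncio

from vals.sdk.suite import Suite

def no_apologies(output: str) -> bool:
    # Hypothetical custom operator: passes only if the output never apologizes.
    return "sorry" not in output.lower()

async def main():
    suite = await Suite.from_id("your-suite-id")
    # Hypothetical argument name; the real signature may differ.
    run = await suite.run(model="gpt-4o-mini", custom_operators=[no_apologies])
    print(run.url)

asyncio.run(main())
```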
Other Run Options
There are other parameters you can pass to the run() function to control its behavior, in addition to the model parameter. If you set wait_for_completion=True, the function will block until the run is complete (by default, it returns as soon as the run is started, not when it’s complete). You can also pass a run_name parameter to uniquely identify the run - this is useful if you’re starting many runs of the same test suite and need a way to disambiguate them.
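For example (a sketch using the parameters just described):

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    suite = await Suite.from_id("your-suite-id")
    run = await suite.run(
        model="gpt-4o-mini",
        wait_for_completion=True,       # block until the run finishes
        run_name="nightly-regression",  # label to disambiguate repeated runs
    )
    print(run.url)

asyncio.run(main())
```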
Finally, you can also pass a RunParameters object to the run() function to control more aspects of the run. Some options include (see the sketch after this list):
- eval_model: The model to use as the LLM-as-judge
- parallelism: The number of tests to run at once
- heavyweight_factor: Run the auto eval multiple times and take the mode of the results
- max_output_tokens: If using the first model option above, control the max output tokens. Ignored if outputs are provided directly or via a function.
- system_prompt: If using the first model option above, provide a system prompt to the model.
- except_on_error: Raise an exception if the run fails.
- custom_parameters: Custom parameters to pass to the model. These are shown in the run result page, even when running with a function.
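A sketch, assuming RunParameters is importable from the SDK's types and is passed via a parameters-style argument (the argument name is an assumption):

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import RunParameters

async def main():
    suite = await Suite.from_id("your-suite-id")
    params = RunParameters(
        eval_model="gpt-4o",     # LLM-as-judge model
        parallelism=4,           # number of tests evaluated concurrently
        heavyweight_factor=3,    # repeat the auto eval and take the mode
        system_prompt="Answer concisely.",
    )
    # Argument name is an assumption; it may differ in the SDK.
    run = await suite.run(model="gpt-4o-mini", parameters=params)
    print(run.url)

asyncio.run(main())
```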
After a run is complete
Once a run is complete, you can access the results in the Run object. You can access the results of each test in the test_results property, as well as the top-line pass rate, the URL, and other information.
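A sketch of reading results off the Run object; attribute names beyond test_results and the URL (pass_rate, llm_output) are assumptions:

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    suite = await Suite.from_id("your-suite-id")
    run = await suite.run(model="gpt-4o-mini", wait_for_completion=True)

    print(f"Pass rate: {run.pass_rate}")  # top-line pass rate (name assumed)
    print(f"Results:   {run.url}")
    for result in run.test_results:
        print(result.input_under_test, result.llm_output)  # field names assumed

asyncio.run(main())
```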
Human Review
The Human Review system allows you to queue test runs for manual evaluation by human reviewers. This provides a way to validate model outputs beyond automated checks.
Adding a Run to Review Queue
Queue a run for human review using the add_to_queue() method:
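A sketch using the parameters documented below; that add_to_queue() is called on the Run object and is async are assumptions consistent with the rest of the SDK:

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    suite = await Suite.from_id("your-suite-id")
    run = await suite.run(model="gpt-4o-mini", wait_for_completion=True)

    # Queue the completed run for human review.
    await run.add_to_queue(
        assigned_reviewers=["reviewer@example.com"],  # empty list allows any reviewer
        number_of_reviews=2,                          # reviews per test
        rereview_auto_eval=True,                      # re-run auto eval after reviews
    )

asyncio.run(main())
```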
- assigned_reviewers - List of reviewer email addresses (empty list allows any reviewer)
- number_of_reviews - How many reviewers will evaluate each test (default: 1)
- rereview_auto_eval - Whether to re-run auto-evaluation after reviews (default: True)
Working with Reviews
Once a run is queued, you can access the review through the run.review cached property:
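A sketch of reading review properties, following the list below (whether test_results requires await is taken from that list):

```python
async def inspect_review(run) -> None:
    # Call this from an async context, e.g. via asyncio.run().
    review = run.review

    print(review.id)                         # same as run.review_id
    print(review.status)                     # Pending, Archived, or Completed
    print(review.pass_rate_human_eval)       # pass rate across human reviews
    print(review.agreement_rate_human_eval)  # agreement between human reviewers

    # test_results is a cached property that requires await.
    results = await review.test_results
    print(len(results))
```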
- id - Same as run.review_id
- status - Current review status (Pending, Archived, or Completed)
- pass_rate_human_eval - Pass rate across all human reviews
- agreement_rate_human_eval - Agreement rate between human reviewers
- test_results - List of completed test results (cached property, requires await)
Working with Test Results
Access individual test result reviews to get detailed feedback:
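A sketch that iterates the review's test results and reads the properties listed below; exactly how the review objects nest is an assumption inferred from those descriptions:

```python
async def inspect_test_reviews(review) -> None:
    # Call this from an async context; test_results requires await.
    for test_result in await review.test_results:
        print(test_result.test.input_under_test)  # the original test being reviewed
        print(test_result.reviewed_by)            # reviewer email addresses
        for rev in test_result.reviews:
            print(rev.feedback)                   # optional reviewer feedback
            print(rev.completed_by, rev.completed_at)
```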
Each test result review exposes the following properties:
- reviewed_by - List of reviewer email addresses
- reviews - List of all reviews for this test
- test - The original test being reviewed
- check_results - Auto-evaluated check results
Each individual review in the reviews list provides:
- feedback - Optional reviewer feedback
- completed_by - Reviewer who completed this review
- completed_at / started_at - Timestamps
- auto_eval_review_values - Human validation of auto-evaluations
- custom_review_values - Custom template review data
Loading a Test Suite from a file
Although it’s preferred to create a test suite with Python objects, the test suite can also be loaded from a local JSON file. To create a test suite from a file, you can use the Suite.from_file() function.
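A sketch, assuming the JSON file mirrors the Suite fields shown earlier (title, tests, and so on) and that from_file is async like the rest of the SDK:

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    # The JSON file is expected to mirror the Suite fields (title, tests, ...).
    suite = await Suite.from_file("suite.json")
    await suite.create()

asyncio.run(main())
```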
Using “Right Answers”
To add tests with right answers, just use the golden_answer field in the test. A full example is as follows:
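A sketch of a full example with a golden answer; everything other than the golden_answer field follows the assumed shapes used earlier in this guide:

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import Test, Check

async def main():
    suite = Suite(
        title="Suite with Golden Answers",
        tests=[
            Test(
                input_under_test="What is 2 + 2?",
                # The known-correct answer used as a reference during evaluation.
                golden_answer="4",
                checks=[Check(operator="grammar")],
            ),
        ],
    )
    await suite.create()

asyncio.run(main())
```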