Setup

Make sure you have the Vals Python package installed:

pip install valsai

Then create an API key and set it as an environment variable. This guide assumes you are familiar with the basic concepts of Test Suites, Tests, Checks, etc. If not, see the Test Suite Creation Page.
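
For example, you can export the key in your shell or set it in Python before using the SDK. The environment variable name below is an assumption - check the API key page in the platform for the exact name to use:

import os

# Assumed variable name - confirm against the API key page in the platform
os.environ["VALS_API_KEY"] = "your-api-key-here"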

Creating a Test Suite with the SDK

In the SDK, every construct is generally represented as a Python object (constructed with Pydantic). To create a test suite, you can first create a Suite object, then call create(). For example:

from vals.sdk.v2.suite import Suite
from vals.sdk.v2.types import Test, Check

suite = Suite(
    title="Test Suite",
    description="This is an example test suite.",
    tests=[
        Test(
            input_under_test="What is QSBS?",
            checks=[
                Check(operator="equals", criteria="QSBS")
            ]
        )
    ],
)
await suite.create()

print("Url: ", suite.url)

This creates a simple test suite with a single test.

Note on Async: All SDK functions are asynchronous, so you will need to call them from an asynchronous context. See the async docs for more information.
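
For example, in a plain Python script you can wrap the calls in asyncio.run from the standard library (a minimal sketch that recreates the suite from the example above):

import asyncio

from vals.sdk.v2.suite import Suite
from vals.sdk.v2.types import Check, Test

async def main() -> None:
    suite = Suite(
        title="Test Suite",
        tests=[
            Test(
                input_under_test="What is QSBS?",
                checks=[Check(operator="equals", criteria="QSBS")],
            )
        ],
    )
    await suite.create()
    print("Url: ", suite.url)

asyncio.run(main())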

Tests with files

Our system also supports testing files as input. For example, you may want to test a model’s ability to answer questions about a contract, or extract information from an image. To add files to a test, you can do the following:

from vals.sdk.v2.suite import Suite, Test, Check

suite = Suite(
    title="My Suite with files",
    tests=[
        Test(
            input_under_test="Is there an MFN clause in this contract?",
            files_under_test=["path/to/file.docx"],
            checks=[Check(operator="equals", criteria="No")]
        )
    ]
)

Both the model and the operators will have access to the file content.

Adding Context

We also support adding arbitrary information to the input of each test, in addition to the input and the files. For example, you may want to provide a chat history to the model, provide information about the user who asked the question, specify where in an application the question was asked, etc.

You can provide this with the context parameter of the Test:

from vals.sdk.v2.suite import Suite, Test, Check

suite = Suite(
    title="My Suite with context",
    tests=[
        Test(
            input_under_test="What is the MFN clause?",
            context={
                "user_email": "john.doe@example.com",
                "message_history": [
                    {"role": "user", "content": "What can you help me with?"},
                    {"role": "assistant", "content": "I can help you with answering legal questions about contracts."},
                ]
            },
            checks=[Check(operator="equals", criteria="No")]
        )
    ]
)

Adding Tags

You can also add tags to a test. These tags are searchable in the test suite and run results, and you can see a performance breakdown by tag.

Test(
    input_under_test="What is the MFN clause?",
    tags=["contract", "mfn"],
    checks=[Check(operator="grammar")]
)

Adding Global Checks

If you want certain checks to run on every test, you can add them to the suite with the global_checks parameter. For example, this is how you would check the grammar of every output by default.

suite = Suite(
    title="My Suite with global checks",
    global_checks=[
        Check(operator="grammar")
    ],
    tests=[...],
)

Advanced Check Modifiers

Each check has a set of modifiers that can be used to change its behavior:

  • optional: The check is not counted towards the final pass rate
  • severity: Allows you to weight some checks higher than others
  • examples: Allows you to provide in-context examples of outputs that should pass or fail
  • extractor: Allows you to extract items from the output before the check is evaluated
  • conditional: Allows you to only run the check if another check evaluates to true
  • category: Allows you to override the default category of the check (correctness, formatting, etc.). This is similar to tags, but applies at the more granular level of individual checks.

Check(
    operator="grammar",
    modifiers=CheckModifiers(
        # The check should not count towards the final pass rate
        optional=True,
        # Weight three times as important as other checks
        severity=3,
        # In-context examples of outputs that should pass or fail
        examples=[
            Example(type="positive", text="This is an example of good grammar.")
        ],
        # Only evaluate part of the output
        extractor="Extract only the first paragraph",
        # Only run this check if the below passes
        conditional=ConditionalCheck(operator="...", criteria="..."),
        # Override the category 
        category="writing_quality"
    )
)

Downloading / Pulling a Test Suite

If a test suite is already in the platform, you can pull it locally to edit or save it. Just copy the suite ID from the test suite page (or from the last portion of the test suite URL). Then call Suite.from_id:

suite = await Suite.from_id("12345678-abcd-efgh-1234-0123456789")
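
The returned object is a regular Suite, so you can inspect it locally before making any changes. A quick sketch, assuming the same fields used in the creation examples above (title, tests, checks):

suite = await Suite.from_id("12345678-abcd-efgh-1234-0123456789")

print(suite.title)
for test in suite.tests:
    print(test.input_under_test, [check.operator for check in test.checks])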

Updating a Test Suite

You can also update the test suite that you have locally. For example, let’s say you want to change the global checks of a suite. You can do this as follows:

# Download suite locally
suite = await Suite.from_id("12345678-abcd-efgh-1234-0123456789")

# Update the suite
suite.global_checks = [Check(operator="grammar")]
await suite.update()
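
The same pattern works for editing the tests themselves - for example, appending a new test before pushing the change back (a sketch, assuming the suite exposes its tests via the tests attribute, as in the creation examples):

suite.tests.append(
    Test(
        input_under_test="What is a 409A valuation?",
        checks=[Check(operator="includes", criteria="fair market value")],
    )
)
await suite.update()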

Running a Test Suite

Once you’ve created the test suite, you can run it with the run() function. This will run all the tests in the suite against your model. We support three different ways to produce outputs as you run the suite:

  1. Stock Model: We have a set of models available on our platform that you can use, from the likes of OpenAI, Anthropic, etc.
  2. Function: You can provide us a function that takes in the input to the model (and optionally, files and context) and returns the output of your custom model.
  3. Provide Outputs: You can provide us a list of input/output pairs, and we will run the evaluation against these outputs directly.

1. Running with stock model

The code below evaluates how gpt-4o-mini performs on the tests you’ve defined.

# create a suite using the steps above

# run the suite with a stock model
run = await suite.run(model="openai/gpt-4o-mini")
print(f"Run URL: {run.url}")

2a. Running with function (basic)

You can also provide a custom function - this can contain any RAG pipelines, prompt chains, agentic behavior, etc. For example, here’s a naive model that produces output in the style of a pirate.

import os

from openai import OpenAI

gpt_client = OpenAI(api_key=os.environ.get("OPEN_AI_KEY"))

def model_func(test_input: str):
    """Arbitrary function to represent your 'model' - including RAG pipelines, prompt chains, etc."""
    prompt = "You are a pirate, answer in the speaking style of a pirate.\n\n"
    temp = 0.2

    response = gpt_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt + test_input}],
        temperature=temp,
    )
    return response.choices[0].message.content

# Run the model against the suite, and provide a name for the model for record keeping.
await suite.run(model=model_func, model_name="pirate-model-v1")

2b. Running with function (with context and files)

If you’re using context and files, you probably want them available in your model function. You can do this by adding them as parameters to the function.

from io import BytesIO
from typing import Any

from vals.sdk.sdk import read_pdf

def model_func(test_input: str, files: dict[str, BytesIO], context: dict[str, Any]):
    # Access context (e.g. message history)
    message_history = context["message_history"]

    # Access files
    for filename, file_content in files.items():
        file_text = read_pdf(file_content)  # Or use the binary content directly

    # Query your model and return its output
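
You then pass this function to run() exactly as in the basic case; the model_name below is just an arbitrary label for record keeping:

run = await suite.run(model=model_func, model_name="contract-qa-v1")
print(f"Run URL: {run.url}")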

3. Provide outputs directly

You can also provide us a list of the outputs you want to evaluate against. This is useful if you’ve precomputed the outputs in some way.

qa_pairs = [
    QuestionAnswerPair(
        input_under_test="What is the MFN clause?",
        llm_output="The MFN clause is a clause in a contract that allows one party to modify the terms of the contract without the other party's consent.",
    )
]


run = await suite.run(
    model=qa_pairs, model_name="precomputed-outputs"
)

NOTE: If you are using this method, the input_under_test field in the QuestionAnswerPair must match the input_under_test field in the test suite. Likewise, if you are using either the context or file features, both the context and files must also match.
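
One way to guarantee the inputs line up is to build the pairs directly from the suite’s own tests. The precomputed_outputs dictionary below is a hypothetical placeholder for however you generated your outputs:

# Hypothetical placeholder: maps each test input to its precomputed output
precomputed_outputs = {
    "What is the MFN clause?": "No, the contract does not contain an MFN clause.",
}

qa_pairs = [
    QuestionAnswerPair(
        input_under_test=test.input_under_test,
        llm_output=precomputed_outputs[test.input_under_test],
    )
    for test in suite.tests
]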

Other Run Options

There are other parameters you can pass to the run() function to control its behavior, in addition to the model parameter. If you set wait_for_completion=True, the function will block until the run is complete (by default, it returns as soon as the run is started, not when it’s complete). You can also pass a run_name parameter to uniquely identify the run - this is useful if you’re starting many runs of the same test suite and need a way to disambiguate them.

Finally, you can also pass a RunParameters object to the run() function to control more aspects of the run. Some options include:

  • eval_model: The model to use as the LLM-as-judge
  • parallelism: The number of tests to run at once
  • heavyweight_factor: Run the auto eval multiple times and take the mode of the results
  • max_output_tokens: If using the first model option above, controls the maximum number of output tokens. Ignored if outputs are provided directly or a function is used.
  • system_prompt: If using the first model option above, provides a system prompt to the model.

Here’s an example:

run = await suite.run(
    model="openai/gpt-4o-mini",
    wait_for_completion=True,
    run_name="my-run-1",
    parameters=RunParameters(
        eval_model="openai/gpt-4o-mini",
        parallelism=10,
        heavyweight_factor=3,
    )
)

After a run is complete

Once a run is complete, the results are available on the Run object. You can access the results of each test in the test_results property, as well as the top-line pass rate, the URL, and other information.

run = await suite.run(...)

print(f"Status: {run.status}")
print(f"Run URL: {run.url}")
print(f"Pass rate: {run.pass_rate}")
print(f"Timestamp: {run.timestamp}")
print(f"Completed at: {run.completed_at}")


for i, test_result in enumerate(run.test_results):
    print(f"Test {i} Input: {test_result.input_under_test}")
    print(f"Test {i} Output: {test_result.llm_output}")
    # Can also access checks, context, files, auto eval, etc.

Loading a Test Suite from a file

Although it’s preferred to create a test suite with Python objects, a test suite can also be loaded from a local JSON file.

To create a test suite from a file, you can use the Suite.from_file() function.

suite = await Suite.from_file("path/to/test_suite.json")

Here is an example of what the test suite file looks like:

{
    "title": "My Test Suite",
    "description": "This is an example test suite.",
    "global_checks": [{"operator": "grammar"}],
    "tests": [
        {
            "input_under_test": "What is QSBS?",
            "checks": [{"operator": "includes", "criteria": "C Corporation"}]
        },
        {
            "input_under_test": "Does this contract have a MFN clause?",
            "context": {"user_email": "john.doe@example.com"},
            "files_under_test": ["path/to/file.docx"],
            "checks": [{"operator": "equals", "criteria": "No"}]
        }
    ]
}

Using Golden Outputs

In addition to checks, the SDK also supports golden outputs - the notion of a “right answer” for each input. Here’s an example:

suite = Suite(
    title="My Test Suite with Golden Outputs",
    tests=[{
        "input_under_test": "What is QSBS?",
        "checks": [],
        "golden_output": "QSBS stands for Qualified Small Business Stock, a designation in the U.S. tax code (Section 1202) that offers tax advantages to investors who hold eligible small business stock. If an individual holds QSBS for more than five years, they may be able to exclude up to 100% of the gains from the sale of the stock, subject to certain limits and qualifications. This tax incentive aims to encourage investments in small, innovative businesses. However, the stock must meet specific criteria, including being issued by a C-corporation in certain industries and having gross assets below $50 million when the stock was issued."
    }],
)

run = await suite.run(
    model="openai/gpt-4o-mini", 
    wait_for_completion=True, 
    parameters=RunParameters(run_golden_eval=True)
)
print("Run URL: ", run.url)