Setup
Make sure you have the Vals Python Package installed.
Creating a Test Suite with the SDK
In the SDK, every construct is generally represented as a Python object (constructed with Pydantic). To create a test suite, you first create a Suite object, then call create(). For example:
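A minimal sketch of suite creation. The import paths and the exact Test/Check field names are assumptions based on the descriptions in this guide:

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import Test, Check

async def main():
    suite = Suite(
        title="Example Suite",
        tests=[
            Test(
                input_under_test="What is the capital of France?",
                checks=[Check(operator="equals", criteria="Paris")],
            ),
        ],
    )
    # Persists the suite to the platform; all SDK calls are async.
    await suite.create()

asyncio.run(main())
```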
Note on Async: All SDK functions are asynchronous, so you will need to call them from an asynchronous context. See the async docs for more information.
Tests with files
Our system also supports testing files as input. For example, you may want to test a model’s ability to answer questions about a contract, or extract information from an image. To add files to a test, you can do the following:
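A hedged sketch of attaching files to a test; the files_under_test parameter name and check fields are assumptions based on the description above:

```python
from vals.sdk.suite import Suite
from vals.sdk.types import Test, Check

suite = Suite(
    title="Contract QA",
    tests=[
        Test(
            input_under_test="What is the termination clause in this contract?",
            # Paths to local files uploaded alongside the test input
            # (parameter name is an assumption).
            files_under_test=["contracts/example_contract.pdf"],
            checks=[Check(operator="includes", criteria="termination")],
        ),
    ],
)
```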
Adding Context
We also support adding arbitrary information to the input of each test, in addition to the input and the files. For example, you may want to provide a chat history to the model, provide information about the user who asked the question, specify where in an application the question was asked, etc. You can provide this with the context parameter of the Test:
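For example (a sketch; field names other than context are assumptions):

```python
from vals.sdk.types import Test, Check

test = Test(
    input_under_test="What is my current plan?",
    # Context values may be raw strings or JSON-serializable objects.
    context={
        "user": {"id": "u_123", "plan": "enterprise"},
        "chat_history": "User previously asked about billing.",
    },
    checks=[Check(operator="includes", criteria="enterprise")],
)
```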
NOTE: Context field values can be either raw strings or JSON objects. If it is a JSON object, it will be parsed correctly and pretty-printed in the UI.
Adding Tags
You can also add tags to a test. These tags are searchable in the test suite and run result, and you can see a performance breakdown by tag.
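A brief sketch, assuming Test accepts a tags parameter as a list of strings:

```python
from vals.sdk.types import Test, Check

test = Test(
    input_under_test="Summarize the attached contract.",
    # Tags are searchable in the suite and run results.
    tags=["contracts", "summarization"],
    checks=[Check(operator="grammar")],
)
```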
Adding Global Checks
If you want certain checks to be run on every test, you can add them to the suite with the global_checks parameter. For example, this is how you would check the grammar of every test by default.
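A sketch of a suite-level grammar check, assuming global_checks takes the same Check objects used on individual tests:

```python
from vals.sdk.suite import Suite
from vals.sdk.types import Test, Check

suite = Suite(
    title="Suite with Global Checks",
    # Applied to every test in the suite, in addition to each test's own checks.
    global_checks=[Check(operator="grammar")],
    tests=[
        Test(input_under_test="Explain QSBS in two sentences.", checks=[]),
    ],
)
```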
Advanced Check Modifiers
Each check has a set of modifiers that can be used to change its behavior (see the sketch after this list):
- severity: Allows you to weight some checks higher than others
- examples: Allows you to provide in-context examples of outputs that should pass or fail
- extractor: Allows you to extract items from the output before the check is evaluated
- conditional: Allows you to only run the check if another check evaluates to true
- category: Allows you to override the default category of the check (correctness, formatting, etc.). This is similar to tags, but operates at a more granular level.
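A hedged sketch of attaching modifiers to a check; the shape of the modifiers object (a CheckModifiers type with these field names) is an assumption:

```python
from vals.sdk.types import Check, CheckModifiers

check = Check(
    operator="includes",
    criteria="refund policy",
    modifiers=CheckModifiers(
        severity=2.0,           # weight this check more heavily than others
        category="correctness", # override the default category
    ),
)
```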
Downloading / Pulling a Test Suite
If a test suite is already in the platform, you can pull it locally to edit or save it. Just copy the suite ID from the test suite page (or from the last portion of the test suite URL), then call Suite.from_id:
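For example (a sketch; the suite ID is a placeholder):

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    # Replace with the ID copied from the test suite page or URL.
    suite = await Suite.from_id("your-suite-id")
    print(suite.title)
    for test in suite.tests:
        print(test.input_under_test)

asyncio.run(main())
```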
Updating a Test Suite
You can also update a test suite that you have pulled locally. For example, let’s say you want to change the global checks of a suite. You can do this as follows:
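A sketch of pulling a suite, modifying it, and pushing the change back, assuming an async update() method consistent with the patterns above:

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import Check

async def main():
    suite = await Suite.from_id("your-suite-id")
    # Change the checks applied to every test in the suite.
    suite.global_checks = [Check(operator="grammar")]
    # Push the local changes back to the platform.
    await suite.update()

asyncio.run(main())
```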
Running a Test Suite
Once you’ve created the test suite, you can run it with the run() function. This will run all the tests in the suite against your model. We support three different ways to produce outputs as you run the suite:
- Stock Model: We have a set of models on our platform that you can use, from the likes of OpenAI, Anthropic, Meta, etc.
- Function: You can provide a function that takes in the input to the model (and optionally, files and context) and returns the output of your custom model.
- Provide Outputs: You can provide a list of input/output pairs, and we will run the evaluation against these outputs directly.
1. Running with stock model
The code below will evaluate how gpt-4o-mini performs on the tests you’ve defined:
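A minimal sketch, assuming run() accepts the model name as a string:

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    suite = await Suite.from_id("your-suite-id")
    # Runs every test in the suite against a stock model hosted on the platform.
    run = await suite.run(model="gpt-4o-mini")
    print(run.url)

asyncio.run(main())
```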
2a. Running with function (basic)
You can also provide a custom function - this can contain any RAG pipelines, prompt chains, agentic behavior, etc. For example, here’s a naive model that produces output in the style of a pirate:
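A sketch of the pirate example. The expected signature (a function from the input string to an output string, passed as the model argument) is an assumption:

```python
import asyncio

from vals.sdk.suite import Suite

def pirate_model(input_under_test: str) -> str:
    # Any custom logic can live here: RAG pipelines, prompt chains, agents, etc.
    return f"Arr matey! Ye asked: {input_under_test}. The answer be buried treasure."

async def main():
    suite = await Suite.from_id("your-suite-id")
    # Pass the function in place of a stock model name.
    run = await suite.run(model=pirate_model)
    print(run.url)

asyncio.run(main())
```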
2b. Running with function (with context and files)
If you’re using context and files, you probably want them available to your model function. You can do this by adding them as parameters to your function:
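A hedged sketch; the parameter names (files, context) and their types are assumptions based on the description above:

```python
import asyncio

from vals.sdk.suite import Suite

def model_with_context(input_under_test: str, files: dict, context: dict) -> str:
    # `files` and `context` mirror what was attached to each Test
    # (types are assumptions).
    user = context.get("user", {})
    return f"Answering for the {user.get('plan', 'unknown')} plan: {input_under_test}"

async def main():
    suite = await Suite.from_id("your-suite-id")
    run = await suite.run(model=model_with_context)
    print(run.url)

asyncio.run(main())
```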
3. Provide outputs directly
You can also provide a list of the outputs you want to evaluate against. This is useful if you’ve already generated the outputs in some form.
NOTE: If you are using this method, the input_under_test field in the QuestionAnswerPair must match the input_under_test field in the test suite. Likewise, if you are using either the context or file features, both the context and files must also match.
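A sketch consistent with the note above. The llm_output field name and passing the pairs via the model argument are assumptions:

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import QuestionAnswerPair

async def main():
    suite = await Suite.from_id("your-suite-id")
    qa_pairs = [
        QuestionAnswerPair(
            # Must match the input_under_test of the corresponding test in the suite.
            input_under_test="What is the capital of France?",
            llm_output="Paris",  # field name is an assumption
        ),
    ]
    # Evaluation runs directly against the provided outputs.
    run = await suite.run(model=qa_pairs)
    print(run.url)

asyncio.run(main())
```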
4. Provide custom operators
You can pass in custom operators to evaluate model outputs using your own criteria.
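A heavily hedged sketch: the custom_operators argument and the operator signature shown here are hypothetical, intended only to illustrate supplying your own evaluation logic.

```python
import asyncio

from vals.sdk.suite import Suite

def no_apologies(output: str) -> bool:
    # Hypothetical custom operator: passes only if the output never apologizes.
    return "sorry" not in output.lower()

async def main():
    suite = await Suite.from_id("your-suite-id")
    # Hypothetical argument name; the real signature may differ.
    run = await suite.run(model="gpt-4o-mini", custom_operators=[no_apologies])
    print(run.url)

asyncio.run(main())
```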
Other Run Options
There are other parameters you can pass to the run() function to control its behavior, in addition to the model parameter. If you set wait_for_completion=True, the function will block until the run is complete (by default, it returns as soon as the run is started, not when it’s complete). You can also pass a run_name parameter to uniquely identify the run - this is useful if you’re starting many runs of the same test suite and need a way to disambiguate them.
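For example (a sketch using the parameters just described):

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    suite = await Suite.from_id("your-suite-id")
    run = await suite.run(
        model="gpt-4o-mini",
        wait_for_completion=True,       # block until the run finishes
        run_name="nightly-regression",  # label to disambiguate repeated runs
    )
    print(run.url)

asyncio.run(main())
```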
Finally, you can also pass a RunParameters object to the run() function to control more aspects of the run. Some options include (see the sketch after this list):
- eval_model: The model to use as the LLM-as-judge
- parallelism: The number of tests to run at once
- heavyweight_factor: Run the auto eval multiple times and take the mode of the results
- max_output_tokens: If using the first model option above, control the max output tokens. Ignored if outputs are provided directly or via a function.
- system_prompt: If using the first model option above, provide a system prompt to the model.
- except_on_error: Raise an exception if the run fails.
- custom_parameters: Custom parameters to pass to the model. These are shown in the run result page, even when running with a function.
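A sketch, assuming RunParameters is importable from the SDK's types and is passed via a parameters-style argument (the argument name is an assumption):

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import RunParameters

async def main():
    suite = await Suite.from_id("your-suite-id")
    params = RunParameters(
        eval_model="gpt-4o",     # LLM-as-judge model
        parallelism=4,           # number of tests evaluated concurrently
        heavyweight_factor=3,    # repeat the auto eval and take the mode
        system_prompt="Answer concisely.",
    )
    # Argument name is an assumption; it may differ in the SDK.
    run = await suite.run(model="gpt-4o-mini", parameters=params)
    print(run.url)

asyncio.run(main())
```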
After a run is complete
Once a run is complete, you can access the results in the Run object. You can access the results of each test in the test_results property, as well as the top-line pass rate, the URL, and other information.
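A sketch of reading results off the Run object; attribute names beyond test_results and the URL (pass_rate, llm_output) are assumptions:

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    suite = await Suite.from_id("your-suite-id")
    run = await suite.run(model="gpt-4o-mini", wait_for_completion=True)

    print(f"Pass rate: {run.pass_rate}")  # top-line pass rate (name assumed)
    print(f"Results:   {run.url}")
    for result in run.test_results:
        print(result.input_under_test, result.llm_output)  # field names assumed

asyncio.run(main())
```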
Human Review
The Human Review system allows you to queue test runs for manual evaluation by human reviewers. This provides a way to validate model outputs beyond automated checks.
Adding a Run to Review Queue
Queue a run for human review using the add_to_queue() method:
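A sketch using the parameters documented below; that add_to_queue() is called on the Run object and is async are assumptions consistent with the rest of the SDK:

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    suite = await Suite.from_id("your-suite-id")
    run = await suite.run(model="gpt-4o-mini", wait_for_completion=True)

    # Queue the completed run for human review.
    await run.add_to_queue(
        assigned_reviewers=["reviewer@example.com"],  # empty list allows any reviewer
        number_of_reviews=2,                          # reviews per test
        rereview_auto_eval=True,                      # re-run auto eval after reviews
    )

asyncio.run(main())
```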
- assigned_reviewers - List of reviewer email addresses (empty list allows any reviewer)
- number_of_reviews - How many reviewers will evaluate each test (default: 1)
- rereview_auto_eval - Whether to re-run auto-evaluation after reviews (default: True)
Working with Reviews
Once a run is queued, you can access the review through the run.review cached property:
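A sketch of reading review properties, following the list below (whether test_results requires await is taken from that list):

```python
async def inspect_review(run) -> None:
    # Call this from an async context, e.g. via asyncio.run().
    review = run.review

    print(review.id)                         # same as run.review_id
    print(review.status)                     # Pending, Archived, or Completed
    print(review.pass_rate_human_eval)       # pass rate across human reviews
    print(review.agreement_rate_human_eval)  # agreement between human reviewers

    # test_results is a cached property that requires await.
    results = await review.test_results
    print(len(results))
```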
- id - Same as run.review_id
- status - Current review status (Pending, Archived, or Completed)
- pass_rate_human_eval - Pass rate across all human reviews
- agreement_rate_human_eval - Agreement rate between human reviewers
- test_results - List of completed test results (cached property, requires await)
Working with Test Results
Access individual test result reviews to get detailed feedback:
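A sketch that iterates the review's test results and reads the properties listed below; exactly how the review objects nest is an assumption inferred from those descriptions:

```python
async def inspect_test_reviews(review) -> None:
    # Call this from an async context; test_results requires await.
    for test_result in await review.test_results:
        print(test_result.test.input_under_test)  # the original test being reviewed
        print(test_result.reviewed_by)            # reviewer email addresses
        for rev in test_result.reviews:
            print(rev.feedback)                   # optional reviewer feedback
            print(rev.completed_by, rev.completed_at)
```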
Each test result review exposes the following properties:
- reviewed_by - List of reviewer email addresses
- reviews - List of all reviews for this test
- test - The original test being reviewed
- check_results - Auto-evaluated check results
Each individual review in the reviews list provides:
- feedback - Optional reviewer feedback
- completed_by - Reviewer who completed this review
- completed_at / started_at - Timestamps
- auto_eval_review_values - Human validation of auto-evaluations
- custom_review_values - Custom template review data
Loading a Test Suite from a file
Although it’s preferred to create a test suite with Python objects, the test suite can also be loaded from a local JSON file. To create a test suite from a file, you can use the Suite.from_file() function.
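A sketch, assuming the JSON file mirrors the Suite fields shown earlier (title, tests, and so on) and that from_file is async like the rest of the SDK:

```python
import asyncio

from vals.sdk.suite import Suite

async def main():
    # The JSON file is expected to mirror the Suite fields (title, tests, ...).
    suite = await Suite.from_file("suite.json")
    await suite.create()

asyncio.run(main())
```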
Using “Right Answers”
To add tests with right answers, just use the golden_answer field in the test. A full example is as follows:
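A sketch of a full example with a golden answer; everything other than the golden_answer field follows the assumed shapes used earlier in this guide:

```python
import asyncio

from vals.sdk.suite import Suite
from vals.sdk.types import Test, Check

async def main():
    suite = Suite(
        title="Suite with Golden Answers",
        tests=[
            Test(
                input_under_test="What is 2 + 2?",
                # The known-correct answer used as a reference during evaluation.
                golden_answer="4",
                checks=[Check(operator="grammar")],
            ),
        ],
    )
    await suite.create()

asyncio.run(main())
```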