
Overview

Once you’ve created the test suite, you can run it with the run() function. This will run all the tests in the suite against your model. We support three different ways to produce outputs as you run the suite:
  1. Stock Model: We have a set of models on our platform that you can use, from the likes of OpenAI, Anthropic, Meta, etc.
  2. Function: You can provide us a function that takes in the input to the model (and optionally, files and context) and returns the output of your custom model.
  3. Provide Outputs: You can provide us a list of input/output pairs, and we will run the evaluation against these outputs directly.

1. Running with a Stock Model

The code below evaluates how gpt-4o-mini performs on the tests you’ve defined.
# create a suite using the steps above

# run the suite with a stock model
run = await suite.run(model="openai/gpt-4o-mini")
print(f"Run URL: {run.url}")

2. Running with Custom Function

Basic Function

You can also provide a custom function - it can contain RAG pipelines, prompt chains, agentic behavior, etc. For example, here’s a naive model that produces output in the style of a pirate.
import os

from openai import OpenAI

gpt_client = OpenAI(api_key=os.environ.get("OPEN_AI_KEY"))

def model_func(test_input: str) -> str:
    """ Arbitrary function to represent your 'model' - including RAG pipelines, prompt chains, etc. """
    prompt = "You are a pirate, answer in the speaking style of a pirate.\n\n"
    temp = 0.2

    response = gpt_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt + test_input}],
        temperature=temp,
    )
    return response.choices[0].message.content

# Run the model against the suite, and provide a name for the model for record keeping.
run = await suite.run(model=model_func, model_name="pirate-model-v1")

Function with Context and Files

If you’re using context and files, you’ll probably want them available to your model function. You can do this by adding them as parameters to your function.
from io import BytesIO
from typing import Any

from vals.sdk.util import read_pdf

def model_func(test_input: str, files: dict[str, BytesIO], context: dict[str, Any]):
    # Access context (e.g. message history)
    message_history = context["message_history"]


    # Access files
    for filename, file_content in files.items():
        # NOTE: You can also use your own file parsing / OCR logic here
        file_text = read_pdf(file_content)

    # Query your model
    llm_output = ...

    return llm_output
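
As in the basic example, you then pass this function to run(); the model_name below is just an illustrative label for record keeping:
run = await suite.run(model=model_func, model_name="context-files-model-v1")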

3. Provide Outputs Directly

You can also provide us a list of the outputs you want to evaluate. This is useful if you’ve already generated the outputs elsewhere.
qa_pairs = [
    QuestionAnswerPair(
        input_under_test="What is the MFN clause?",
        llm_output="The MFN clause is a clause in a contract that allows one party to modify the terms of the contract without the other party's consent.",
    )
]


run = await suite.run(
    model=qa_pairs, model_name="precomputed-outputs"
)
NOTE: If you are using this method, the input_under_test field in each QuestionAnswerPair must match the input_under_test field of a test in the suite. Likewise, if you are using the context or files features, the context and files must also match.
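
For example, if you already have precomputed outputs keyed by each test’s input_under_test, you can build the pairs programmatically. This is just a sketch: the question and output below are placeholders, and the keys must exactly match the inputs in your suite.
# Hypothetical mapping from each test's input_under_test to a precomputed output.
precomputed_outputs = {
    "What is the MFN clause?": "The MFN clause is ...",
    # ... one entry per test in the suite
}

qa_pairs = [
    QuestionAnswerPair(input_under_test=question, llm_output=output)
    for question, output in precomputed_outputs.items()
]

run = await suite.run(model=qa_pairs, model_name="precomputed-outputs-v2")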

4. Provide Custom Operators

You can pass in custom operators to evaluate model outputs using your own criteria.
async def custom_operator(input: OperatorInput) -> OperatorOutput:
    """
    A simple custom operator that checks whether the model output contains the word 'yes'.
    """
    contains_yes = "yes" in input.model_output.lower()
    score = 1.0 if contains_yes else 0.0
    explanation = (
        "The output contains the word 'yes'." if contains_yes
        else "The output does not contain the word 'yes'."
    )

    return OperatorOutput(
        name="contains_yes_check",
        score=score,
        explanation=explanation
    )

async def custom_model(input: str) -> str:
    return "Yes! yes! yes!"

run = await suite.run(
    model=custom_model,
    custom_operators=[custom_operator, custom_operator2],
)

Run Options

There are other parameters you can pass to the run() function to control its behavior, in addition to the model parameter. If you set wait_for_completion=True, the function will block until the run is complete (by default, it returns as soon as the run is started, not when it’s complete). You can also pass a run_name parameter to uniquely identify the run - this is useful if you’re starting many runs of the same test suite and need a way to disambiguate them. Finally, you can pass a RunParameters object to the run() function to control more aspects of the run. Some options include:
  • eval_model: The model to use as the LLM-as-judge
  • parallelism: The number of tests to run at once
  • heavyweight_factor: Run the auto eval multiple times and take the mode of the results
  • max_output_tokens: If using a stock model (the first option above), controls the maximum number of output tokens. Ignored if outputs are provided directly or if you’re using a function.
  • system_prompt: If using a stock model (the first option above), provide a system prompt to the model.
  • except_on_error: Raise an exception if the run fails.
  • custom_parameters: Custom parameters to pass to the model. These will be shown on the run result page, even when running with a custom function.

Examples

run = await suite.run(
    model="openai/gpt-4o-mini",
    wait_for_completion=True,
    run_name="my-run-1",
    parameters=RunParameters(
        eval_model="openai/gpt-4o-mini",
        parallelism=10,
        heavyweight_factor=3,
    )
)
run = await suite.run(
    model="openai/gpt-4o-mini",
    wait_for_completion=True,
    run_name="my-run-1",
    parameters=RunParameters(
        parallelism=10,
        max_output_tokens=2048,
        custom_parameters={"top_p": 0.5},
    ),
    except_on_error=True
)

After a Run is Complete

Once a run is complete, you can access the results through the Run object: the results of each test are in the test_results property, along with the top-line pass rate, the URL, and other information.
run = await suite.run(...)

print(f"Status: {run.status}")
print(f"Run URL: {run.url}")
print(f"Pass rate: {run.pass_rate}")
print(f"Timestamp: {run.timestamp}")
print(f"Completed at: {run.completed_at}")


for i, test_result in enumerate(run.test_results):
    print(f"Test {i} Input: {test_result.input_under_test}")
    print(f"Test {i} Output: {test_result.llm_output}")
    # Can also access checks, context, files, auto eval, etc.