Once you have created a test suite, you will want to use it to actually test your model. There are a few ways to do this.

Running against a base LLM

If you want to run a test suite against a stock model (e.g. base gpt-4o), you can do so by clicking the “Start Run” button in the upper right of the test-suite page.

The first set of options allows you to control how the evaluation works.

The second set of options controls the model being tested. The dropdown lets you choose the model used to produce the outputs, and you can also adjust additional parameters for that model.

When run, the system will pull outputs from the model you chose and then run the checks against those outputs.
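
Conceptually, a run boils down to the loop sketched below (the function and argument names here are illustrative, not the platform’s API): generate an output for each test input, then score that output with every check.

```python
# Illustrative sketch of what a run does, not the platform's actual code.
def run_suite(test_cases, model_under_test, checks):
    """test_cases: list of {"input": ...}; checks: dict mapping check name -> check function."""
    results = []
    for case in test_cases:
        # Pull an output from the model selected in the run options.
        output = model_under_test(case["input"])
        # Run every check defined in the test suite against that output.
        check_results = {name: check(case["input"], output) for name, check in checks.items()}
        results.append({"input": case["input"], "output": output, "checks": check_results})
    return results
```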

Run by uploading Input/Output CSV

If you have already collected the outputs from your model and just want to run the checks against them, you can do so by choosing the “Upload CSV” option.

The CSV should have two columns: Input and Output. The Input column should match the inputs in your test suite, and the Output column should contain your model’s output for each of those inputs.
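
If you are producing this file from a script, something like the following (standard-library Python; the rows are made up for illustration) generates a CSV in the expected shape:

```python
import csv

# Example rows: Input values must match the inputs in your test suite, and
# Output values are whatever your model produced for them.
rows = [
    {"Input": "What is the capital of France?", "Output": "Paris is the capital of France."},
    {"Input": "Summarize the refund policy.", "Output": "Refunds are available within 30 days of purchase."},
]

with open("outputs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Input", "Output"])
    writer.writeheader()
    writer.writerows(rows)
```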

The system will then run the checks you’ve defined against the outputs in the CSV. Optionally, you can provide a “model name” to help you identify how the outputs in the CSV were produced.

Run using the SDK

If you have the model defined locally in a Python script, you can use the SDK to produce the outputs. See the SDK docs for more information.
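
Because the exact interface lives in the SDK docs, the snippet below is only a shape sketch: `sdk`, `Client`, and `run_test_suite` are hypothetical placeholder names, and the only real code is the locally defined model function that the SDK would call to produce outputs.

```python
# Hypothetical sketch: the SDK calls below are placeholders, not real package
# or method names; consult the SDK docs for the actual interface.

def my_model(prompt: str) -> str:
    # Your locally defined model. This stub just echoes the prompt so the
    # example stays self-contained.
    return f"Echo: {prompt}"

# A real run would register this callable with the SDK so the platform can pull
# outputs from it and then evaluate your checks, roughly along these lines:
#
#   client = sdk.Client(api_key="...")
#   client.run_test_suite(suite_id="my-suite", model=my_model, run_name="local-model-v1")
```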

Parameters

There are a few parameters that you can set to control how the run works.

Evaluation Parameters:

  • Evaluation Model: As part of our evaluation suites, we use LLMs under the hood for checks such as “includes”. By default, evaluation is done with GPT-4. However, using this dropdown, you can also choose Llama 2 (70B) or Mistral.
  • Parallelism: This controls how many tests run at any one time. Set it to a higher number to make the run finish faster; set it to a lower number if your model has limited capacity.
  • Heavyweight Factor: Runs each check multiple times to reduce variance in the results. For example, if it were set to 5, each check would be run 5 times and the final result would be the mode (see the sketch after this list).
  • Run Confidence Evaluation: Whether to compute a confidence score (high or low) for each check.
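
As a rough illustration of what the Parallelism and Heavyweight Factor settings mean (not the platform’s actual implementation), the sketch below repeats each check `heavyweight_factor` times and keeps the mode of the results, while a thread pool caps how many evaluations run concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mode

def run_check_repeatedly(check, test_input, output, heavyweight_factor=5):
    # Repeat the (possibly LLM-backed) check and keep the most common result
    # to reduce run-to-run variance.
    results = [check(test_input, output) for _ in range(heavyweight_factor)]
    return mode(results)

def run_all_checks(checks, cases, parallelism=4, heavyweight_factor=5):
    # The thread pool size plays the role of the Parallelism setting: it bounds
    # how many check evaluations are in flight at once.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [
            pool.submit(run_check_repeatedly, check, case["input"], case["output"], heavyweight_factor)
            for case in cases
            for check in checks
        ]
        return [f.result() for f in futures]
```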

Model Parameters:

  • Model: The model that is used to produce the outputs.
  • Temperature: The temperature of the model being tested.
  • Max Tokens: The maximum number of tokens the model being tested can output.
  • System Prompt: The system prompt to be passed to the model being tested.
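
These four settings map directly onto the arguments of a typical chat-completions request. As an illustration only (using an OpenAI-style client here as an assumption, not a statement about the platform’s internals):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o",      # Model: the model used to produce the outputs
    temperature=0.2,     # Temperature of the model being tested
    max_tokens=512,      # Max Tokens the model may generate for each output
    messages=[
        {"role": "system", "content": "You are a helpful support assistant."},  # System Prompt
        {"role": "user", "content": "How do I reset my password?"},             # one test-suite input
    ],
)
print(response.choices[0].message.content)
```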

You can also provide a “name” for the run; this is useful for keeping track of many different runs.